Data stream topic feature extraction method, apparatus, device, and storage medium


Bibliographic Details
Format	Patent
Language	Chinese
Published	13.06.2023
Subjects	CALCULATING; COMPUTING; COUNTING | ELECTRIC DIGITAL DATA PROCESSING | PHYSICS

Summary:	The data stream topic feature extraction method provided by the invention uses an LDA model whose vocabulary does not have a fixed number of words. The model's topic-word distributions follow a Dirichlet process with a non-fixed number of atoms, rather than a Dirichlet distribution with a fixed number of atoms. As a result, when the model encounters a word that does not appear in the vocabulary, it can add that word to the vocabulary and continue executing the algorithm. By continually encountering and adding new words, the method makes full use of the available information without increasing memory pressure, fits the LDA model's vocabulary more closely to the corpus being processed, improves the accuracy of the model, and strengthens the ability of the online LDA algorithm to process data streams. The invention also discloses a data stream topic feature extraction apparatus, a device, and a readable storage medium, which have the same beneficial effects.
Bibliography:	Application Number: CN201811641140
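
The vocabulary-growth idea in the summary can be illustrated with a small Python sketch. This is not the patented algorithm, which the abstract does not describe in detail. It only shows, under stated assumptions, how per-topic word counts can accept previously unseen words, and how a Dirichlet process reserves probability mass for words not yet in the vocabulary through the standard Chinese restaurant process predictive rule. The class name `GrowingTopicWordCounts`, its methods, and the `concentration` parameter are all hypothetical.

```python
from collections import defaultdict


class GrowingTopicWordCounts:
    """Per-topic word counts over a vocabulary that can grow (illustrative only)."""

    def __init__(self, num_topics, concentration=1.0):
        self.num_topics = num_topics
        # Assumed Dirichlet process concentration parameter, shared by all topics.
        self.concentration = concentration
        # Vocabulary: word -> integer id, extended whenever a new word arrives.
        self.word_to_id = {}
        # Sparse counts: only (topic, word) pairs that were observed use memory.
        self.counts = [defaultdict(float) for _ in range(num_topics)]
        self.totals = [0.0] * num_topics

    def word_id(self, word):
        """Return the id of a word, adding it to the vocabulary if unseen."""
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.word_to_id)
        return self.word_to_id[word]

    def observe(self, topic, word, weight=1.0):
        """Record that `word` was assigned to `topic` with the given weight."""
        wid = self.word_id(word)
        self.counts[topic][wid] += weight
        self.totals[topic] += weight

    def prob_seen(self, topic, word):
        """Predictive probability of a word under the topic's current counts."""
        wid = self.word_to_id.get(word)
        if wid is None:
            return 0.0
        count = self.counts[topic].get(wid, 0.0)
        return count / (self.totals[topic] + self.concentration)

    def prob_new(self, topic):
        """Probability mass the topic reserves for words it has not yet seen."""
        return self.concentration / (self.totals[topic] + self.concentration)


# Usage: feed a tiny stream of (topic, word) assignments.
model = GrowingTopicWordCounts(num_topics=2)
for word in ["stream", "topic", "stream"]:
    model.observe(0, word)
model.observe(1, "lda")  # a new word extends the vocabulary on the fly

print(len(model.word_to_id))
print(model.prob_seen(0, "stream"), model.prob_new(0))
```

Because the counts are stored sparsely, adding a word costs one dictionary entry rather than a resized dense matrix. A full online LDA implementation would also need an inference step that assigns topics to words, which this sketch leaves out.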