Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data

Bibliographic Details
Published in: Computational Statistics, Vol. 38, No. 2, pp. 647-674
Main Authors: Weisser, Christoph; Gerloff, Christoph; Thielmann, Anton; Python, Andre; Reuter, Arik; Kneib, Thomas; Säfken, Benjamin
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg (Springer Nature B.V.), 01.06.2023
ISSN: 0943-4062
eISSN: 1613-9658
DOI: 10.1007/s00180-022-01246-z

Summary: Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the performance of the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that standard coherence scores that are often used for the evaluation of topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.
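
The abstract names pseudo-document simulation as the evaluation method but this record does not spell out the procedure. As a rough, hypothetical illustration only (not the authors' implementation), the following Python sketch generates short, tweet-like pseudo-documents from known topic-word distributions, drawing a single topic per document as the GSDMM and GPM mixture models assume, and fits a standard LDA with gensim to check how well the planted topics are recovered. All parameter values, variable names, and the recovery check are illustrative assumptions.

    # Hypothetical sketch, not the paper's implementation: plant known
    # topics, sample short pseudo-documents from them, then see whether
    # a fitted LDA recovers the planted word blocks.
    import numpy as np
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    rng = np.random.default_rng(0)
    n_topics, vocab_size, n_docs, doc_len = 3, 50, 500, 8  # tweet-length docs

    # True topic-word distributions: each topic concentrates on its own
    # block of 15 words, with a small uniform background mass elsewhere.
    beta = np.full((n_topics, vocab_size), 0.01)
    for k in range(n_topics):
        beta[k, k * 15:(k + 1) * 15] += 1.0
    beta /= beta.sum(axis=1, keepdims=True)

    vocab = [f"w{i}" for i in range(vocab_size)]
    docs = []
    for _ in range(n_docs):
        k = rng.integers(n_topics)            # one topic per short document
        words = rng.choice(vocab_size, size=doc_len, p=beta[k])
        docs.append([vocab[w] for w in words])

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(corpus, id2word=dictionary, num_topics=n_topics,
                   passes=10, random_state=0)

    # Inspect whether the recovered topics match the planted word blocks.
    for k, topic in lda.show_topics(num_topics=n_topics, num_words=5,
                                    formatted=False):
        print(k, [w for w, _ in topic])

Because each pseudo-document carries exactly one planted topic, this sampling scheme sits in the regime where, per the abstract, single-topic mixture models such as GSDMM and GPM would be expected to recover cleaner topics than standard LDA; a GSDMM or GPM fit could be swapped in for the LdaModel call to reproduce the comparison in spirit.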