An Enhanced BERTopic Framework and Algorithm for Improving Topic Coherence and Diversity
| Published in | 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), pp. 2251 - 2257 |
|---|---|
| Main Authors | , , , , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.12.2022 |
| DOI | 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00332 |
| Summary: | In this paper, we enhance and customize the existing BERTopic framework to develop and implement an automated pipeline that delivers a more coherent and diverse set of topics, even with a moderate-sized dataset. More specifically, the contributions of this work are threefold: (1) integrate a dynamic, advanced optimizer into the existing BERTopic framework to learn the optimal number of dimensions of different document embeddings, (2) develop a k-means-based algorithm in the optimizer to support the dimension-embedding learning, and (3) conduct an extensive experimental study on three distinct types of datasets, including DBPedia, AG News, and Reuters, to evaluate the performance of our approach in terms of the topic quality (TQ) score, computed from the topic coherence and the topic diversity. From the results, we conclude that our enhanced, automated BERTopic framework with its dimension-embedding learning algorithm outperforms the TQ score of the existing framework by 4.49% (before removing stop words) and 16.52% (after removing stop words) across all four representative document-embedding approaches, including BERTopic's default Sentence Transformer, Google's Universal Sentence Encoder, OpenAI GPT-2, and the Context-aware Embedding Model developed by our investigators, on all three datasets. |
|---|---|
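The summary evaluates topics by a topic quality (TQ) score combining topic coherence and topic diversity. As a minimal sketch of how such a score can be computed (the paper's exact formula is not given in this record): topic diversity is often defined as the fraction of unique words among the top-k words of all topics, and TQ as the product of coherence and diversity. The sample topics and coherence value below are illustrative assumptions, not results from the paper.

```python
def topic_diversity(topics):
    """Fraction of unique words across all topics' top-k word lists.

    A value of 1.0 means no word is shared between topics; lower values
    indicate redundant topics.
    """
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)

def topic_quality(coherence, diversity):
    """TQ as the product of coherence and diversity (a common convention;
    the paper's exact combination may differ)."""
    return coherence * diversity

# Illustrative top-4 word lists for three topics (hypothetical data).
topics = [
    ["game", "team", "season", "player"],
    ["market", "stock", "price", "trade"],
    ["market", "election", "vote", "policy"],  # shares "market" with topic 2
]

div = topic_diversity(topics)   # 11 unique words out of 12 total
tq = topic_quality(0.55, div)   # 0.55 is a placeholder coherence value
```

A higher TQ rewards models whose topics are both internally coherent and mutually distinct, which is why the abstract reports it as a single headline metric.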