An Enhanced BERTopic Framework and Algorithm for Improving Topic Coherence and Diversity
| Published in | 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), pp. 2251 - 2257 |
|---|---|
| Main Authors | , , , , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.12.2022 |
| DOI | 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00332 |
| Summary: | In this paper, we enhance and customize the existing BERTopic framework to develop and implement an automated pipeline that delivers a more coherent and diverse set of topics, even with a moderate-sized dataset. More specifically, the contributions of this work are threefold: (1) integrate a dynamic, advanced optimizer into the existing BERTopic framework to learn the optimal number of dimensions of different document embeddings, (2) develop a k-means-based algorithm in the optimizer to support the dimension-embedding learning, and (3) conduct an extensive experimental study on three distinct types of datasets, including DBPedia, AG News, and Reuters, to evaluate the performance of our approach in terms of the topic quality (TQ) score, computed from the topic coherence and the topic diversity. From the results, we conclude that our enhanced, automated BERTopic framework with its dimension-embedding learning algorithm outperforms the TQ score of the existing framework by 4.49% (before removing stop words) and 16.52% (after removing stop words) across all four representative document-embedding approaches, including BERTopic's default Sentence Transformer, Google's Universal Sentence Encoder, OpenAI GPT-2, and the Context-aware Embedding Model developed by our investigators, on all three datasets. |
|---|---|
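The summary evaluates topics by a topic quality (TQ) score combining topic coherence and topic diversity. As a minimal sketch of how such a score can be computed (the paper's exact formula is not given in this record): topic diversity is often defined as the fraction of unique words among the top-k words of all topics, and TQ as the product of coherence and diversity. The sample topics and coherence value below are illustrative assumptions, not results from the paper.

```python
def topic_diversity(topics):
    """Fraction of unique words across all topics' top-k word lists.

    A value of 1.0 means no word is shared between topics; lower values
    indicate redundant topics.
    """
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)

def topic_quality(coherence, diversity):
    """TQ as the product of coherence and diversity (a common convention;
    the paper's exact combination may differ)."""
    return coherence * diversity

# Illustrative top-4 word lists for three topics (hypothetical data).
topics = [
    ["game", "team", "season", "player"],
    ["market", "stock", "price", "trade"],
    ["market", "election", "vote", "policy"],  # shares "market" with topic 2
]

div = topic_diversity(topics)   # 11 unique words out of 12 total
tq = topic_quality(0.55, div)   # 0.55 is a placeholder coherence value
```

A higher TQ rewards models whose topics are both internally coherent and mutually distinct, which is why the abstract reports it as a single headline metric.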