Research on a data mining algorithm based on BERTopic for medication rules in Traditional Chinese Medicine prescriptions


Bibliographic Details
Published in Medicine Advances Vol. 1; no. 4; pp. 353-360
Main Authors Li, Hongchen; Lu, Xinyi; Wu, Yujia; Luo, Jie
Format Journal Article
Language English
Published Guangzhou John Wiley & Sons, Inc 01.12.2023
Wiley
ISSN 2834-443X
2834-4391
2834-4405
DOI 10.1002/med4.39

Summary: Background: A data mining algorithm based on BERTopic is proposed to provide new insights into the analysis of medication rules in Traditional Chinese Medicine (TCM) prescriptions. Methods: Using the BERTopic algorithm, collected TCM prescriptions for corneal diseases are converted to embeddings through a transformer based on the Bidirectional Encoder Representations from Transformers (BERT) pre‐trained model. Then, Uniform Manifold Approximation and Projection (UMAP) is applied to reduce the dimensionality of the prescription embeddings. Subsequently, Hierarchical Density‐Based Spatial Clustering of Applications with Noise (HDBSCAN) is used for clustering. Finally, class‐based term frequency–inverse document frequency (c‐TF‐IDF) is used to generate several main drug combinations from the clustered results. Results: The most frequently used drugs included Buddleja officinalis, Bidens pilosa, Angelica sinensis, Eriocaulon buergerianum, and Raw Rehmannia glutinosa. The most frequent drug combinations were “Eriocaulon buergerianum, Raw Rehmannia glutinosa, Prunella vulgaris, Notopterygium incisum,” “Lycii Fructus, Bidens pilosa, Buddleja officinalis,” and “Kochiae Fructus, Cortex Dictamni.” Conclusions: The proposed data mining algorithm based on BERTopic demonstrated promising outcomes in the analysis of TCM prescription medication rules. The method proved simple and efficient, thereby offering a novel avenue for analysis. Our study uses the BERTopic algorithm to extract core prescriptions and then analyzes the correlation between drugs based on their frequency. First, prescriptions are treated as documents and embedded using the BERT pre‐trained model. Next, UMAP is used to reduce the dimensionality of these embedded prescriptions. The reduced‐dimension embeddings are then clustered via HDBSCAN. Finally, core drugs are generated as topic words of the document using c‐TF‐IDF.
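The last step of the pipeline described in the summary, ranking drugs per cluster with c‐TF‐IDF, can be illustrated with a minimal sketch. This is not the authors' implementation, which this record does not show. It assumes the embedding, UMAP, and HDBSCAN steps have already assigned each prescription to a cluster, and it uses the weighting given in the BERTopic documentation (term frequency within the cluster, multiplied by log(1 + A / f_t), where A is the average number of terms per cluster and f_t is the term's frequency across all clusters). The function name `c_tf_idf`, the `toy_clusters` data, and the placeholder names `herb_a` to `herb_x` are hypothetical and do not come from the study.

```python
import math
from collections import Counter


def c_tf_idf(clusters, top_n=3):
    """Rank drugs per cluster with a class-based TF-IDF score.

    clusters: dict mapping a cluster id to a list of prescriptions,
              where each prescription is a list of drug names.
    Returns a dict mapping each cluster id to its top_n drug names.
    """
    # Treat every cluster as one merged document and count its drugs.
    counts = {
        cid: Counter(drug for rx in prescriptions for drug in rx)
        for cid, prescriptions in clusters.items()
    }

    # f_t: how often each drug occurs across all clusters.
    total = Counter()
    for cluster_counts in counts.values():
        total.update(cluster_counts)

    # A: average number of drug mentions per cluster.
    avg_terms = sum(total.values()) / len(counts)

    top = {}
    for cid, cluster_counts in counts.items():
        size = sum(cluster_counts.values())
        scores = {
            drug: (n / size) * math.log(1 + avg_terms / total[drug])
            for drug, n in cluster_counts.items()
        }
        # Sort by descending score; break ties alphabetically for stability.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        top[cid] = [drug for drug, _ in ranked[:top_n]]
    return top


# Toy data with placeholder names (hypothetical, not from the paper).
toy_clusters = {
    0: [["herb_a", "herb_b", "herb_x"], ["herb_a", "herb_b"], ["herb_a", "herb_x"]],
    1: [["herb_c", "herb_d", "herb_x"], ["herb_c", "herb_d"]],
}

top_drugs = c_tf_idf(toy_clusters, top_n=2)
print(top_drugs)
```

In this toy input, `herb_x` appears in both clusters, so its score is reduced relative to drugs that are specific to one cluster. That is the property that lets c‐TF‐IDF surface cluster‐specific drug combinations rather than drugs that are simply common everywhere.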