Research on a data mining algorithm based on BERTopic for medication rules in Traditional Chinese Medicine prescriptions
| Published in | Medicine Advances Vol. 1; no. 4; pp. 353-360 |
|---|---|
| Main Authors | , , , | 
| Format | Journal Article | 
| Language | English | 
| Published | Guangzhou; John Wiley & Sons, Inc; 01.12.2023; Wiley |
| ISSN | 2834-443X 2834-4391 2834-4405 |
| DOI | 10.1002/med4.39 | 
| Summary: | Background
A data mining algorithm is proposed based on BERTopic to provide new insights into the analysis of medication rules in Traditional Chinese Medicine (TCM) prescriptions.
Methods
Using the BERTopic algorithm, collected TCM prescriptions for corneal diseases are converted to embeddings by a transformer based on the Bidirectional Encoder Representations from Transformers (BERT) pre‐trained model. Uniform Manifold Approximation and Projection (UMAP) is then applied to reduce the dimensionality of the prescription embeddings. Subsequently, Hierarchical Density‐Based Spatial Clustering of Applications with Noise (HDBSCAN) is used for clustering. Finally, class‐based term frequency–inverse document frequency (c‐TF‐IDF) is used to generate several main drug combinations from the clustered results.
Results
The most frequently used drugs were Buddleja officinalis, Bidens pilosa, Angelica sinensis, Eriocaulon buergerianum, and Raw Rehmannia glutinosa. The most frequent drug combinations were “Eriocaulon buergerianum, Raw Rehmannia glutinosa, Prunella vulgaris, Notopterygium incisum,” “Lycii Fructus, Bidens pilosa, Buddleja officinalis,” and “Kochiae Fructus, Cortex Dictamni.”
Conclusions
The proposed data mining algorithm based on BERTopic demonstrated promising outcomes in the analysis of TCM prescription medication rules. This method exhibited simplicity and efficiency, thereby offering a novel avenue for analysis.
Our study uses the BERTopic algorithm to extract core prescriptions and then analyzes the correlation between drugs based on their frequency. First, prescriptions are treated as documents and embedded using the BERT pre‐trained model. Next, UMAP is used to reduce the dimensionality of these embedded prescriptions, and the reduced‐dimension embeddings are clustered via HDBSCAN. Finally, core drugs are generated as the topic words of each cluster using c‐TF‐IDF. | 
|---|---|
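
The final step of the pipeline described in the summary, scoring drugs per cluster with c‐TF‐IDF, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the cluster labels are already available (they stand in for HDBSCAN output), uses a small hypothetical set of prescriptions built from herb names mentioned in the summary, and applies the weighting described in the BERTopic documentation: normalized in-class frequency multiplied by log(1 + A / f_t), where A is the average number of tokens per class and f_t is the frequency of term t across all classes. The function names `c_tf_idf` and `top_herbs` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical toy data: (cluster_label, herbs in one prescription).
# The labels stand in for HDBSCAN output; they are NOT results from the study.
clustered_prescriptions = [
    (0, ["Eriocaulon buergerianum", "Raw Rehmannia glutinosa", "Prunella vulgaris"]),
    (0, ["Eriocaulon buergerianum", "Raw Rehmannia glutinosa", "Angelica sinensis"]),
    (1, ["Lycii Fructus", "Bidens pilosa", "Buddleja officinalis"]),
    (1, ["Lycii Fructus", "Buddleja officinalis", "Angelica sinensis"]),
]


def c_tf_idf(clustered):
    """Return {cluster: {herb: score}} using a class-based TF-IDF weighting."""
    # Merge all prescriptions of a cluster into one "class document".
    class_counts = {}
    for label, herbs in clustered:
        class_counts.setdefault(label, Counter()).update(herbs)

    # f_t: frequency of each herb across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of herb tokens per class.
    avg_tokens = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for label, counts in class_counts.items():
        class_size = sum(counts.values())
        scores[label] = {
            # Normalized in-class frequency times the class-based IDF term.
            herb: (n / class_size) * math.log(1 + avg_tokens / total_counts[herb])
            for herb, n in counts.items()
        }
    return scores


def top_herbs(scores, k=2):
    """Return the k highest-scoring herbs per cluster (ties broken by name)."""
    return {
        label: [h for h, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for label, s in scores.items()
    }


scores = c_tf_idf(clustered_prescriptions)
print(top_herbs(scores, k=2))
```

In this toy data, Angelica sinensis appears in both clusters, so it scores lower than a herb such as Prunella vulgaris that occurs equally often within one cluster but in only that cluster. This is the property that lets c‐TF‐IDF surface combinations characteristic of a cluster and not just globally frequent drugs.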