Towards self-learning based hypotheses generation in biomedical text domain
| Published in | Bioinformatics Vol. 34; no. 12; pp. 2103-2115 |
|---|---|
| Main Authors | , , , , | 
| Format | Journal Article | 
| Language | English | 
| Published | England: Oxford University Press, 15.06.2018 |
| ISSN | 1367-4803 1367-4811 1460-2059 |
| DOI | 10.1093/bioinformatics/btx837 | 
| Summary: | Abstract
Motivation
The overwhelming number of research articles in biomedicine can cause important connections to go unnoticed. Literature Based Discovery is a sub-field of biomedical text mining that peruses these articles to formulate high-confidence hypotheses on possible connections between medical concepts. Although many alternative methodologies have been proposed over the last decade, they still suffer from scalability issues. The primary reason, apart from the dense inter-connections between biological concepts, is the absence of information on the factors that lead to edge formation. In this work, we formulate this problem as a collaborative filtering task and leverage the relatively new concept of word vectors to learn and mimic the implicit edge-formation process. Together with a single-class classifier, this lets us prune redundant and irrelevant hypotheses from the search space, increasing the efficiency of the system while maintaining, and in some cases even boosting, the overall accuracy.
Results
We show that our proposed framework is able to prune up to 90% of the hypotheses while still retaining high recall in the top-K results. This level of efficiency enables the discovery algorithm to look for higher-order hypotheses, something that was infeasible until now. Furthermore, the generic formulation makes our approach flexible enough to perform both open and closed discovery. We also experimentally validate that the core data structures upon which the system bases its decisions have a high concordance with the opinion of experts. This, coupled with the ability to understand the edge-formation process, provides us with interpretable results without any manual intervention.
Availability and implementation
The relevant Java code is available at: https://github.com/vishrawas/Medline–Code_v2.
Supplementary information
Supplementary data are available at Bioinformatics online. | 
|---|---|