Systematic tissue annotations of genomics samples by modeling unstructured metadata

Bibliographic Details
Published in	Nature communications Vol. 13; no. 1; pp. 6736 - 13
Main Authors	Hawkins, Nathaniel T., Maldaver, Marc, Yannakopoulos, Anna, Guare, Lindsay A., Krishnan, Arjun
Format	Journal Article
Language	English
Published	London: Nature Publishing Group UK, 08.11.2022; Nature Publishing Group; Nature Portfolio
Subjects	38; 45; 631/114/1305; 631/114/2164; 631/114/2401; 631/1647/514; Annotations; Biological activity; Classification; Data collection; Descriptions; Gene expression; Genomics; Humanities and Social Sciences; Humans; Language; Learning algorithms; Machine Learning; Metadata; multidisciplinary; Natural language; Natural Language Processing; Representations; Science; Science (multidisciplinary); Signal processing; String matching; Tissues; Unstructured data
ISSN	2041-1723
DOI	10.1038/s41467-022-34435-x

Summary:	There are currently >1.3 million human -omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples in this ever-growing data collection is a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the genomics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto. The 1+ million publicly available human -omics samples currently remain acutely underused. Here the authors present an approach combining natural language processing and machine learning to infer the source tissue of public genomics samples based on their plain text descriptions, making these samples easy to discover and reuse.
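Illustration:	The two-step idea in the summary (turn each free-text sample description into a numerical representation, then feed that representation to a supervised classifier for one tissue term) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' txt2onto code: the record does not say which representation or classifier NLP-ML uses, so a TF-IDF bag of words and a plain logistic regression are swapped in here as stand-ins, and the sample descriptions, labels, and every function name below are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a free-text description and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every training token."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def vectorize(text, idf):
    """Represent one description as a sparse, L2-normalized TF-IDF vector."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}


def predict_proba(vec, weights, bias):
    """Logistic model: probability that the description belongs to the tissue."""
    z = bias + sum(weights.get(tok, 0.0) * v for tok, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(vectors, labels, epochs=200, lr=0.5):
    """Fit a binary logistic regression with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = predict_proba(vec, weights, bias) - y
            for tok, v in vec.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * v
            bias -= lr * err
    return weights, bias


# Invented toy metadata (assumption): 1 = annotated to "liver", 0 = other tissue.
train_texts = [
    "hepatocytes isolated from adult liver biopsy",
    "liver tissue, hepatocellular carcinoma patient",
    "primary hepatocyte culture from donor liver",
    "whole blood collected from healthy donor",
    "peripheral blood mononuclear cells, control",
    "cortex sample from post-mortem brain",
]
train_labels = [1, 1, 1, 0, 0, 0]

idf = fit_idf(train_texts)
weights, bias = train_logreg([vectorize(t, idf) for t in train_texts], train_labels)


def predict_liver(text):
    """Score an unseen description against the toy one-vs-rest liver model."""
    return predict_proba(vectorize(text, idf), weights, bias)


for query in ("liver biopsy from patient with steatosis",
              "blood sample from healthy control"):
    print(f"{predict_liver(query):.3f}  {query}")
```

One such binary model per tissue or cell-type term gives a one-vs-rest setup, which is one plausible reading of "a supervised learning classifier that predicts tissue/cell-type terms"; the trained models in the linked repository should be treated as the reference.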