An Interpretable Deep Embedding Model for Few and Imbalanced Biomedical Data

Bibliographic Details
Published in: IEEE Journal of Biomedical and Health Informatics, Vol. PP, pp. 1-8
Main Authors: Wang, Haishuai; Yang, Jianjun; Tao, Guangyu; Ma, Jiali; Chi, Lianhua; Wu, Jun; Zhao, Ziping
Format: Journal Article
Language: English
Published: United States, IEEE, 21.11.2022
ISSN: 2168-2194
EISSN: 2168-2208
DOI: 10.1109/JBHI.2022.3223798

Summary: In healthcare, training examples are usually hard to obtain (e.g., cases of a rare disease), or the cost of labelling data is high. With a large number of features (p) measured in a relatively small number of samples (N), the "big p, small N" problem is an important subject in healthcare studies, especially for genomic data. Another major challenge in effectively analyzing medical data is the skewed class distribution caused by the imbalance between different class labels. In addition, feature importance and interpretability play a crucial role in the success of solving medical problems. Therefore, in this paper, we present an interpretable deep embedding model (IDEM) to classify new data after having seen only a few training examples with a highly skewed class distribution. The IDEM model consists of a feature attention layer to learn informative features, a feature embedding layer to directly handle both numerical and categorical features, and a Siamese network with a contrastive loss to compare the similarity between the learned embeddings of two input samples. Experiments on both synthetic data and real-world medical data demonstrate that our IDEM model has better generalization power than conventional approaches with few and imbalanced medical training samples, and that it is able to identify which features contribute to the classifier in distinguishing case from control.
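
To make the described architecture concrete, below is a minimal PyTorch sketch of an IDEM-style model: a per-feature attention layer, an embedding layer for numerical and categorical inputs, and a Siamese encoder trained with a contrastive loss. This is an illustrative reconstruction based only on the abstract, not the authors' implementation; all class names (FeatureAttention, FeatureEmbedding, SiameseIDEM), hyperparameters, and the exact wiring of attention before embedding are assumptions.

# Hypothetical IDEM-style sketch; layer names and wiring are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAttention(nn.Module):
    """Learns a soft importance weight per input feature (supports interpretability)."""
    def __init__(self, num_features):
        super().__init__()
        self.score = nn.Linear(num_features, num_features)

    def forward(self, x):
        weights = torch.softmax(self.score(x), dim=-1)  # per-feature attention weights
        return x * weights, weights

class FeatureEmbedding(nn.Module):
    """Embeds categorical features and projects numerical ones into a shared space."""
    def __init__(self, num_numerical, categorical_cardinalities, embed_dim):
        super().__init__()
        self.cat_embeds = nn.ModuleList(
            nn.Embedding(card, embed_dim) for card in categorical_cardinalities
        )
        self.num_proj = nn.Linear(num_numerical, embed_dim)

    def forward(self, x_num, x_cat):
        cat_vecs = [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_embeds)]
        return self.num_proj(x_num) + sum(cat_vecs)

class SiameseIDEM(nn.Module):
    """Twin encoder: the same weights embed both samples of a pair."""
    def __init__(self, num_numerical, categorical_cardinalities, embed_dim=32):
        super().__init__()
        self.attention = FeatureAttention(num_numerical)
        self.embedding = FeatureEmbedding(num_numerical, categorical_cardinalities, embed_dim)
        self.encoder = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU())

    def encode(self, x_num, x_cat):
        x_num, attn = self.attention(x_num)
        return self.encoder(self.embedding(x_num, x_cat)), attn

    def forward(self, a_num, a_cat, b_num, b_cat):
        za, _ = self.encode(a_num, a_cat)
        zb, _ = self.encode(b_num, b_cat)
        return za, zb

def contrastive_loss(za, zb, same_class, margin=1.0):
    """Pulls same-class pairs together, pushes different-class pairs beyond a margin."""
    d = F.pairwise_distance(za, zb)
    return torch.mean(
        same_class * d.pow(2)
        + (1 - same_class) * torch.clamp(margin - d, min=0).pow(2)
    )

Pair-based training of this kind is one common way to handle few and imbalanced samples, since every case sample can be paired with many controls to form training pairs; the learned attention weights can then be inspected to see which features drive the case/control separation.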