scValue: value-based subsampling of large-scale single-cell transcriptomic data for machine and deep learning tasks

Bibliographic Details
Published in	Briefings in bioinformatics Vol. 26; no. 3
Main Authors	Huang, Li; Gong, Weikang; Chen, Dongsheng
Format	Journal Article
Language	English
Published	England: Oxford University Press, 01.05.2025; Oxford Publishing Limited (England)
Subjects	Algorithms; Annotations; Biological effects; Cognitive tasks; Computational Biology - methods; Computer applications; Computing time; Datasets; Deconvolution; Deep Learning; Gene Expression Profiling; Humans; Labels; Learning algorithms; Lymphocytes T; Machine Learning; Metric space; Problem Solving; Protocol; Representations; RNA-Seq; Single-Cell Analysis - methods; Software; Source code; Transcriptome; Transcriptomics; Transfer learning; machine and deep learning; single-cell transcriptomics; cell type analysis; data valuation; subsampling
ISSN	1467-5463; 1477-4054
DOI	10.1093/bib/bbaf279

Summary:	Large single-cell ribonucleic acid-sequencing (scRNA-seq) datasets offer unprecedented biological insights but present substantial computational challenges for visualization and analysis. While existing subsampling methods can enhance efficiency, they may not ensure optimal performance in downstream machine learning and deep learning (ML/DL) tasks. Here, we introduce scValue, a novel approach that ranks individual cells by ‘data value’ using out-of-bag estimates from a random forest model. scValue prioritizes high-value cells and allocates greater representation to cell types with higher variability in data value, effectively preserving key biological signals within subsamples. We benchmarked scValue on automatic cell-type annotation tasks across four large datasets, paired with distinct ML/DL models. Our method consistently outperformed existing subsampling methods, closely matching full-data performance across all annotation tasks. In three additional case studies (label transfer learning, cross-study label harmonization, and bulk RNA-seq deconvolution), scValue more effectively preserved T-cell annotations across human gut-colon datasets, more accurately reproduced T-cell subtype relationships in a human spleen dataset, and constructed a more reliable single-cell immune reference for cell-type deconvolution in simulated bulk tissue samples. Finally, using 16 public datasets ranging from tens of thousands to millions of cells, we evaluated subsampling quality based on computational time, Gini coefficient, and Hausdorff distance. scValue demonstrated fast execution, well-balanced cell-type representation, and distributional properties akin to uniform sampling. Overall, scValue provides a robust and scalable solution for subsampling large scRNA-seq data in ML/DL workflows. It is available as an open-source Python package installable via pip, with source code at https://github.com/LHBCB/scvalue.
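Illustrative note: the summary describes two ideas, prioritizing high-value cells and giving more of the subsample to cell types whose data values vary more. The sketch below shows one plausible reading of that allocation step and is not the scvalue package's API. The function name `value_weighted_subsample`, the use of the standard deviation as the variability measure, and the rule of keeping at least one cell per type are all assumptions made for illustration. The per-cell data values are taken as given; in the paper they come from random forest out-of-bag estimates, which this sketch does not compute.

```python
from collections import defaultdict
from statistics import pstdev


def value_weighted_subsample(values, labels, n_total):
    """Return sorted indices of roughly n_total cells (hypothetical helper).

    values: per-cell data values, assumed to be precomputed elsewhere
    labels: per-cell cell-type labels, same length as values
    """
    # Group cell indices by cell type.
    by_type = defaultdict(list)
    for idx, label in enumerate(labels):
        by_type[label].append(idx)

    # Assumption: a type's weight is the standard deviation of its data values.
    spread = {t: pstdev([values[i] for i in idxs]) for t, idxs in by_type.items()}
    total = sum(spread.values())

    selected = []
    for t, idxs in by_type.items():
        # Fall back to size-proportional shares if every type has zero spread.
        share = spread[t] / total if total > 0 else len(idxs) / len(labels)
        # Assumption: keep at least one cell per type, never more than it has.
        quota = min(len(idxs), max(1, round(n_total * share)))
        # Within a type, take the highest-valued cells first.
        ranked = sorted(idxs, key=lambda i: values[i], reverse=True)
        selected.extend(ranked[:quota])
    return sorted(selected)


values = [0.9, 0.1, 0.5, 0.3, 0.8, 0.7, 0.8, 0.7]
labels = ["T", "T", "T", "T", "B", "B", "B", "B"]
picked = value_weighted_subsample(values, labels, n_total=4)
print(picked)  # [0, 2, 3, 4]
```

In this toy input the "T" cells have far more spread in data value than the "B" cells, so three of the four selected cells come from "T". Because quotas are rounded per type, the total can differ slightly from `n_total`.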