Genomics Data Analysis via Spectral Shape and Topology

Bibliographic Details
Published in arXiv.org
Main Authors Amézquita, Erik J; Farzana Nasrin; Storey, Kathleen M; Yoshizawa, Masato
Format Paper, Journal Article
Language English
Published Ithaca: Cornell University Library, arXiv.org, 2022-11-02
ISSN 2331-8422
DOI 10.48550/arxiv.2211.00938

Abstract Mapper, a topological algorithm, is frequently used as an exploratory tool to build a graphical representation of data. This representation can help researchers better understand the intrinsic shape of high-dimensional genomic data and retain information that may be lost when using standard dimension-reduction algorithms. We propose a novel workflow that integrates Mapper and differential gene expression to process and analyze RNA-seq data from tumor and healthy subjects. Specifically, we show that a Gaussian mixture approximation method can be used to produce graphical structures that successfully separate tumor and healthy subjects and that yield two subgroups of tumor subjects. A further analysis using DESeq2, a popular tool for detecting differentially expressed genes, shows that these two subgroups of tumor cells show two distinct patterns of gene regulation, suggesting two discrete paths for forming lung cancer that other popular clustering methods, including t-SNE, could not highlight. Although Mapper shows promise in analyzing high-dimensional data, tools for statistically analyzing Mapper graphical structures remain limited in the existing literature. In this paper, we develop a scoring method using heat kernel signatures that provides an empirical setting for statistical inference, such as hypothesis testing, sensitivity analysis, and correlation analysis.
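The abstract mentions scoring Mapper graphs with heat kernel signatures. As a rough illustration only, the sketch below computes the standard heat kernel signature of each node of a small undirected graph from the eigendecomposition of its combinatorial Laplacian, HKS(v, t) = sum_i exp(-t * lambda_i) * phi_i(v)^2. It is not the authors' scoring method: the function name `heat_kernel_signature`, the toy path graph, and the chosen time values are assumptions made for this example.

```python
import numpy as np


def heat_kernel_signature(adjacency, times):
    """Return an (n_nodes, n_times) array of heat kernel signatures.

    Hypothetical helper for illustration; assumes an undirected graph
    given as a symmetric adjacency matrix.
    """
    A = np.asarray(adjacency, dtype=float)
    # Combinatorial graph Laplacian: degree matrix minus adjacency matrix.
    L = np.diag(A.sum(axis=1)) - A
    # L is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(L)
    # decay[i, k] = exp(-times[k] * eigvals[i])
    decay = np.exp(-np.outer(eigvals, np.asarray(times, dtype=float)))
    # HKS(v, t) = sum_i exp(-t * lambda_i) * phi_i(v)^2
    return (eigvecs ** 2) @ decay


# Toy example (assumed data): a path graph 0 - 1 - 2 - 3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
hks = heat_kernel_signature(adjacency, times=[0.0, 1.0, 10.0])
```

In this toy graph the two end nodes share one signature and the two interior nodes share another, which is the kind of node-level shape descriptor a graph-scoring scheme could build on.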
Published version https://doi.org/10.1371/journal.pone.0284820 (access to full text may be restricted)
arXiv version https://doi.org/10.48550/arXiv.2211.00938
Copyright 2022. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Subject Terms Algorithms
Clustering
Correlation analysis
Data analysis
Dimensional analysis
Empirical analysis
Gene expression
Graphical representations
Hypothesis testing
Mathematics - Algebraic Topology
Quantitative Biology - Genomics
Sensitivity analysis
Statistics - Other Statistics
Subgroups
Topology
Tumors
Workflow