HGNChelper: identification and correction of invalid gene symbols for human and mouse [version 2; peer review: 3 approved]

Bibliographic Details
Published in F1000 research Vol. 9; p. 1493
Main Authors Oh, Sehyun, Abdelnabi, Jasmine, Al-Dulaimi, Ragheed, Aggarwal, Ayush, Ramos, Marcel, Davis, Sean, Riester, Markus, Waldron, Levi
Format Journal Article
Language English
Published England F1000 Research Limited 2020
F1000 Research Ltd
ISSN 2046-1402
DOI 10.12688/f1000research.28033.2

Abstract Gene symbols are recognizable identifiers for gene names but are unstable and error-prone due to aliasing, manual entry, and unintentional conversion by spreadsheets to date format. Official gene symbol resources such as HUGO Gene Nomenclature Committee (HGNC) for human genes and the Mouse Genome Informatics project (MGI) for mouse genes provide authoritative sources of valid, aliased, and outdated symbols, but lack a programmatic interface and correction of symbols converted by spreadsheets. We present HGNChelper, an R package that identifies known aliases and outdated gene symbols based on the HGNC human and MGI mouse gene symbol databases, in addition to common mislabeling introduced by spreadsheets, and provides corrections where possible. HGNChelper identified invalid gene symbols in the most recent Molecular Signatures Database (MSigDB 7.0) and in platform annotation files of the Gene Expression Omnibus, with prevalence ranging from ~3% in recent platforms to 30-40% in the earliest platforms from 2002-03. HGNChelper is installable from CRAN.
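The checking step described in the abstract can be sketched in a few lines. This is a minimal illustration only, not HGNChelper's implementation or API: the function name `check_symbols` and the three small lookup tables are hypothetical and hand-written, whereas the package itself is written in R and draws on the full HGNC and MGI databases. The sketch assumes each input is either an approved symbol, a spreadsheet-mangled date string, or a known alias or previous symbol, and reports a suggestion where one exists.

```python
# Toy lookup tables written by hand for illustration only; they are NOT
# HGNChelper's data. A real workflow would use the full HGNC or MGI maps.
APPROVED = {"TP53", "SEPTIN1", "MARCHF1"}
ALIASES_AND_PREVIOUS = {"P53": "TP53", "SEPT1": "SEPTIN1", "MARCH1": "MARCHF1"}
SPREADSHEET_DATES = {"1-Sep": "SEPTIN1", "1-Mar": "MARCHF1"}


def check_symbols(symbols):
    """Return a list of (input, is_approved, suggestion) tuples."""
    results = []
    for symbol in symbols:
        if symbol in APPROVED:
            # Already valid: the suggestion is the symbol itself.
            results.append((symbol, True, symbol))
            continue
        # Try the date-mangled form first, then a case-insensitive
        # lookup among aliases and previous symbols.
        suggestion = SPREADSHEET_DATES.get(symbol)
        if suggestion is None:
            suggestion = ALIASES_AND_PREVIOUS.get(symbol.upper())
        results.append((symbol, False, suggestion))
    return results


if __name__ == "__main__":
    for row in check_symbols(["TP53", "1-Sep", "p53", "NOTAGENE"]):
        print(row)
```

A symbol with no match keeps a suggestion of `None`, so uncorrectable entries stay visible for manual review instead of being dropped silently.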
Author Davis, Sean
Riester, Markus
Aggarwal, Ayush
Waldron, Levi
Ramos, Marcel
Oh, Sehyun
Abdelnabi, Jasmine
Al-Dulaimi, Ragheed
Author_xml – sequence: 1
  givenname: Sehyun
  surname: Oh
  fullname: Oh, Sehyun
  organization: Institute for Implementation Science and Population Health, New York, 10027, USA
– sequence: 2
  givenname: Jasmine
  surname: Abdelnabi
  fullname: Abdelnabi, Jasmine
  organization: Institute for Implementation Science and Population Health, New York, 10027, USA
– sequence: 3
  givenname: Ragheed
  surname: Al-Dulaimi
  fullname: Al-Dulaimi, Ragheed
  organization: School of Medicine, University of Utah, Utah, 84132, USA
– sequence: 4
  givenname: Ayush
  orcidid: 0000-0002-6587-3393
  surname: Aggarwal
  fullname: Aggarwal, Ayush
  organization: Academy of Scientific and Innovative Research, Ghaziabad, Uttar Pradesh, 201 002, India
– sequence: 5
  givenname: Marcel
  surname: Ramos
  fullname: Ramos, Marcel
  organization: Institute for Implementation Science and Population Health, New York, 10027, USA
– sequence: 6
  givenname: Sean
  orcidid: 0000-0002-8991-6458
  surname: Davis
  fullname: Davis, Sean
  organization: Center for Cancer Research, National Cancer Institute, Maryland, 20892, USA
– sequence: 7
  givenname: Markus
  orcidid: 0000-0002-4759-8332
  surname: Riester
  fullname: Riester, Markus
  organization: Novartis Institutes for BioMedical Research Incorporation, Massachusetts, 02139, USA
– sequence: 8
  givenname: Levi
  orcidid: 0000-0003-2725-0694
  surname: Waldron
  fullname: Waldron, Levi
  email: Levi.Waldron@sph.cuny.edu
  organization: Institute for Implementation Science and Population Health, New York, 10027, USA
BackLink https://www.ncbi.nlm.nih.gov/pubmed/33564398 (View this record in MEDLINE/PubMed)
CitedBy_id crossref_primary_10_1080_15592294_2024_2375022
Cites_doi 10.1093/nar/gkw1033
10.1186/1471-2105-5-80
10.1093/jnci/dju049
10.1073/pnas.222373899
10.1093/bioinformatics/btm254
10.1038/s41598-019-52000-3
10.1038/s41588-020-0669-3
10.1093/nar/gkv007
10.1186/s13059-016-1044-7
10.1093/bioinformatics/btr260
10.1093/nar/gky1056
10.1093/nar/gkp1015
ContentType Journal Article
Copyright © 2022 Oh S et al.
DatabaseName F1000Research
Faculty of 1000
CrossRef
PubMed
MEDLINE - Academic
PubMed Central (Full Participant titles)
DOAJ Directory of Open Access Journals
DatabaseTitle CrossRef
PubMed
MEDLINE - Academic
DatabaseTitleList
MEDLINE - Academic


PubMed
CrossRef
Database_xml – sequence: 1
  dbid: DOA
  name: DOAJ (Directory of Open Access Journals)
  url: https://www.doaj.org/
  sourceTypes: Open Website
– sequence: 2
  dbid: NPM
  name: PubMed
  url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
DeliveryMethod fulltext_linktorsrc
Discipline Medicine
Women's Studies
EISSN 2046-1402
ExternalDocumentID oai_doaj_org_article_1751749a14ab4423ba9337547cd33f78
PMC7856679
33564398
10_12688_f1000research_28033_2
Genre Journal Article
GrantInformation_xml – fundername: National Institutes of Health
  grantid: U24-CA180996
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Keywords molecular biology
HGNC
gene symbols
MGI
Language English
License http://creativecommons.org/licenses/by/4.0/: This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Notes No competing interests were disclosed.
ORCID 0000-0002-4759-8332
0000-0003-2725-0694
0000-0002-6587-3393
0000-0002-8991-6458
OpenAccessLink http://journals.scholarsportal.info/openUrl.xqy?doi=10.12688/f1000research.28033.2
PMID 33564398
PublicationDate 2020
PublicationPlace England
PublicationPlace_xml – name: England
– name: London, UK
PublicationTitle F1000 research
PublicationTitleAlternate F1000Res
PublicationYear 2020
Publisher F1000 Research Limited
F1000 Research Ltd
Publisher_xml – name: F1000 Research Limited
– name: F1000 Research Ltd
References_xml – volume: 45
  start-page: D619-D625
  year: 2017
  ident: ref-5
  article-title: Genenames.org: the HGNC and VGNC resources in 2017.
  publication-title: Nucleic Acids Res.
  doi: 10.1093/nar/gkw1033
– volume: 5
  start-page: 80
  year: 2004
  ident: ref-2
  article-title: Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics.
  publication-title: BMC Bioinformatics.
  doi: 10.1186/1471-2105-5-80
– volume: 106
  year: 2014
  ident: ref-13
  article-title: Comparative meta-analysis of prognostic gene signatures for late-stage ovarian cancer.
  publication-title: J Natl Cancer Inst.
  doi: 10.1093/jnci/dju049
– volume: 99
  start-page: 14065-70
  year: 2002
  ident: ref-1
  article-title: Structure of the GCN5 histone acetyltransferase bound to a bisubstrate inhibitor.
  publication-title: Proc Natl Acad Sci U S A.
  doi: 10.1073/pnas.222373899
– volume: 23
  start-page: 1846-1847
  year: 2007
  ident: ref-8
  article-title: GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor.
  publication-title: Bioinformatics.
  doi: 10.1093/bioinformatics/btm254
– volume: 9
  start-page: 17052
  year: 2019
  ident: ref-11
  article-title: Development and validation of a targeted gene sequencing panel for application to disparate cancers.
  publication-title: Sci Rep.
  doi: 10.1038/s41598-019-52000-3
– volume: 52
  start-page: 754-758
  year: 2020
  ident: ref-4
  article-title: Guidelines for human gene nomenclature.
  publication-title: Nat Genet.
  doi: 10.1038/s41588-020-0669-3
– ident: ref-7
  article-title: Home | HUGO Gene Nomenclature Committee.
– volume: 43
  start-page: e47
  year: 2015
  ident: ref-10
  article-title: limma powers differential expression analyses for RNA-sequencing and microarray studies.
  publication-title: Nucleic Acids Res.
  doi: 10.1093/nar/gkv007
– volume: 17
  start-page: 177
  year: 2016
  ident: ref-3
  article-title: Gene name errors are widespread in the scientific literature.
  publication-title: Genome Biol.
  doi: 10.1186/s13059-016-1044-7
– volume: 27
  start-page: 1739-1740
  year: 2011
  ident: ref-9
  article-title: Molecular signatures database (MSigDB) 3.0.
  publication-title: Bioinformatics.
  doi: 10.1093/bioinformatics/btr260
– volume: 47
  start-page: D801-D806
  year: 2019
  ident: ref-6
  article-title: Mouse Genome Database (MGD) 2019.
  publication-title: Nucleic Acids Res.
  doi: 10.1093/nar/gky1056
– volume: 38
  start-page: D716-25
  year: 2010
  ident: ref-12
  article-title: GeneSigDB--a curated database of gene expression signatures.
  publication-title: Nucleic Acids Res.
  doi: 10.1093/nar/gkp1015
StartPage 1493
SubjectTerms eng
gene symbols
HGNC
MGI
molecular biology
Software Tool
Title HGNChelper: identification and correction of invalid gene symbols for human and mouse [version 2; peer review: 3 approved]
URI http://dx.doi.org/10.12688/f1000research.28033.2
https://www.ncbi.nlm.nih.gov/pubmed/33564398
https://www.proquest.com/docview/3182763657
https://pubmed.ncbi.nlm.nih.gov/PMC7856679
https://doaj.org/article/1751749a14ab4423ba9337547cd33f78
Volume 9