HGNChelper: identification and correction of invalid gene symbols for human and mouse [version 2; peer review: 3 approved]
Published in | F1000 research Vol. 9; p. 1493 |
---|---|
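Main Authors | Oh, Sehyun; Abdelnabi, Jasmine; Al-Dulaimi, Ragheed; Aggarwal, Ayush; Ramos, Marcel; Davis, Sean; Riester, Markus; Waldron, Levi |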
Format | Journal Article |
Language | English |
Published | England: F1000 Research Limited (F1000 Research Ltd), 2020 |
ISSN | 2046-1402 |
DOI | 10.12688/f1000research.28033.2 |
Abstract | Gene symbols are recognizable identifiers for gene names but are unstable and error-prone due to aliasing, manual entry, and unintentional conversion by spreadsheets to date format. Official gene symbol resources such as HUGO Gene Nomenclature Committee (HGNC) for human genes and the Mouse Genome Informatics project (MGI) for mouse genes provide authoritative sources of valid, aliased, and outdated symbols, but lack a programmatic interface and correction of symbols converted by spreadsheets. We present HGNChelper, an R package that identifies known aliases and outdated gene symbols based on the HGNC human and MGI mouse gene symbol databases, in addition to common mislabeling introduced by spreadsheets, and provides corrections where possible. HGNChelper identified invalid gene symbols in the most recent Molecular Signatures Database (MSigDB 7.0) and in platform annotation files of the Gene Expression Omnibus, with prevalence ranging from ~3% in recent platforms to 30-40% in the earliest platforms from 2002-03. HGNChelper is installable from CRAN. |
Author | Davis, Sean Riester, Markus Aggarwal, Ayush Waldron, Levi Ramos, Marcel Oh, Sehyun Abdelnabi, Jasmine Al-Dulaimi, Ragheed |
Author_xml | 1. Oh, Sehyun (Institute for Implementation Science and Population Health, New York, 10027, USA); 2. Abdelnabi, Jasmine (Institute for Implementation Science and Population Health, New York, 10027, USA); 3. Al-Dulaimi, Ragheed (School of Medicine, University of Utah, Utah, 84132, USA); 4. Aggarwal, Ayush (ORCID 0000-0002-6587-3393; Academy of Scientific and Innovative Research, Ghaziabad, Uttar Pradesh, 201 002, India); 5. Ramos, Marcel (Institute for Implementation Science and Population Health, New York, 10027, USA); 6. Davis, Sean (ORCID 0000-0002-8991-6458; Center for Cancer Research, National Cancer Institute, Maryland, 20892, USA); 7. Riester, Markus (ORCID 0000-0002-4759-8332; Novartis Institutes for BioMedical Research Incorporation, Massachusetts, 02139, USA); 8. Waldron, Levi (ORCID 0000-0003-2725-0694; email: Levi.Waldron@sph.cuny.edu; Institute for Implementation Science and Population Health, New York, 10027, USA) |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/33564398 (record in MEDLINE/PubMed) |
CitedBy_id | crossref_primary_10_1080_15592294_2024_2375022 |
Cites_doi | 10.1093/nar/gkw1033 10.1186/1471-2105-5-80 10.1093/jnci/dju049 10.1073/pnas.222373899 10.1093/bioinformatics/btm254 10.1038/s41598-019-52000-3 10.1038/s41588-020-0669-3 10.1093/nar/gkv007 10.1186/s13059-016-1044-7 10.1093/bioinformatics/btr260 10.1093/nar/gky1056 10.1093/nar/gkp1015 |
ContentType | Journal Article |
Copyright | Copyright: © 2022 Oh S et al. |
DatabaseName | F1000Research Faculty of 1000 CrossRef PubMed MEDLINE - Academic PubMed Central (Full Participant titles) DOAJ Directory of Open Access Journals |
Discipline | Medicine Women's Studies |
EISSN | 2046-1402 |
ExternalDocumentID | oai_doaj_org_article_1751749a14ab4423ba9337547cd33f78 PMC7856679 33564398 10_12688_f1000research_28033_2 |
Genre | Journal Article |
GrantInformation_xml | – fundername: National Institutes of Health grantid: U24-CA180996 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Keywords | molecular biology; HGNC; gene symbols; MGI |
Language | English |
License | http://creativecommons.org/licenses/by/4.0/: This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Copyright: © 2022 Oh S et al. |
Notes | New version. No competing interests were disclosed. |
ORCID | 0000-0002-4759-8332 0000-0003-2725-0694 0000-0002-6587-3393 0000-0002-8991-6458 |
OpenAccessLink | http://journals.scholarsportal.info/openUrl.xqy?doi=10.12688/f1000research.28033.2 |
PMID | 33564398 |
PublicationDate | 2020 |
PublicationPlace | England; London, UK |
PublicationTitle | F1000 research |
PublicationTitleAlternate | F1000Res |
PublicationYear | 2020 |
Publisher | F1000 Research Limited (F1000 Research Ltd) |
StartPage | 1493 |
SubjectTerms | gene symbols; HGNC; MGI; molecular biology; Software Tool |
Title | HGNChelper: identification and correction of invalid gene symbols for human and mouse [version 2; peer review: 3 approved] |
URI | http://dx.doi.org/10.12688/f1000research.28033.2 https://www.ncbi.nlm.nih.gov/pubmed/33564398 https://www.proquest.com/docview/3182763657 https://pubmed.ncbi.nlm.nih.gov/PMC7856679 https://doaj.org/article/1751749a14ab4423ba9337547cd33f78 |
Volume | 9 |
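
The abstract describes three kinds of check: recognizing approved symbols, mapping known aliases or outdated symbols to current ones, and undoing spreadsheet date conversion. The following is a minimal Python sketch of that idea, not the package's actual implementation (HGNChelper itself is an R package). The lookup tables `APPROVED` and `ALIASES`, the helper `undo_date`, and the function `check_symbol` are hypothetical names introduced here for illustration; the real package draws on the full HGNC and MGI databases rather than a hand-written table.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumption: a real tool
# would load the complete HGNC/MGI symbol maps instead).
APPROVED = {"TP53", "BRCA1", "SEPTIN2", "MARCHF1"}
ALIASES = {"P53": "TP53", "SEPT2": "SEPTIN2", "MARCH1": "MARCHF1"}

# Spreadsheets can turn symbols such as SEPT2 or MARCH1 into dates
# displayed as "2-Sep" or "1-Mar"; this pattern recognizes that form.
MONTH_PREFIX = {"Sep": "SEPT", "Mar": "MARCH"}
DATE_RE = re.compile(r"^(\d{1,2})-(Sep|Mar)$", re.IGNORECASE)


def undo_date(symbol):
    """Convert a date-like string back to the symbol it likely came from."""
    match = DATE_RE.match(symbol)
    if not match:
        return symbol
    day, month = match.groups()
    return MONTH_PREFIX[month.capitalize()] + str(int(day))


def check_symbol(symbol):
    """Return (input, is_approved_as_given, suggested_symbol_or_None)."""
    candidate = undo_date(symbol.strip()).upper()
    if candidate in APPROVED:
        # Approved only if the input already matched exactly.
        return (symbol, candidate == symbol, candidate)
    if candidate in ALIASES:
        return (symbol, False, ALIASES[candidate])
    return (symbol, False, None)


if __name__ == "__main__":
    for raw in ["TP53", "p53", "2-Sep", "1-Mar", "NOTAGENE"]:
        print(check_symbol(raw))
```

Returning the original input alongside a flag and a suggestion keeps the check non-destructive: the caller decides whether to accept a suggested correction, and symbols with no known mapping are reported as `None` rather than guessed.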