Ten quick tips for sequence-based prediction of protein properties using machine learning

Bibliographic Details
Published in PLoS computational biology Vol. 18; no. 12
Main Authors Qingzhen Hou, Katharina Waury, Dea Gogishvili, K. Anton Feenstra
Format Journal Article
Language English
Published Public Library of Science (PLoS) 01.12.2022
ISSN 1553-734X
EISSN 1553-7358
DOI 10.1371/journal.pcbi.1010669

Abstract The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: If you sneak in structural information, your method is not sequence-based; if you compare your own model to “state-of-the-art,” take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.
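
The abstract's advice to "obtain a significance estimate" before claiming one method is better than another can be illustrated with a small sketch. The example below is not taken from the article: it assumes a paired bootstrap over per-protein correctness (1 = correct, 0 = wrong) for two methods evaluated on the same test set, and the function name `paired_bootstrap_pvalue` and its parameters are hypothetical. Other tests, such as a permutation or McNemar test, may suit a given benchmark better.

```python
import random


def paired_bootstrap_pvalue(correct_a, correct_b, n_resamples=2000, seed=0):
    """Estimate how often method A fails to beat method B under resampling.

    correct_a and correct_b hold 1 (correct) or 0 (wrong) for the SAME test
    items, in the same order. The return value is the fraction of bootstrap
    resamples in which A is not strictly more accurate than B; small values
    suggest that A's advantage is unlikely to be a sampling artifact.
    (Hypothetical helper, for illustration only.)
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both methods must be scored on the same test items")
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n = len(correct_a)
    # Per-item difference: +1 if only A is correct, -1 if only B is correct.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the pairing intact.
        resampled_sum = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_sum <= 0:
            not_better += 1
    return not_better / n_resamples


if __name__ == "__main__":
    # Toy correctness vectors, invented for this example.
    method_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
    method_b = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]
    print(paired_bootstrap_pvalue(method_a, method_b))
```

Resampling whole test items, rather than the two score lists separately, keeps the comparison paired, which matters because both methods tend to succeed or fail on the same proteins.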
DatabaseName DOAJ Directory of Open Access Journals
Discipline Biology
Open Access Yes
Peer Reviewed Yes
Scholarly Yes
Open Access Link https://doaj.org/article/3a4ab1725b014b02a38562e926df1cd4