Ten quick tips for sequence-based prediction of protein properties using machine learning
    
| Published in | PLoS computational biology, Vol. 18, no. 12 |
|---|---|
| Main Authors | Qingzhen Hou, Katharina Waury, Dea Gogishvili, K. Anton Feenstra |
| Format | Journal Article | 
| Language | English | 
| Published | Public Library of Science (PLoS), 2022-12-01 |
| ISSN | 1553-734X, 1553-7358 |
| DOI | 10.1371/journal.pcbi.1010669 | 
| Abstract | The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: If you sneak in structural information, your method is not sequence-based; if you compare your own model to “state-of-the-art,” take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead. |
| Author | Qingzhen Hou, Katharina Waury, Dea Gogishvili, K. Anton Feenstra |
    
| Discipline | Biology | 
    
| EISSN | 1553-7358 | 
    
| IsOpenAccess | true | 
    
| IsPeerReviewed | true | 
    
| OpenAccessLink | https://doaj.org/article/3a4ab1725b014b02a38562e926df1cd4 | 
    