DataPerf: Benchmarks for Data-Centric AI Development
| Main Authors | Mazumder, Mark; Banbury, Colby; Yao, Xiaozhe; Karlaš, Bojan; Rojas, William Gaviria; Diamos, Sudnya; Diamos, Greg; He, Lynn; Parrish, Alicia; Kirk, Hannah Rose; Quaye, Jessica; Rastogi, Charvi; Kiela, Douwe; Jurado, David; Kanter, David; Mosquera, Rafael; Ciro, Juan; Aroyo, Lora; Acun, Bilge; Chen, Lingjiao; Raje, Mehul Smriti; Bartolo, Max; Eyuboglu, Sabri; Ghorbani, Amirata; Goodman, Emmett; Inel, Oana; Kane, Tariq; Kirkpatrick, Christine R; Kuo, Tzu-Sheng; Mueller, Jonas; Thrush, Tristan; Vanschoren, Joaquin; Warren, Margaret; Williams, Adina; Yeung, Serena; Ardalani, Newsha; Paritosh, Praveen; Bat-Leah, Lilith; Zhang, Ce; Zou, James; Wu, Carole-Jean; Coleman, Cody; Ng, Andrew; Mattson, Peter; Reddi, Vijay Janapa |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | 20.07.2022 |
| Subjects | Computer Science - Learning |
| Online Access | https://arxiv.org/abs/2207.10062 |
| DOI | 10.48550/arxiv.2207.10062 |
| Abstract | Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry. |
|---|---|
| Author | Ardalani, Newsha; Mueller, Jonas; Wu, Carole-Jean; Williams, Adina; Kirkpatrick, Christine R; Kiela, Douwe; Kanter, David; Inel, Oana; Vanschoren, Joaquin; Chen, Lingjiao; Parrish, Alicia; Ciro, Juan; Acun, Bilge; Rastogi, Charvi; Ng, Andrew; Diamos, Sudnya; Thrush, Tristan; He, Lynn; Ghorbani, Amirata; Bat-Leah, Lilith; Rojas, William Gaviria; Kane, Tariq; Kirk, Hannah Rose; Warren, Margaret; Banbury, Colby; Quaye, Jessica; Zou, James; Jurado, David; Aroyo, Lora; Diamos, Greg; Bartolo, Max; Coleman, Cody; Raje, Mehul Smriti; Yeung, Serena; Karlaš, Bojan; Goodman, Emmett; Mattson, Peter; Mosquera, Rafael; Eyuboglu, Sabri; Mazumder, Mark; Zhang, Ce; Yao, Xiaozhe; Paritosh, Praveen; Kuo, Tzu-Sheng; Reddi, Vijay Janapa |
| ContentType | Journal Article |
| Copyright | http://creativecommons.org/licenses/by/4.0 |
| DOI | 10.48550/arxiv.2207.10062 |
| DatabaseName | arXiv Computer Science; arXiv.org |
| ExternalDocumentID | 2207_10062 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| OpenAccessLink | https://arxiv.org/abs/2207.10062 |
| PublicationDate | 2022-07-20 |
| PublicationYear | 2022 |
| SecondaryResourceType | preprint |
| SourceID | arxiv |
| SourceType | Open Access Repository |
| SubjectTerms | Computer Science - Learning |
| Title | DataPerf: Benchmarks for Data-Centric AI Development |
| URI | https://arxiv.org/abs/2207.10062 |