DataPerf: Benchmarks for Data-Centric AI Development
| Main Authors | Mazumder, Mark; Banbury, Colby; Yao, Xiaozhe; Karlaš, Bojan; Rojas, William Gaviria; Diamos, Sudnya; Diamos, Greg; He, Lynn; Parrish, Alicia; Kirk, Hannah Rose; Quaye, Jessica; Rastogi, Charvi; Kiela, Douwe; Jurado, David; Kanter, David; Mosquera, Rafael; Ciro, Juan; Aroyo, Lora; Acun, Bilge; Chen, Lingjiao; Raje, Mehul Smriti; Bartolo, Max; Eyuboglu, Sabri; Ghorbani, Amirata; Goodman, Emmett; Inel, Oana; Kane, Tariq; Kirkpatrick, Christine R; Kuo, Tzu-Sheng; Mueller, Jonas; Thrush, Tristan; Vanschoren, Joaquin; Warren, Margaret; Williams, Adina; Yeung, Serena; Ardalani, Newsha; Paritosh, Praveen; Bat-Leah, Lilith; Zhang, Ce; Zou, James; Wu, Carole-Jean; Coleman, Cody; Ng, Andrew; Mattson, Peter; Reddi, Vijay Janapa |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | 20.07.2022 |
| Subjects | Computer Science - Learning |
| Online Access | https://arxiv.org/abs/2207.10062 |
| DOI | 10.48550/arxiv.2207.10062 |
| Abstract | Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry. |
|---|---|
| Author | Ardalani, Newsha; Mueller, Jonas; Wu, Carole-Jean; Williams, Adina; Kirkpatrick, Christine R; Kiela, Douwe; Kanter, David; Inel, Oana; Vanschoren, Joaquin; Chen, Lingjiao; Parrish, Alicia; Ciro, Juan; Acun, Bilge; Rastogi, Charvi; Ng, Andrew; Diamos, Sudnya; Thrush, Tristan; He, Lynn; Ghorbani, Amirata; Bat-Leah, Lilith; Rojas, William Gaviria; Kane, Tariq; Kirk, Hannah Rose; Warren, Margaret; Banbury, Colby; Quaye, Jessica; Zou, James; Jurado, David; Aroyo, Lora; Diamos, Greg; Bartolo, Max; Coleman, Cody; Raje, Mehul Smriti; Yeung, Serena; Karlaš, Bojan; Goodman, Emmett; Mattson, Peter; Mosquera, Rafael; Eyuboglu, Sabri; Mazumder, Mark; Zhang, Ce; Yao, Xiaozhe; Paritosh, Praveen; Kuo, Tzu-Sheng; Reddi, Vijay Janapa |
| ContentType | Journal Article |
| Copyright | http://creativecommons.org/licenses/by/4.0 |
| DOI | 10.48550/arxiv.2207.10062 |
| DatabaseName | arXiv Computer Science; arXiv.org |
| ExternalDocumentID | 2207_10062 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| OpenAccessLink | https://arxiv.org/abs/2207.10062 |
| PublicationDate | 2022-07-20 |
| PublicationYear | 2022 |
| SecondaryResourceType | preprint |
| SourceID | arxiv |
| SourceType | Open Access Repository |
| SubjectTerms | Computer Science - Learning |
| Title | DataPerf: Benchmarks for Data-Centric AI Development |
| URI | https://arxiv.org/abs/2207.10062 |