DataPerf: Benchmarks for Data-Centric AI Development

Bibliographic Details
Main Authors: Mazumder, Mark; Banbury, Colby; Yao, Xiaozhe; Karlaš, Bojan; Rojas, William Gaviria; Diamos, Sudnya; Diamos, Greg; He, Lynn; Parrish, Alicia; Kirk, Hannah Rose; Quaye, Jessica; Rastogi, Charvi; Kiela, Douwe; Jurado, David; Kanter, David; Mosquera, Rafael; Ciro, Juan; Aroyo, Lora; Acun, Bilge; Chen, Lingjiao; Raje, Mehul Smriti; Bartolo, Max; Eyuboglu, Sabri; Ghorbani, Amirata; Goodman, Emmett; Inel, Oana; Kane, Tariq; Kirkpatrick, Christine R; Kuo, Tzu-Sheng; Mueller, Jonas; Thrush, Tristan; Vanschoren, Joaquin; Warren, Margaret; Williams, Adina; Yeung, Serena; Ardalani, Newsha; Paritosh, Praveen; Bat-Leah, Lilith; Zhang, Ce; Zou, James; Wu, Carole-Jean; Coleman, Cody; Ng, Andrew; Mattson, Peter; Reddi, Vijay Janapa
Format: Journal Article (arXiv preprint)
Language: English
Published: 20.07.2022
Subjects: Computer Science - Learning
Online Access: https://arxiv.org/abs/2207.10062
DOI: 10.48550/arxiv.2207.10062
Copyright: http://creativecommons.org/licenses/by/4.0


Abstract
Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry.
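To make the abstract's idea of iterating on datasets rather than architectures concrete, the sketch below shows one common data-centric evaluation pattern: the model architecture and held-out evaluation set stay fixed, and competing training sets (for example, ones produced by different selection or cleaning algorithms) are scored by the accuracy of the fixed model trained on them. This is an illustrative sketch only, not the DataPerf harness or its API; the function and variable names are hypothetical, and scikit-learn's LogisticRegression merely stands in for whatever reference model a benchmark might fix.

```python
# Illustrative sketch of a data-centric benchmark loop (hypothetical, not the DataPerf API):
# the model and evaluation set are frozen; only the training data varies between submissions.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def score_training_set(X_train, y_train, X_eval, y_eval):
    """Train the fixed reference model on a candidate training set and
    return its accuracy on the frozen evaluation set."""
    model = LogisticRegression(max_iter=1000)  # same architecture for every submission
    model.fit(X_train, y_train)
    return accuracy_score(y_eval, model.predict(X_eval))


# Hypothetical usage: compare a raw training set against one produced by a
# data-selection or data-cleaning algorithm; the higher-scoring data wins.
# score_raw     = score_training_set(X_raw, y_raw, X_eval, y_eval)
# score_cleaned = score_training_set(X_cleaned, y_cleaned, X_eval, y_eval)
```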