Accurately Quantifying a Billion Instances per Second

Bibliographic Details
Published in 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), pp. 1-10
Main Authors Hassan, Waqar; Maletzke, Andre; Batista, Gustavo
Format Conference Proceeding
Language English
Published IEEE 01.10.2020
DOI 10.1109/DSAA49011.2020.00012

Abstract Quantification is a thriving research area that develops methods to estimate the class prior probabilities in an unlabelled set of observations. Quantification and classification share several similarities. For instance, the most straightforward quantification method, Classify & Count (CC), directly counts the output of a classifier. However, CC has a systematic bias that makes it increasingly misestimate the counts as the class distribution drifts away from the distribution it quantifies perfectly. This issue has motivated the development of more reliable quantification methods. Such newer methods can consistently outperform CC, at the cost of a significant increase in processing requirements. Yet for many applications, quantification speed is an additional criterion that must be considered. Quantification methods frequently need to deal with large amounts of data or fast-paced streams, as is the case with news feeds, tweets, and sensor data. In this paper, we propose Sample Mean Matching (SMM), a highly efficient algorithm able to quantify billions of data instances per second. We compare SMM to a set of 14 established and state-of-the-art quantifiers in an empirical analysis comprising 25 benchmark and real-world datasets. We show that SMM is competitive with state-of-the-art methods, with no statistically significant difference in counting accuracy, and that it is orders of magnitude faster than the vast majority of the algorithms.
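
The systematic bias of Classify & Count mentioned in the abstract can be illustrated with a small sketch for the binary case. This is not code from the paper, and it does not implement SMM, whose details are not given in this record. The function names and the classifier rates (TPR 0.8, FPR 0.1) are hypothetical values chosen only for illustration. The sketch relies on the standard identity that the expected CC estimate equals p * TPR + (1 - p) * FPR when the true positive prevalence is p.

```python
def classify_and_count(predictions):
    """CC estimate: the fraction of instances the classifier labels positive (1)."""
    return sum(predictions) / len(predictions)


def expected_cc_estimate(p, tpr, fpr):
    """Expected CC output when the true positive prevalence is p."""
    return p * tpr + (1 - p) * fpr


def unbiased_prevalence(tpr, fpr):
    """Prevalence at which CC is exact in expectation.

    Returns None for a perfect classifier (tpr == 1, fpr == 0),
    where CC is exact at every prevalence.
    """
    denominator = 1 - tpr + fpr
    if denominator == 0:
        return None
    return fpr / denominator


# Hypothetical classifier rates, assumed only for this illustration.
TPR, FPR = 0.8, 0.1

if __name__ == "__main__":
    print("CC is exact at prevalence", unbiased_prevalence(TPR, FPR))
    for p in (0.1, 1 / 3, 0.5, 0.9):
        estimate = expected_cc_estimate(p, TPR, FPR)
        print(f"true={p:.3f}  cc={estimate:.3f}  error={estimate - p:+.3f}")
```

Under these assumed rates, the error is zero at a single prevalence and grows linearly as the true prevalence moves away from it, which is the drift-dependent bias the abstract describes.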
Author Maletzke, Andre
Hassan, Waqar
Batista, Gustavo
Author_xml – sequence: 1
  givenname: Waqar
  surname: Hassan
  fullname: Hassan, Waqar
  email: waqar@usp.br
  organization: ICMC-USP,São Carlos,Brazil
– sequence: 2
  givenname: Andre
  surname: Maletzke
  fullname: Maletzke, Andre
  email: andre.maletzke@unioeste.br
  organization: ICMC-USP,São Carlos,Brazil
– sequence: 3
  givenname: Gustavo
  surname: Batista
  fullname: Batista, Gustavo
  email: g.batista@unsw.edu.au
  organization: ICMC-USP,São Carlos,Brazil
CODEN IEEPAD
ContentType Conference Proceeding
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
EISBN 1728182069
9781728182063
EndPage 10
ExternalDocumentID 9260028
Genre orig-research
IsPeerReviewed false
IsScholarly false
PageCount 10
PublicationDate 2020-10-01
PublicationTitle 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA)
PublicationTitleAbbrev DSAA
PublicationYear 2020
Publisher IEEE
StartPage 1
SubjectTerms Machine Learning
Mixture Methods
Prediction algorithms
Probabilistic logic
Proposals
Quantification
Systematics
Task analysis
Time complexity
Training
Title Accurately Quantifying a Billion Instances per Second
URI https://ieeexplore.ieee.org/document/9260028