Accurately Quantifying a Billion Instances per Second

Bibliographic Details
Published in	2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) pp. 1 - 10
Main Authors	Hassan, Waqar; Maletzke, Andre; Batista, Gustavo
Format	Conference Proceeding
Language	English
Published	IEEE 01.10.2020
Subjects	Machine Learning; Mixture Methods; Prediction algorithms; Probabilistic logic; Proposals; Quantification; Systematics; Task analysis; Time complexity; Training
DOI	10.1109/DSAA49011.2020.00012

Abstract	Quantification is a thriving research area that develops methods to estimate the class prior probabilities in an unlabelled set of observations. Quantification and classification share several similarities. For instance, the most straightforward quantification method, Classify & Count (CC), directly counts the output of a classifier. However, CC has a systematic bias that makes it increasingly misestimate the counts as the class distribution drifts away from the distribution it quantifies perfectly. This issue has motivated the development of more reliable quantification methods. Such newer methods can consistently outperform CC, but at the cost of a significant increase in processing requirements. Yet, for a large number of applications, quantification speed is an additional criterion that must be considered. Quantification methods frequently need to deal with large amounts of data or fast-paced streams, as is the case with news feeds, tweets, and sensor data. In this paper, we propose Sample Mean Matching (SMM), a highly efficient algorithm able to quantify billions of data instances per second. We compare SMM to a set of 14 established and state-of-the-art quantifiers in an empirical analysis comprising 25 benchmark and real-world datasets. We show that SMM is competitive with state-of-the-art methods, with no statistical difference in counting accuracy, and that it is orders of magnitude faster than the vast majority of the algorithms.
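Illustrative note: the abstract's remark about the systematic bias of Classify & Count can be made concrete with a small sketch. The code below is not taken from the paper and does not implement SMM. It assumes a binary classifier whose true positive rate and false positive rate stay fixed as the class distribution changes, and the rates used (0.8 and 0.1) are hypothetical values chosen only for the example. Under that assumption, the expected CC estimate is p * tpr + (1 - p) * fpr, which equals the true prevalence p at exactly one value of p.

```python
from typing import Sequence


def classify_and_count(predictions: Sequence[int]) -> float:
    """Classify & Count: the fraction of instances predicted positive (label 1)."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    return sum(1 for y in predictions if y == 1) / len(predictions)


def expected_cc_estimate(true_prevalence: float, tpr: float, fpr: float) -> float:
    """Expected CC output when the classifier's tpr and fpr are fixed (assumption)."""
    return true_prevalence * tpr + (1.0 - true_prevalence) * fpr


def unbiased_prevalence(tpr: float, fpr: float) -> float:
    """The single prevalence at which the expected CC estimate equals the truth.

    Solves p = p * tpr + (1 - p) * fpr for p.
    """
    return fpr / (1.0 - tpr + fpr)


# Hypothetical classifier rates, chosen only for illustration.
TPR, FPR = 0.8, 0.1

if __name__ == "__main__":
    p_star = unbiased_prevalence(TPR, FPR)
    print(f"CC is unbiased only at prevalence {p_star:.3f}")
    for p in (0.1, p_star, 0.5, 0.9):
        estimate = expected_cc_estimate(p, TPR, FPR)
        # The absolute error grows as p moves away from p_star.
        print(f"true={p:.3f} expected CC={estimate:.3f} error={abs(estimate - p):.3f}")
```

The error in this sketch grows linearly with the distance between the true prevalence and the unbiased point, which is the drift-dependent misestimation the abstract describes.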
Author Details	Hassan, Waqar (waqar@usp.br), ICMC-USP, São Carlos, Brazil; Maletzke, Andre (andre.maletzke@unioeste.br), ICMC-USP, São Carlos, Brazil; Batista, Gustavo (g.batista@unsw.edu.au), ICMC-USP, São Carlos, Brazil
EISBN	1728182069 9781728182063
URI	https://ieeexplore.ieee.org/document/9260028