Accurately Quantifying a Billion Instances per Second
Published in | 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) pp. 1 - 10 |
---|---|
Main Authors | Waqar Hassan, Andre Maletzke, Gustavo Batista |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.10.2020 |
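Subjects | Machine Learning; Mixture Methods; Prediction algorithms; Probabilistic logic; Proposals; Quantification; Systematics; Task analysis; Time complexity; Training |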
DOI | 10.1109/DSAA49011.2020.00012 |
Abstract | Quantification is a thriving research area that develops methods to estimate the class prior probabilities in an unlabelled set of observations. Quantification and classification share several similarities. For instance, the most straightforward quantification method, Classify & Count (CC), directly counts the output of a classifier. However, CC has a systematic bias that makes it increasingly misestimate the counts as the class distribution drifts away from a distribution it perfectly quantifies. This issue has motivated the development of more reliable quantification methods. Such newer methods can consistently outperform CC at the cost of a significant increase in processing requirements. Yet, for a large number of applications, quantification speed is an additional criterion that must be considered. Frequently, quantification methods need to deal with large amounts of data or fast-paced streams, as is the case with news feeds, tweets, and sensor data. In this paper, we propose Sample Mean Matching (SMM), a highly efficient algorithm able to quantify billions of data instances per second. We compare SMM to a set of 14 established and state-of-the-art quantifiers in an empirical analysis comprising 25 benchmark and real-world datasets. We show that SMM is competitive with state-of-the-art methods with no statistical difference in counting accuracy, and it is orders of magnitude faster than the vast majority of the algorithms. |
---|---|
Author | Maletzke, Andre; Hassan, Waqar; Batista, Gustavo |
Author_xml | – sequence: 1 givenname: Waqar surname: Hassan fullname: Hassan, Waqar email: waqar@usp.br organization: ICMC-USP,São Carlos,Brazil – sequence: 2 givenname: Andre surname: Maletzke fullname: Maletzke, Andre email: andre.maletzke@unioeste.br organization: ICMC-USP,São Carlos,Brazil – sequence: 3 givenname: Gustavo surname: Batista fullname: Batista, Gustavo email: g.batista@unsw.edu.au organization: ICMC-USP,São Carlos,Brazil |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
EISBN | 1728182069 9781728182063 |
EndPage | 10 |
ExternalDocumentID | 9260028 |
Genre | orig-research |
GroupedDBID | 6IE 6IL CBEJK RIE RIL |
IEDL.DBID | RIE |
IngestDate | Thu Jun 29 18:37:56 EDT 2023 |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
PageCount | 10 |
ParticipantIDs | ieee_primary_9260028 |
PublicationCentury | 2000 |
PublicationDate | 2020-Oct. |
PublicationDateYYYYMMDD | 2020-10-01 |
PublicationDate_xml | – month: 10 year: 2020 text: 2020-Oct. |
PublicationDecade | 2020 |
PublicationTitle | 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) |
PublicationTitleAbbrev | DSAA |
PublicationYear | 2020 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
Score | 1.8476894 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 1 |
SubjectTerms | Machine Learning; Mixture Methods; Prediction algorithms; Probabilistic logic; Proposals; Quantification; Systematics; Task analysis; Time complexity; Training |
URI | https://ieeexplore.ieee.org/document/9260028 |
hasFullText | 1 |
inHoldings | 1 |
linkProvider | IEEE |
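
The abstract notes that Classify & Count (CC) directly counts the output of a classifier and that its bias grows as the class distribution drifts away from the one distribution it quantifies perfectly. The following is a minimal Python sketch of that behaviour for the binary case. It is an illustration only and is not taken from the paper: the function names and the classifier rates (`tpr = 0.8`, `fpr = 0.1`) are hypothetical, and the sketch assumes the classifier's true-positive and false-positive rates stay fixed while the class prior changes. It does not implement Sample Mean Matching, whose details the record does not give.

```python
def classify_and_count(predictions):
    """CC estimate: the fraction of instances predicted positive (labels are 0 or 1)."""
    return sum(predictions) / len(predictions)


def expected_cc_estimate(p, tpr, fpr):
    """Expected CC output when the true positive prevalence is p.

    Positives are predicted positive at rate tpr, negatives at rate fpr.
    """
    return p * tpr + (1 - p) * fpr


def unbiased_prevalence(tpr, fpr):
    """The one prevalence at which CC is exact in expectation.

    Solves p = p * tpr + (1 - p) * fpr. Undefined for a perfect
    classifier (tpr = 1, fpr = 0), for which CC is exact everywhere.
    """
    return fpr / (1 - tpr + fpr)


# Hypothetical classifier rates, chosen only for illustration.
tpr, fpr = 0.8, 0.1
p_star = unbiased_prevalence(tpr, fpr)

# The absolute error grows as the true prevalence moves away from p_star.
for p in (0.1, p_star, 0.6, 0.9):
    estimate = expected_cc_estimate(p, tpr, fpr)
    print(f"true={p:.3f}  cc={estimate:.3f}  abs_error={abs(estimate - p):.3f}")
```

Under these assumptions the expected error is proportional to the distance between the true prevalence and `p_star`, which is one way to read the abstract's remark about CC's systematic bias.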