Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

Bibliographic Details
Main Authors: Moore, A.; Lee, M. S.
Format: Journal Article
Language: English
Published: 28.02.1998
Subjects: Computer Science - Artificial Intelligence
Online Access: https://arxiv.org/abs/cs/9803102
DOI: 10.48550/arXiv.cs/9803102

Abstract: Journal of Artificial Intelligence Research, Vol. 8 (1998), 67-91. This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table. We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its leaves. We show how the ADtree can be used to accelerate Bayes net structure finding algorithms, rule learning algorithms, and feature selection algorithms, and we provide a number of empirical results comparing ADtree methods against traditional direct counting approaches. We also discuss the possible uses of ADtrees in other machine learning methods, and discuss the merits of ADtrees in comparison with alternative representations such as kd-trees, R-trees and Frequent Sets.
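To make the abstract's idea concrete, the following is a minimal, illustrative Python sketch of an ADtree-style cached-count structure, not the paper's implementation. It shows optimization (a), never storing zero counts, and (b), deducing the most common value (MCV) of each attribute by subtraction at query time; optimization (c), leaving the tree unexpanded near its leaves, is omitted for brevity. The names ADNode, VaryNode, and ad_count are hypothetical, and the data layout (lists of small integer attribute values) is an assumption.

    # Illustrative sketch of an ADtree-like structure for counting records
    # that match conjunctive queries over categorical attributes.

    class VaryNode:
        """Children of an ADNode for one attribute; the MCV child is omitted."""
        def __init__(self, records, data, attr, n_attrs):
            groups = {}
            for r in records:
                groups.setdefault(data[r][attr], []).append(r)
            # The most common value's subtree is never built; it is deduced later.
            self.mcv = max(groups, key=lambda v: len(groups[v]))
            self.children = {
                v: ADNode(rows, data, attr + 1, n_attrs)
                for v, rows in groups.items() if v != self.mcv
            }  # values with zero count simply never appear here

    class ADNode:
        """Caches the count of records matching one conjunctive query."""
        def __init__(self, records, data, start_attr, n_attrs):
            self.count = len(records)
            self.vary = {
                a: VaryNode(records, data, a, n_attrs)
                for a in range(start_attr, n_attrs)
            }

    def ad_count(node, query):
        """Count records matching `query` (dict: attribute index -> value)."""
        if not query:
            return 0 if node is None else node.count
        if node is None:
            return 0
        attr = min(query)                      # handle lowest-indexed attribute first
        rest = {a: v for a, v in query.items() if a != attr}
        vary = node.vary[attr]
        if query[attr] == vary.mcv:
            # Deduce the MCV count by subtracting the stored sibling counts.
            return ad_count(node, rest) - sum(
                ad_count(child, rest) for child in vary.children.values()
            )
        return ad_count(vary.children.get(query[attr]), rest)

    if __name__ == "__main__":
        # Tiny categorical dataset: 3 attributes per record.
        data = [(0, 1, 1), (0, 0, 1), (1, 1, 0), (0, 1, 0), (1, 1, 1)]
        tree = ADNode(list(range(len(data))), data, 0, 3)
        print(ad_count(tree, {0: 0, 1: 1}))    # -> 2 records with a0=0 and a1=1
        print(ad_count(tree, {2: 1}))          # -> 3 records with a2=1

In this sketch, answering a count query reads only cached counts rather than scanning the records, which is the sense in which, under the paper's assumptions, query cost becomes independent of the number of records and depends instead on the non-zero entries of the contingency table.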