Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

Bibliographic Details
Main Authors: Moore, A.; Lee, M. S.
Format: Journal Article
Language: English
Published: 28.02.1998
Subjects: Computer Science - Artificial Intelligence
Online Access: https://arxiv.org/abs/cs/9803102
DOI: 10.48550/arXiv.cs/9803102

Abstract: Journal of Artificial Intelligence Research, Vol. 8 (1998), 67-91. This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table. We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its leaves. We show how the ADtree can be used to accelerate Bayes net structure finding algorithms, rule learning algorithms, and feature selection algorithms, and we provide a number of empirical results comparing ADtree methods against traditional direct counting approaches. We also discuss the possible uses of ADtrees in other machine learning methods, and discuss the merits of ADtrees in comparison with alternative representations such as kd-trees, R-trees and Frequent Sets.
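To make the abstract's idea concrete, the following is a minimal, illustrative Python sketch of an ADtree-style cached-count structure, not the paper's implementation. It shows optimization (a), never storing zero counts, and (b), deducing the most common value (MCV) of each attribute by subtraction at query time; optimization (c), leaving the tree unexpanded near its leaves, is omitted for brevity. The names ADNode, VaryNode, and ad_count are hypothetical, and the data layout (lists of small integer attribute values) is an assumption.

    # Illustrative sketch of an ADtree-like structure for counting records
    # that match conjunctive queries over categorical attributes.

    class VaryNode:
        """Children of an ADNode for one attribute; the MCV child is omitted."""
        def __init__(self, records, data, attr, n_attrs):
            groups = {}
            for r in records:
                groups.setdefault(data[r][attr], []).append(r)
            # The most common value's subtree is never built; it is deduced later.
            self.mcv = max(groups, key=lambda v: len(groups[v]))
            self.children = {
                v: ADNode(rows, data, attr + 1, n_attrs)
                for v, rows in groups.items() if v != self.mcv
            }  # values with zero count simply never appear here

    class ADNode:
        """Caches the count of records matching one conjunctive query."""
        def __init__(self, records, data, start_attr, n_attrs):
            self.count = len(records)
            self.vary = {
                a: VaryNode(records, data, a, n_attrs)
                for a in range(start_attr, n_attrs)
            }

    def ad_count(node, query):
        """Count records matching `query` (dict: attribute index -> value)."""
        if not query:
            return 0 if node is None else node.count
        if node is None:
            return 0
        attr = min(query)                      # handle lowest-indexed attribute first
        rest = {a: v for a, v in query.items() if a != attr}
        vary = node.vary[attr]
        if query[attr] == vary.mcv:
            # Deduce the MCV count by subtracting the stored sibling counts.
            return ad_count(node, rest) - sum(
                ad_count(child, rest) for child in vary.children.values()
            )
        return ad_count(vary.children.get(query[attr]), rest)

    if __name__ == "__main__":
        # Tiny categorical dataset: 3 attributes per record.
        data = [(0, 1, 1), (0, 0, 1), (1, 1, 0), (0, 1, 0), (1, 1, 1)]
        tree = ADNode(list(range(len(data))), data, 0, 3)
        print(ad_count(tree, {0: 0, 1: 1}))    # -> 2 records with a0=0 and a1=1
        print(ad_count(tree, {2: 1}))          # -> 3 records with a2=1

In this sketch, answering a count query reads only cached counts rather than scanning the records, which is the sense in which, under the paper's assumptions, query cost becomes independent of the number of records and depends instead on the non-zero entries of the contingency table.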