Discovering dependencies with reliable mutual information

Bibliographic Details
Published in: Knowledge and Information Systems, Vol. 62, no. 11, pp. 4223-4253
Main Authors: Mandros, Panagiotis; Boley, Mario; Vreeken, Jilles
Format: Journal Article
Language: English
Published: London, Springer London, 01.11.2020
Springer Nature B.V.
ISSN: 0219-1377
0219-3116
DOI: 10.1007/s10115-020-01494-9

Summary: We consider the task of discovering functional dependencies in data for target attributes of interest. To solve it, we have to answer two questions: How do we quantify the dependency in a model-agnostic and interpretable way as well as reliably against sample size and dimensionality biases? How can we efficiently discover the exact or α-approximate top-k dependencies? We address the first question by adopting information-theoretic notions. Specifically, we consider the mutual information score, for which we propose a reliable estimator that enables robust optimization in high-dimensional data. To address the second question, we then systematically explore the algorithmic implications of using this measure for optimization. We show the problem is NP-hard and justify worst-case exponential-time as well as heuristic search methods. We propose two bounding functions for the estimator, which we use as pruning criteria in branch-and-bound search to efficiently mine dependencies with approximation guarantees. Empirical evaluation shows that the derived estimator has desirable statistical properties, the bounding functions lead to effective exact and greedy search algorithms, and when combined, qualitative experiments show the framework indeed discovers highly informative dependencies.