Lightning Fast Matching Dependency Discovery with Desbordante

Bibliographic Details
Published in: Proceedings of the XXth Conference of Open Innovations Association FRUCT, pp. 729-740
Main Authors: Shlyonskikh, Alexey; Sinelnikov, Michael; Nikolaev, Daniil; Litvinov, Yurii; Chernishev, George
Format: Conference Proceeding
Language: English
Published: FRUCT Oy, 30.10.2024
ISSN: 2305-7254
DOI: 10.23919/FRUCT64283.2024.10749955

Summary: A matching dependency is a generalization of the functional dependency concept that allows users to apply custom similarity functions for matching individual attributes. Matching dependencies have a wide range of applications in solving data quality problems, such as entity resolution, data deduplication, data integration, schema matching, and many more. However, their discovery is very computationally intensive, which limits their practical application.

In this paper, we describe a number of optimization techniques for HyMD, currently the state-of-the-art algorithm for the discovery of matching dependencies. These optimizations are both technical and scientific in nature. The most important of them are: 1) a new sampling technique, 2) a faster generalization lookup technique, and 3) an improved representation of a dependency. The first aims to raise the efficiency of inference from record pairs, while the last two are designed to speed up lattice-related operations.

To evaluate our optimizations, we implemented our version of HyMD in Desbordante, an open-source high-performance data profiler. Experiments demonstrated that the optimizations yield an average speedup of more than 40x over the state-of-the-art implementation, reaching a speedup greater than 170x in some cases.

Finally, the improved version of HyMD is ready for anyone to use. It comes with bidirectional Python integration, which allows calling the C++ algorithm implementation from Python programs while letting users supply their own custom matching functions.
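
To make the notion of a matching dependency concrete, the following is a minimal sketch of what it means for one to hold on a small table. It is a brute-force check over all record pairs, not HyMD and not Desbordante's API; the names `ratio` and `holds`, the sample records, and the thresholds are all hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; stands in for any user-supplied matching function."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def holds(records, lhs, rhs):
    """Check a matching dependency by brute force over all record pairs.

    lhs: list of (attribute, similarity function, threshold) conditions
    rhs: a single (attribute, similarity function, threshold) conclusion
    """
    rhs_attr, rhs_sim, rhs_min = rhs
    for r, s in combinations(records, 2):
        # The dependency only constrains pairs that match on every LHS attribute.
        lhs_matches = all(sim(r[attr], s[attr]) >= t for attr, sim, t in lhs)
        if lhs_matches and rhs_sim(r[rhs_attr], s[rhs_attr]) < rhs_min:
            return False  # similar on the LHS but not similar enough on the RHS
    return True


records = [
    {"name": "Jon Smith", "city": "Boston"},
    {"name": "John Smith", "city": "Boston"},
    {"name": "Mary Jones", "city": "Denver"},
]

# "If two names are at least 0.9 similar, the cities must be identical."
md = dict(lhs=[("name", ratio, 0.9)], rhs=("city", ratio, 1.0))
md_holds = holds(records, **md)

# Changing one city breaks the dependency for the two near-duplicate names.
dirty = [dict(records[0]), {"name": "John Smith", "city": "Austin"}, dict(records[2])]
md_holds_dirty = holds(dirty, **md)

print(md_holds, md_holds_dirty)
```

With exact equality as the similarity function and a threshold of 1.0 on every attribute, this check reduces to an ordinary functional dependency check. The quadratic pass over record pairs also hints at why discovery, which must search over many attribute and threshold combinations, is computationally intensive.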