Lightning Fast Matching Dependency Discovery with Desbordante
| Published in | Proceedings of the XXth Conference of Open Innovations Association FRUCT, pp. 729-740 |
|---|---|
| Main Authors | , , , , | 
| Format | Conference Proceeding | 
| Language | English | 
| Published | FRUCT Oy, 30.10.2024 |
| ISSN | 2305-7254 | 
| DOI | 10.23919/FRUCT64283.2024.10749955 | 
| Summary | A matching dependency is a generalization of the functional dependency concept that allows users to apply custom similarity functions for matching individual attributes. Matching dependencies have a wide range of applications in solving data quality problems, such as entity resolution, data deduplication, data integration, schema matching, and many more. However, their discovery is a very computationally intensive problem, which limits their practical application. In this paper, we describe a number of optimization techniques for HyMD, currently the state-of-the-art algorithm for the discovery of matching dependencies. These optimizations belong to both technical and scientific domains. The most important of them are: 1) a new sampling technique, 2) a faster generalization lookup technique, and 3) an improved representation of a dependency. The first one aims to raise the efficiency of inference from record pairs, while the last two are designed to speed up lattice-related operations. To evaluate our optimizations, we implemented our version of HyMD in Desbordante, an open-source high-performance data profiler. Experiments demonstrated that the optimizations allow for an average speedup of more than 40x over the state-of-the-art implementation, reaching a speedup greater than 170x in some cases. Finally, the improved version of HyMD is ready for anyone to use. It comes with bidirectional Python integration, which allows calling the C++ algorithm implementation from Python programs while allowing users to supply their custom matching functions. |
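
To make the notion of a matching dependency concrete, the following is a minimal Python sketch that checks whether a single dependency holds on a small table. It assumes the usual reading of a matching dependency: whenever two records are similar enough on every left-hand-side attribute, they must also be similar enough on the right-hand-side attribute. The function names (`string_similarity`, `exact_match`, `md_violations`), the sample records, and the thresholds are all hypothetical and chosen for illustration. This is not the Desbordante API and not the HyMD algorithm; it is a naive pairwise validation, whereas HyMD discovers such dependencies and avoids exhaustive comparison.

```python
from difflib import SequenceMatcher
from itertools import combinations


def string_similarity(a, b):
    """Custom similarity function: case-insensitive ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_match(a, b):
    """Similarity that degenerates to equality, as in a functional dependency."""
    return 1.0 if a == b else 0.0


def md_violations(records, lhs, rhs):
    """Return index pairs of records that violate the matching dependency.

    lhs: list of (attribute, similarity function, threshold) conditions
    rhs: a single (attribute, similarity function, threshold) condition
    """
    rhs_attr, rhs_sim, rhs_threshold = rhs
    violations = []
    # Naive check over all record pairs (quadratic in the number of records).
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        lhs_matches = all(
            sim(r1[attr], r2[attr]) >= threshold for attr, sim, threshold in lhs
        )
        if lhs_matches and rhs_sim(r1[rhs_attr], r2[rhs_attr]) < rhs_threshold:
            violations.append((i, j))
    return violations


# Hypothetical sample data.
records = [
    {"name": "Jon Smith", "zip": "10001", "city": "New York"},
    {"name": "John Smith", "zip": "10001", "city": "New York"},
    {"name": "Jon Smith", "zip": "10001", "city": "Boston"},
    {"name": "Alice Brown", "zip": "94105", "city": "San Francisco"},
]

# "Similar names and equal zip codes imply equal cities."
lhs = [("name", string_similarity, 0.8), ("zip", exact_match, 1.0)]
rhs = ("city", exact_match, 1.0)

violations = md_violations(records, lhs, rhs)
print(violations)
```

With `exact_match` on every attribute, the check reduces to validating an ordinary functional dependency, which is the sense in which matching dependencies generalize functional dependencies.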