The pq-gram distance between ordered labeled trees
| Published in | ACM transactions on database systems Vol. 35; no. 1; pp. 1 - 36 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | New York, NY: Association for Computing Machinery, 01.02.2010 |
| ISSN | 0362-5915, 1557-4644 |
| DOI | 10.1145/1670243.1670247 |
| Summary: | When integrating data from autonomous sources, exact matches of data items that represent the same real-world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the representations in the sources differ.
We propose pq-grams to approximately match hierarchical data from autonomous sources and define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the fanout weighted tree edit distance. We prove that the pq-gram distance is a lower bound of the fanout weighted tree edit distance and give a normalization of the pq-gram distance for which the triangle inequality holds. Experiments on synthetic and real-world data (residential addresses and XML) confirm the scalability of our approach and show the effectiveness of pq-grams. |
|---|---|
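
The summary names the pq-gram distance but this record does not define it. The following Python sketch illustrates one common reading of the idea and rests on several assumptions that do not come from the record. A pq-gram is taken to be a stem of p nodes (a node and its nearest ancestors) joined with a base of q consecutive children, and missing nodes are padded with a dummy label `*`. The distance is taken to be the bag symmetric difference of the two profiles, and the normalized form is taken to divide that by the size of the bag union minus the bag intersection. The `Node` class, the function names, and the defaults `p=2`, `q=3` are hypothetical choices for illustration.

```python
from collections import Counter

NULL = "*"  # assumed dummy label for padding nodes


class Node:
    """Minimal ordered labeled tree node (hypothetical helper)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def pq_gram_profile(root, p=2, q=3):
    """Return the bag of pq-grams of a tree as a Counter of label tuples."""
    profile = Counter()

    def visit(node, ancestors):
        # Stem: this node preceded by its p-1 nearest ancestors (NULL-padded).
        stem = ancestors[1:] + (node.label,)
        # Base: sliding window of q consecutive children, starting all NULL.
        base = (NULL,) * q
        if not node.children:
            profile[stem + base] += 1  # a leaf yields one all-NULL base
            return
        for child in node.children:
            base = base[1:] + (child.label,)
            profile[stem + base] += 1
            visit(child, stem)
        # Slide the window past the last child with q-1 NULL children.
        for _ in range(q - 1):
            base = base[1:] + (NULL,)
            profile[stem + base] += 1

    visit(root, (NULL,) * p)
    return profile


def pq_gram_distance(t1, t2, p=2, q=3):
    """Bag symmetric difference of the two profiles (assumed definition)."""
    a, b = pq_gram_profile(t1, p, q), pq_gram_profile(t2, p, q)
    shared = sum((a & b).values())  # bag intersection size
    total = sum(a.values()) + sum(b.values())
    return total - 2 * shared


def normalized_pq_gram_distance(t1, t2, p=2, q=3):
    """Distance scaled to [0, 1] (assumed normalization)."""
    a, b = pq_gram_profile(t1, p, q), pq_gram_profile(t2, p, q)
    shared = sum((a & b).values())
    total = sum(a.values()) + sum(b.values())
    return (total - 2 * shared) / (total - shared)
```

For example, comparing `Node("a", [Node("b"), Node("c")])` with `Node("a", [Node("b"), Node("d")])` under these assumptions gives profiles of six pq-grams each, two of which are shared. Each profile is built in a single traversal of its tree, which is in line with the scalability the summary reports.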