The pq-gram distance between ordered labeled trees
When integrating data from autonomous sources, exact matches of data items that represent the same real-world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the...
          | Published in | ACM transactions on database systems Vol. 35; no. 1; pp. 1 - 36 | 
|---|---|
| Main Authors | Augsten, Nikolaus; Böhlen, Michael; Gamper, Johann |
| Format | Journal Article | 
| Language | English | 
| Published | New York, NY: Association for Computing Machinery, 01.02.2010 |
| ISSN | 0362-5915 1557-4644 |
| DOI | 10.1145/1670243.1670247 | 
| Abstract | When integrating data from autonomous sources, exact matches of data items that represent the same real-world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the representations in the sources differ.
We propose pq-grams to approximately match hierarchical data from autonomous sources and define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the fanout weighted tree edit distance. We prove that the pq-gram distance is a lower bound of the fanout weighted tree edit distance and give a normalization of the pq-gram distance for which the triangle inequality holds. Experiments on synthetic and real-world data (residential addresses and XML) confirm the scalability of our approach and show the effectiveness of pq-grams. | 
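
For readers unfamiliar with the technique, the following is a minimal sketch of how a pq-gram profile and the resulting bag distance might be computed. It follows the general definition of pq-grams as commonly described in the literature, not anything stated in this record. The `Node` class, the function names, the defaults `p=2` and `q=3`, and the `*` dummy label are all assumptions made for illustration. The sketch covers only the unnormalized distance and omits the normalization and the fanout weighting mentioned in the abstract.

```python
from collections import Counter


class Node:
    """Hypothetical ordered labeled tree node (illustration only)."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []


def pq_gram_profile(root, p=2, q=3):
    """Return the bag (multiset) of pq-gram label tuples of a tree.

    Each pq-gram is a stem of p labels (the anchor node plus its p-1
    nearest ancestors) followed by a base of q labels (q consecutive
    children of the anchor). Missing positions are padded with '*'.
    """
    profile = Counter()

    def visit(node, stem):
        # Shift the current node's label into the stem register.
        stem = stem[1:] + (node.label,)
        base = ("*",) * q
        if not node.children:
            # A leaf contributes one pq-gram with an all-dummy base.
            profile[stem + base] += 1
            return
        for child in node.children:
            # Slide the window of q children one position to the right.
            base = base[1:] + (child.label,)
            profile[stem + base] += 1
            visit(child, stem)
        # Pad with q-1 dummy children after the last real child.
        for _ in range(q - 1):
            base = base[1:] + ("*",)
            profile[stem + base] += 1

    visit(root, ("*",) * p)
    return profile


def pq_gram_distance(t1, t2, p=2, q=3):
    """Unnormalized bag distance: |P1 (+) P2| - 2 * |P1 (bag-intersect) P2|."""
    p1 = pq_gram_profile(t1, p, q)
    p2 = pq_gram_profile(t2, p, q)
    bag_union_size = sum(p1.values()) + sum(p2.values())
    bag_intersection_size = sum((p1 & p2).values())
    return bag_union_size - 2 * bag_intersection_size


# Two small trees that differ in one leaf label.
t1 = Node("a", [Node("b"), Node("c")])
t2 = Node("a", [Node("b"), Node("d")])
print(pq_gram_distance(t1, t1))  # 0
print(pq_gram_distance(t1, t2))  # 8
```

Because each tree is reduced to a bag of fixed-size label tuples, comparing two trees needs only one traversal per tree plus a multiset intersection, which is presumably what makes the approach attractive as a cheap approximation of an edit distance.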
    
|---|---|
| Author | Augsten, Nikolaus Böhlen, Michael Gamper, Johann  | 
    
| Author_xml | – sequence: 1 givenname: Nikolaus surname: Augsten fullname: Augsten, Nikolaus organization: Free University of Bozen-Bolzano, Bozen-Bolzano, Italy – sequence: 2 givenname: Michael surname: Böhlen fullname: Böhlen, Michael organization: Free University of Bozen-Bolzano, Bozen-Bolzano, Italy – sequence: 3 givenname: Johann surname: Gamper fullname: Gamper, Johann organization: Free University of Bozen-Bolzano, Bozen-Bolzano, Italy  | 
    
| BackLink | http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=22484597$$DView record in Pascal Francis | 
    
| CODEN | ATDSD3 | 
    
| CitedBy_id | crossref_primary_10_1587_transinf_2015DAP0023 crossref_primary_10_1007_s10845_021_01854_4 crossref_primary_10_1145_2590989_2590994 crossref_primary_10_1016_j_eswa_2024_125973 crossref_primary_10_14778_2095686_2095692 crossref_primary_10_1007_s10472_015_9467_5 crossref_primary_10_1016_j_is_2015_08_004 crossref_primary_10_1016_j_ins_2014_06_007 crossref_primary_10_3103_S0278641915040056 crossref_primary_10_2200_S00544ED1V01Y201310DTM038 crossref_primary_10_14778_3565816_3565833 crossref_primary_10_4018_ijkbo_2013100105 crossref_primary_10_1145_2699485 crossref_primary_10_1007_s10115_012_0582_x crossref_primary_10_3390_data8090140 crossref_primary_10_1007_s00778_011_0254_6 crossref_primary_10_1007_s00778_012_0263_0 crossref_primary_10_1007_s00778_013_0306_1 crossref_primary_10_1016_j_knosys_2023_110565 crossref_primary_10_1109_TKDE_2010_245 crossref_primary_10_1007_s10115_014_0816_1 crossref_primary_10_14778_3231751_3231760 crossref_primary_10_14778_3654621_3654634  | 
    
| Cites_doi | 10.1016/0304-3975(95)80015-8 10.1007/11563983_17 10.1016/B978-155860920-4/50002-7 10.1145/375663.375722 10.1145/564691.564725 10.1016/j.datak.2007.05.008 10.1145/322139.322143 10.1145/1066157.1066243 10.1006/jagm.2001.1170 10.1016/j.is.2004.11.009 10.1016/0304-3975(92)90143-4 10.1145/1007568.1007599 10.1109/TKDE.2005.27 10.1145/1066157.1066207 10.1007/978-3-540-85713-6_18 10.1145/1061318.1061326 10.1137/0218082 10.1145/233269.233366 10.1145/564691.564705 10.1145/1007568.1007686 10.1145/872757.872832 10.1145/564691.564715 10.1109/SPIRE.2001.989761 10.1016/0020-0190(77)90064-3 10.1145/564691.564727 10.1145/375360.375365 10.1109/TKDE.2004.19 10.1007/978-3-540-30078-6_17 10.1016/0031-3203(94)00109-Y 10.1109/ICDE.2008.4497490 10.1007/11687238_46 10.1002/spe.4380210706  | 
    
| ContentType | Journal Article | 
    
| Copyright | 2015 INIST-CNRS | 
    
| Copyright_xml | – notice: 2015 INIST-CNRS | 
    
| DBID | AAYXX CITATION IQODW 7SC 8FD JQ2 L7M L~C L~D ADTOC UNPAY  | 
    
| DatabaseName | CrossRef Pascal-Francis Computer and Information Systems Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts  Academic Computer and Information Systems Abstracts Professional Unpaywall for CDI: Periodical Content Unpaywall  | 
    
| DatabaseTitle | CrossRef Computer and Information Systems Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Advanced Technologies Database with Aerospace ProQuest Computer Science Collection Computer and Information Systems Abstracts Professional  | 
    
| DatabaseTitleList | Computer and Information Systems Abstracts CrossRef Computer and Information Systems Abstracts  | 
    
| Database_xml | – sequence: 1 dbid: UNPAY name: Unpaywall url: https://proxy.k.utb.cz/login?url=https://unpaywall.org/ sourceTypes: Open Access Repository  | 
    
| DeliveryMethod | fulltext_linktorsrc | 
    
| Discipline | Sciences (General) Computer Science Applied Sciences  | 
    
| EISSN | 1557-4644 | 
    
| EndPage | 36 | 
    
| ExternalDocumentID | oai:www.zora.uzh.ch:44572 22484597 10_1145_1670243_1670247  | 
    
| IEDL.DBID | UNPAY | 
    
| IngestDate | Sun Oct 26 04:09:44 EDT 2025 Fri Jul 11 00:01:09 EDT 2025 Fri Jul 11 10:35:35 EDT 2025 Mon Jul 21 09:11:14 EDT 2025 Thu Apr 24 22:52:44 EDT 2025 Wed Oct 01 06:00:11 EDT 2025  | 
    
| IsDoiOpenAccess | true | 
    
| IsOpenAccess | true | 
    
| IsPeerReviewed | true | 
    
| IsScholarly | true | 
    
| Issue | 1 | 
    
| Keywords | Lower bound Input output distance metric similarity search Scalability XML language tree edit distance Tree structure Graph theory Weighted graph Algorithms approximate matching hierarchical data Database algorithms XML Database Ordering Experimentation Metric Graph labelling Performance Distance  | 
    
| License | CC BY 4.0 other-oa  | 
    
| LinkModel | DirectLink | 
    
| Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 ObjectType-Article-2 ObjectType-Feature-1  | 
    
| OpenAccessLink | https://proxy.k.utb.cz/login?url=https://www.zora.uzh.ch/id/eprint/44572/1/a4-augsten.pdf | 
    
| PQID | 1671419876 | 
    
| PQPubID | 23500 | 
    
| PageCount | 36 | 
    
| ParticipantIDs | unpaywall_primary_10_1145_1670243_1670247 proquest_miscellaneous_743767855 proquest_miscellaneous_1671419876 pascalfrancis_primary_22484597 crossref_citationtrail_10_1145_1670243_1670247 crossref_primary_10_1145_1670243_1670247  | 
    
| ProviderPackageCode | CITATION AAYXX  | 
    
| PublicationCentury | 2000 | 
    
| PublicationDate | 2010-02-01 | 
    
| PublicationDateYYYYMMDD | 2010-02-01 | 
    
| PublicationDate_xml | – month: 02 year: 2010 text: 2010-02-01 day: 01  | 
    
| PublicationDecade | 2010 | 
    
| PublicationPlace | New York, NY | 
    
| PublicationPlace_xml | – name: New York, NY | 
    
| PublicationTitle | ACM transactions on database systems | 
    
| PublicationYear | 2010 | 
    
| Publisher | Association for Computing Machinery | 
    
| Publisher_xml | – name: Association for Computing Machinery | 
    
| SSID | ssj0004897 | 
    
| Score | 2.188624 | 
    
| SourceID | unpaywall proquest pascalfrancis crossref  | 
    
| SourceType | Open Access Repository Aggregation Database Index Database Enrichment Source  | 
    
| StartPage | 1 | 
    
| SubjectTerms | Algorithmics. Computability. Computer arithmetics Applied sciences Approximation Autonomous Computer science; control theory; systems Computer systems and distributed systems. User interface Exact sciences and technology Information systems. Data bases Keys Matching Memory organisation. Data processing Representations Software Theoretical computing Trees  | 
    
| Title | The pq-gram distance between ordered labeled trees | 
    
| URI | https://www.proquest.com/docview/1671419876 https://www.proquest.com/docview/743767855 https://www.zora.uzh.ch/id/eprint/44572/1/a4-augsten.pdf  | 
    
| UnpaywallVersion | submittedVersion | 
    
| Volume | 35 | 
    
| hasFullText | 1 | 
    
| inHoldings | 1 | 
    