The pq-gram distance between ordered labeled trees

Bibliographic Details
Published in ACM Transactions on Database Systems, Vol. 35, no. 1, pp. 1-36
Main Authors Augsten, Nikolaus; Böhlen, Michael; Gamper, Johann
Format Journal Article
Language English
Published New York, NY: Association for Computing Machinery, 01.02.2010
ISSN 0362-5915 (print), 1557-4644 (electronic)
DOI 10.1145/1670243.1670247

Abstract When integrating data from autonomous sources, exact matches of data items that represent the same real-world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the representations in the sources differ. We propose pq-grams to approximately match hierarchical data from autonomous sources and define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the fanout weighted tree edit distance. We prove that the pq-gram distance is a lower bound of the fanout weighted tree edit distance and give a normalization of the pq-gram distance for which the triangle inequality holds. Experiments on synthetic and real-world data (residential addresses and XML) confirm the scalability of our approach and show the effectiveness of pq-grams.
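Illustrative sketch (not part of the catalog record): the abstract does not spell out how pq-grams are built, so the following Python is only a minimal sketch of the construction as it is commonly described. It assumes that each tree is conceptually extended with null nodes (written "*"), that a pq-gram is a tuple of p ancestor labels followed by q consecutive child labels, and that the unnormalized distance is the size of the bag union of the two profiles minus twice the size of their bag intersection. The tree encoding `(label, [children])`, the function names, and the defaults p=2, q=3 are assumptions made here for illustration; the normalization mentioned in the abstract is not implemented, and the article itself remains the authority on the exact definitions.

```python
from collections import Counter

NULL = "*"  # placeholder label for the null nodes of the extended tree


def pq_gram_profile(tree, p=2, q=3):
    """Return the list (bag) of pq-grams of an ordered labeled tree.

    A tree is a pair (label, children), where children is a list of trees.
    """
    profile = []

    def visit(node, ancestors):
        label, children = node
        # Shift the ancestor register: drop the oldest, append this node.
        stem = ancestors[1:] + [label]
        base = [NULL] * q
        if not children:
            # A leaf contributes one pq-gram with q null children.
            profile.append(tuple(stem + base))
            return
        for child in children:
            # Slide the window of q siblings one step to the right.
            base = base[1:] + [child[0]]
            profile.append(tuple(stem + base))
            visit(child, stem)
        # Pad with q-1 null siblings after the last child.
        for _ in range(q - 1):
            base = base[1:] + [NULL]
            profile.append(tuple(stem + base))

    visit(tree, [NULL] * p)
    return profile


def pq_gram_distance(t1, t2, p=2, q=3):
    """Unnormalized bag distance between the pq-gram profiles of two trees."""
    c1 = Counter(pq_gram_profile(t1, p, q))
    c2 = Counter(pq_gram_profile(t2, p, q))
    shared = sum((c1 & c2).values())          # size of the bag intersection
    total = sum(c1.values()) + sum(c2.values())  # size of the bag union
    return total - 2 * shared
```

For example, under these assumptions the tree `("a", [("b", []), ("c", [])])` yields six pq-grams for p=2, q=3, and identical trees are at distance 0.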
Author affiliations
Augsten, Nikolaus: Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
Böhlen, Michael: Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
Gamper, Johann: Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
CODEN ATDSD3
Copyright 2015 INIST-CNRS
Discipline Sciences (General)
Computer Science
Applied Sciences
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 1
Keywords Lower bound
Input output
distance metric
similarity search
Scalability
XML language
tree edit distance
Tree structure
Graph theory
Weighted graph
Algorithms
approximate matching
hierarchical data
Database algorithms
XML
Database
Ordering
Experimentation
Metric
Graph labelling
Performance
Distance
License CC BY 4.0
other-oa
OpenAccessLink https://proxy.k.utb.cz/login?url=https://www.zora.uzh.ch/id/eprint/44572/1/a4-augsten.pdf
SubjectTerms Algorithmics. Computability. Computer arithmetics
Applied sciences
Approximation
Autonomous
Computer science; control theory; systems
Computer systems and distributed systems. User interface
Exact sciences and technology
Information systems. Data bases
Keys
Matching
Memory organisation. Data processing
Representations
Software
Theoretical computing
Trees