DiffSearch: A Scalable and Precise Search Engine for Code Changes

Bibliographic Details
Published in IEEE Transactions on Software Engineering Vol. 49; no. 4; pp. 1-16
Main Authors Grazia, Luca Di; Bredl, Paul; Pradel, Michael
Format Journal Article
Language English
Published New York IEEE 01.04.2023
IEEE Computer Society
ISSN 0098-5589 (print)
1939-3520 (electronic)
DOI 10.1109/TSE.2022.3218859

Abstract The source code of successful projects is evolving all the time, resulting in hundreds of thousands of code changes stored in source code repositories. This wealth of data can be useful, e.g., to find changes similar to a planned code change or examples of recurring code improvements. This paper presents DiffSearch, a search engine that, given a query that describes a code change, returns a set of changes that match the query. The approach is enabled by three key contributions. First, we present a query language that extends the underlying programming language with wildcards and placeholders, providing an intuitive way of formulating queries that is easy to adapt to different programming languages. Second, to ensure scalability, the approach indexes code changes in a one-time preprocessing step, mapping them into a feature space, and then performs an efficient search in the feature space for each query. Third, to guarantee precision, i.e., that any returned code change indeed matches the given query, we present a tree-based matching algorithm that checks whether a query can be expanded to a concrete code change. We present implementations for Java, JavaScript, and Python, and show that the approach responds within seconds to queries across one million code changes, has a recall of 80.7% for Java, 89.6% for Python, and 90.4% for JavaScript, enables users to find relevant code changes more effectively than a regular expression-based search and GitHub's search feature, and is helpful for gathering a large-scale dataset of real-world bug fixes.
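The two-phase design described in the abstract (a cheap feature-space lookup followed by an exact match check) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper's implementation: the toy placeholder `ID` (standing for any single identifier), the helper names `tokenize`, `features`, `side_matches` and `search`, and the use of flat token sequences in place of the tree-based matching the authors describe.

```python
import re
from collections import Counter

# Toy tokenizer: identifiers, integers, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT = re.compile(r"[A-Za-z_]\w*\Z")


def tokenize(code):
    return TOKEN.findall(code)


def features(old, new):
    """Map a change (old -> new) to a bag of side-tagged tokens.

    The hypothetical placeholder ID carries no information, so it is skipped.
    """
    feats = Counter()
    for side, code in (("-", old), ("+", new)):
        for tok in tokenize(code):
            if tok != "ID":
                feats[side + tok] += 1
    return feats


def side_matches(pattern, code):
    """Exact check: ID matches one identifier, every other token must be equal."""
    p, c = tokenize(pattern), tokenize(code)
    return len(p) == len(c) and all(
        (IDENT.match(ct) is not None) if pt == "ID" else pt == ct
        for pt, ct in zip(p, c)
    )


def search(index, query, k=10):
    q_old, q_new = query
    q_feats = features(q_old, q_new)
    # Phase 1: rank indexed changes by feature overlap and keep the top k.
    ranked = sorted(index, key=lambda e: -sum((q_feats & e[1]).values()))
    # Phase 2: keep only candidates that really match the query.
    return [chg for chg, _ in ranked[:k]
            if side_matches(q_old, chg[0]) and side_matches(q_new, chg[1])]


# One-time preprocessing: index each change together with its features.
changes = [
    ("if (x == null)", "if (x != null)"),
    ("return a + b;", "return a - b;"),
    ("if (item == null)", "if (item != null)"),
]
index = [(chg, features(*chg)) for chg in changes]
results = search(index, ("if (ID == null)", "if (ID != null)"))
```

In this sketch the second phase is what prevents false positives: the first phase may admit changes that merely share tokens with the query, and the exact check discards them. A change that truly matches can still be lost if it falls outside the top `k` candidates, which is one plausible reading of why the abstract reports recall below 100% while promising that every returned change matches.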
Author Grazia, Luca Di (ORCID 0000-0002-5306-8645); Department of Computer Science, University of Stuttgart, Germany
Bredl, Paul; Department of Computer Science, University of Stuttgart, Germany
Pradel, Michael; Department of Computer Science, University of Stuttgart, Germany
CODEN IESEDJ
ContentType Journal Article
Copyright Copyright IEEE Computer Society 2023
Discipline Computer Science
EISSN 1939-3520
Genre orig-research
IsPeerReviewed true
IsScholarly true
Issue 4
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
ORCID 0000-0002-5306-8645
0000-0003-1623-498X
PublicationDate 2023-04-01
PublicationTitle IEEE transactions on software engineering
PublicationTitleAbbrev TSE
PublicationYear 2023
SubjectTerms Algorithms
Codes
Database languages
Feature extraction
Indexing
Java
Program analysis
Python
Queries
Query languages
Search engines
Search problems
software engineering
software maintenance
Source code
URI https://ieeexplore.ieee.org/document/9935264
https://www.proquest.com/docview/2803047657