DiffSearch: A Scalable and Precise Search Engine for Code Changes
The source code of successful projects is evolving all the time, resulting in hundreds of thousands of code changes stored in source code repositories. This wealth of data can be useful, e.g., to find changes similar to a planned code change or examples of recurring code improvements. This paper pre...
| Published in | IEEE transactions on software engineering Vol. 49; no. 4; pp. 1 - 16 | 
|---|---|
| Main Authors | Grazia, Luca Di; Bredl, Paul; Pradel, Michael | 
| Format | Journal Article | 
| Language | English | 
| Published | New York: IEEE; IEEE Computer Society, 01.04.2023 | 
| ISSN | 0098-5589 (print), 1939-3520 (electronic) | 
| DOI | 10.1109/TSE.2022.3218859 | 
| Abstract | The source code of successful projects is evolving all the time, resulting in hundreds of thousands of code changes stored in source code repositories. This wealth of data can be useful, e.g., to find changes similar to a planned code change or examples of recurring code improvements. This paper presents DiffSearch, a search engine that, given a query that describes a code change, returns a set of changes that match the query. The approach is enabled by three key contributions. First, we present a query language that extends the underlying programming language with wildcards and placeholders, providing an intuitive way of formulating queries that is easy to adapt to different programming languages. Second, to ensure scalability, the approach indexes code changes in a one-time preprocessing step, mapping them into a feature space, and then performs an efficient search in the feature space for each query. Third, to guarantee precision, i.e., that any returned code change indeed matches the given query, we present a tree-based matching algorithm that checks whether a query can be expanded to a concrete code change. We present implementations for Java, JavaScript, and Python, and show that the approach responds within seconds to queries across one million code changes, has a recall of 80.7% for Java, 89.6% for Python, and 90.4% for JavaScript, enables users to find relevant code changes more effectively than a regular expression-based search and GitHub's search feature, and is helpful for gathering a large-scale dataset of real-world bug fixes. | 
    
    
| Author | Grazia, Luca Di; Pradel, Michael; Bredl, Paul | 
    
| Author_xml | 1. Grazia, Luca Di (ORCID 0000-0002-5306-8645), Department of Computer Science, University of Stuttgart, Germany; 2. Bredl, Paul, Department of Computer Science, University of Stuttgart, Germany; 3. Pradel, Michael, Department of Computer Science, University of Stuttgart, Germany | 
    
    
| CODEN | IESEDJ | 
    
    
    
| Copyright | Copyright IEEE Computer Society 2023 | 
    
    
| Discipline | Computer Science | 
    
| EISSN | 1939-3520 | 
    
    
| Genre | orig-research | 
    
    
| IsPeerReviewed | true | 
    
| IsScholarly | true | 
    
    
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037  | 
    
    
| ORCID | 0000-0002-5306-8645 0000-0003-1623-498X  | 
    
    
| PublicationTitleAbbrev | TSE | 
    
    
| SubjectTerms | Algorithms; Codes; Database languages; Feature extraction; Indexing; Java; Program analysis; Python; Queries; Query languages; Search engines; Search problems; software engineering; software maintenance; Source code | 
    
    
| URI | https://ieeexplore.ieee.org/document/9935264 https://www.proquest.com/docview/2803047657  | 
    
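
The abstract describes a tree-based matching step that checks whether a query with wildcards and placeholders can be expanded to a concrete code change. The following is only a minimal sketch of that idea under stated assumptions, not the authors' algorithm or query syntax: trees are nested tuples of the form `(label, child, ...)`, the `$name` placeholder convention, the `<...>` wildcard marker, and the names `match`, `match_children`, and `matches_change` are all hypothetical and chosen for illustration.

```python
from typing import Dict, Iterator, Optional, Tuple, Union

# A tree is either a leaf token (str) or a tuple (label, child, child, ...).
Tree = Union[str, Tuple["Tree", ...]]
Bindings = Dict[str, Tree]

WILDCARD = "<...>"  # hypothetical marker: absorbs zero or more sibling subtrees


def is_placeholder(node: Tree) -> bool:
    """Hypothetical convention: leaves such as '$f' are named placeholders."""
    return isinstance(node, str) and node.startswith("$")


def match(query: Tree, code: Tree, bindings: Bindings) -> Iterator[Bindings]:
    """Yield every extension of `bindings` under which `query` expands to `code`."""
    if is_placeholder(query):
        if query not in bindings:
            yield {**bindings, query: code}      # first use: bind the subtree
        elif bindings[query] == code:
            yield bindings                       # later use: must agree
    elif isinstance(query, str) or isinstance(code, str):
        if query == code:                        # concrete leaves must be equal
            yield bindings
    elif query[0] == code[0]:                    # inner nodes: same label
        yield from match_children(query[1:], code[1:], bindings)


def match_children(qs, cs, bindings: Bindings) -> Iterator[Bindings]:
    """Align query children with code children, backtracking over wildcards."""
    if not qs:
        if not cs:
            yield bindings
    elif qs[0] == WILDCARD:
        for k in range(len(cs) + 1):             # wildcard absorbs k siblings
            yield from match_children(qs[1:], cs[k:], bindings)
    elif cs:
        for b in match(qs[0], cs[0], bindings):
            yield from match_children(qs[1:], cs[1:], b)


def matches_change(query_change, code_change) -> Optional[Bindings]:
    """Match an (old, new) query pair against an (old, new) code change."""
    q_old, q_new = query_change
    c_old, c_new = code_change
    for b in match(q_old, c_old, {}):
        for result in match(q_new, c_new, b):    # bindings shared across sides
            return result
    return None


# Query: "some call gains a trailing argument 0, with the callee unchanged".
query = (("call", "$f", WILDCARD), ("call", "$f", WILDCARD, "0"))
good_change = (("call", "foo", "x"), ("call", "foo", "x", "0"))
bad_change = (("call", "foo", "x"), ("call", "bar", "x", "0"))
```

In this sketch the bindings found on the old side are carried over to the new side, so a placeholder that appears on both sides must stand for the same subtree in both; that is why `bad_change`, which renames the callee, is rejected.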