Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding
With the great success of pre-trained models, the pretrain-then-fine-tune paradigm has been widely adopted on downstream tasks for source code understanding. However, compared to costly training a large-scale model from scratch, how to effectively adapt pre-trained models to a new task has not been...
Published in | 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE) pp. 287 - 298 |
---|---|
Main Authors | Wang, Deze; Jia, Zhouyang; Li, Shanshan; Yu, Yue; Xiong, Yun; Dong, Wei; Liao, Xiangke |
Format | Conference Proceeding |
Language | English |
Published | ACM, 01.05.2022 |
Subjects | Adaptation models; Cloning; Codes; curriculum learning; data augmentation; Data models; fine-tuning; Natural languages; Semantics; test-time augmentation; Training |
Online Access | https://ieeexplore.ieee.org/document/9793959 |
ISSN | 1558-1225 |
DOI | 10.1145/3510003.3510062 |
Abstract | With the great success of pre-trained models, the pretrain-then-fine-tune paradigm has been widely adopted on downstream tasks for source code understanding. However, compared to costly training a large-scale model from scratch, how to effectively adapt pre-trained models to a new task has not been fully explored. In this paper, we propose an approach to bridge pre-trained models and code-related tasks. We exploit semantic-preserving transformation to enrich downstream data diversity, and help pre-trained models learn semantic features invariant to these semantically equivalent transformations. Further, we introduce curriculum learning to organize the transformed data in an easy-to-hard manner to fine-tune existing pre-trained models. We apply our approach to a range of pre-trained models, and they significantly outperform the state-of-the-art models on tasks for source code understanding, such as algorithm classification, code clone detection, and code search. Our experiments even show that without heavy pre-training on code data, the natural language pre-trained model RoBERTa fine-tuned with our lightweight approach could outperform or rival existing code pre-trained models fine-tuned on the above tasks, such as CodeBERT and GraphCodeBERT. This finding suggests that there is still much room for improvement in code pre-trained models. |
Author Details | Deze Wang, Zhouyang Jia, Shanshan Li, Yue Yu, Wei Dong, Xiangke Liao (National University of Defense Technology, China); Yun Xiong (Fudan University, Shanghai, China) |
eISBN | 9781450392211; 1450392210 |
Funding | National Natural Science Foundation of China (grants 61690203, 61872373, 62032019, U1936213) |
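The abstract above describes a two-part fine-tuning recipe: augment downstream data with semantic-preserving code transformations, then feed the augmented samples to a pre-trained model in an easy-to-hard (curriculum) order. The Python sketch below is only a minimal illustration of that idea; the identifier-renaming transform, the token-divergence difficulty proxy, and all function names are assumptions made here, not the authors' released implementation.

```python
import random
import re
from typing import Iterator, List, Tuple

# Keywords that a renaming transform must leave untouched (illustrative only).
PY_KEYWORDS = {"def", "return", "if", "else", "elif", "for", "while", "in", "not", "and", "or"}

def rename_identifiers(code: str, seed: int = 0) -> str:
    """Crude semantic-preserving transform: consistently rename lowercase
    identifiers (a stand-in for the paper's transformation suite)."""
    rng = random.Random(seed)
    names = sorted(set(re.findall(r"\b[a-z_][a-z0-9_]*\b", code)) - PY_KEYWORDS)
    mapping = {name: f"v{rng.randrange(10_000)}" for name in names}
    return re.sub(r"\b[a-z_][a-z0-9_]*\b",
                  lambda m: mapping.get(m.group(0), m.group(0)), code)

def difficulty(original: str, transformed: str) -> float:
    """Assumed difficulty proxy: fraction of whitespace-separated tokens
    changed by the transformation (more divergence = harder sample)."""
    orig_toks, new_toks = original.split(), transformed.split()
    changed = sum(a != b for a, b in zip(orig_toks, new_toks))
    return changed / max(len(orig_toks), 1)

def curriculum(dataset: List[Tuple[str, int]], variants: int = 3) -> Iterator[Tuple[str, int]]:
    """Augment every (code, label) pair and yield samples easy-to-hard,
    ready to be consumed by an ordinary fine-tuning loop."""
    augmented = []
    for code, label in dataset:
        augmented.append((0.0, code, label))  # the original sample is "easiest"
        for seed in range(1, variants + 1):
            aug = rename_identifiers(code, seed=seed)
            augmented.append((difficulty(code, aug), aug, label))
    for _, code, label in sorted(augmented, key=lambda item: item[0]):
        yield code, label

if __name__ == "__main__":
    toy = [("def add(a, b): return a + b", 0)]
    for code, label in curriculum(toy):
        print(label, code)
```

Replacing `rename_identifiers` with richer semantic-preserving transformations and plugging the ordered stream into a RoBERTa, CodeBERT, or GraphCodeBERT fine-tuning loop would approximate, at a very high level, the setup the abstract evaluates.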