Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding

Bibliographic Details
Published in 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 287 - 298
Main Authors Wang, Deze, Jia, Zhouyang, Li, Shanshan, Yu, Yue, Xiong, Yun, Dong, Wei, Liao, Xiangke
Format Conference Proceeding
Language English
Published ACM 01.05.2022
ISSN 1558-1225
DOI 10.1145/3510003.3510062

Abstract With the great success of pre-trained models, the pretrain-then-fine-tune paradigm has been widely adopted on downstream tasks for source code understanding. However, compared to costly training a large-scale model from scratch, how to effectively adapt pre-trained models to a new task has not been fully explored. In this paper, we propose an approach to bridge pre-trained models and code-related tasks. We exploit semantic-preserving transformations to enrich downstream data diversity, and help pre-trained models learn semantic features invariant to these semantically equivalent transformations. Further, we introduce curriculum learning to organize the transformed data in an easy-to-hard manner to fine-tune existing pre-trained models. We apply our approach to a range of pre-trained models, and they significantly outperform the state-of-the-art models on tasks for source code understanding, such as algorithm classification, code clone detection, and code search. Our experiments even show that, without heavy pre-training on code data, the natural language pre-trained model RoBERTa fine-tuned with our lightweight approach could outperform or rival existing code pre-trained models, such as CodeBERT and GraphCodeBERT, fine-tuned on the above tasks. This finding suggests that there is still much room for improvement in code pre-trained models.
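
The approach described in the abstract has two ingredients: semantic-preserving transformations that diversify the downstream training data, and curriculum learning that orders the transformed examples from easy to hard before fine-tuning. The snippet below is a minimal illustrative sketch of that idea, not the paper's implementation: it uses a single toy transformation (identifier renaming via Python's ast module), a made-up token-change ratio as the difficulty measure, and hypothetical helper names (rename_identifiers, difficulty, build_curriculum).

```python
# Minimal, illustrative sketch (not the paper's implementation): augment code
# snippets with one semantic-preserving transformation (identifier renaming)
# and order the resulting examples from easy to hard for curriculum-style
# fine-tuning. Helper names are hypothetical; requires Python 3.9+ (ast.unparse).
import ast
import builtins
import random


def rename_identifiers(source: str, rename_ratio: float = 0.5) -> str:
    """Return a semantically equivalent snippet with some identifiers renamed.

    Simplified: skips builtin names and ignores shadowing/collision corner cases.
    """
    tree = ast.parse(source)
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    names |= {n.arg for n in ast.walk(tree) if isinstance(n, ast.arg)}
    mapping = {
        name: f"var_{i}"
        for i, name in enumerate(sorted(names))
        if name not in vars(builtins) and random.random() < rename_ratio
    }

    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node):  # variable uses
            if node.id in mapping:
                node.id = mapping[node.id]
            return node

        def visit_arg(self, node):  # function parameters
            self.generic_visit(node)
            if node.arg in mapping:
                node.arg = mapping[node.arg]
            return node

    return ast.unparse(Renamer().visit(tree))


def difficulty(original: str, transformed: str) -> float:
    """Toy difficulty proxy: fraction of whitespace-separated tokens that changed."""
    a, b = original.split(), transformed.split()
    return sum(x != y for x, y in zip(a, b)) / max(len(a), 1)


def build_curriculum(snippets, variants_per_snippet=3):
    """Generate transformed variants and sort all examples from easy to hard."""
    scored = []
    for snippet in snippets:
        scored.append((0.0, snippet))  # the untouched snippet is the easiest case
        for _ in range(variants_per_snippet):
            variant = rename_identifiers(snippet)
            scored.append((difficulty(snippet, variant), variant))
    scored.sort(key=lambda pair: pair[0])
    return [code for _, code in scored]


if __name__ == "__main__":
    corpus = ["def add(x, y):\n    total = x + y\n    return total\n"]
    for example in build_curriculum(corpus):
        print(example)
        print("---")
```

In practice the ordered examples would be fed, easiest first, into an ordinary fine-tuning loop over a pre-trained encoder such as RoBERTa or CodeBERT; the paper's actual transformations and its easy-to-hard ordering are not necessarily the ones shown in this toy sketch.
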
Author Details
– Wang, Deze (wangdeze14@nudt.edu.cn), National University of Defense Technology, China
– Jia, Zhouyang (jiazhouyang@nudt.edu.cn), National University of Defense Technology, China
– Li, Shanshan (shanshanli@nudt.edu.cn), National University of Defense Technology, China
– Yu, Yue (yuyue@nudt.edu.cn), National University of Defense Technology, China
– Xiong, Yun (yunx@fudan.edu.cn), Fudan University, Shanghai, China
– Dong, Wei (wdong@nudt.edu.cn), National University of Defense Technology, China
– Liao, Xiangke (xkliao@nudt.edu.cn), National University of Defense Technology, China
CODEN IEEPAD
ContentType Conference Proceeding
Discipline Computer Science
EISBN 9781450392211
1450392210
EISSN 1558-1225
EndPage 298
ExternalDocumentID 9793959
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China
  grantid: 61690203, 61872373, 62032019, U1936213
  funderid: 10.13039/501100001809
PageCount 12
PublicationTitle 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)
PublicationTitleAbbrev ICSE
PublicationYear 2022
Publisher ACM
StartPage 287
SubjectTerms Adaptation models
Cloning
Codes
curriculum learning
data augmentation
Data models
fine-tuning
Natural languages
Semantics
test-time augmentation
Training
Title Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding
URI https://ieeexplore.ieee.org/document/9793959