Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding

Bibliographic Details
Published in 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 287 - 298
Main Authors Wang, Deze, Jia, Zhouyang, Li, Shanshan, Yu, Yue, Xiong, Yun, Dong, Wei, Liao, Xiangke
Format Conference Proceeding
Language English
Published ACM 01.05.2022
ISSN 1558-1225
DOI 10.1145/3510003.3510062

Abstract With the great success of pre-trained models, the pretrain-then-fine-tune paradigm has been widely adopted on downstream tasks for source code understanding. However, compared to costly training a large-scale model from scratch, how to effectively adapt pre-trained models to a new task has not been fully explored. In this paper, we propose an approach to bridge pre-trained models and code-related tasks. We exploit semantic-preserving transformations to enrich downstream data diversity, and help pre-trained models learn semantic features invariant to these semantically equivalent transformations. Further, we introduce curriculum learning to organize the transformed data in an easy-to-hard manner to fine-tune existing pre-trained models. We apply our approach to a range of pre-trained models, and they significantly outperform the state-of-the-art models on tasks for source code understanding, such as algorithm classification, code clone detection, and code search. Our experiments even show that, without heavy pre-training on code data, the natural language pre-trained model RoBERTa fine-tuned with our lightweight approach could outperform or rival existing code pre-trained models, such as CodeBERT and GraphCodeBERT, fine-tuned on the above tasks. This finding suggests that there is still much room for improvement in code pre-trained models.
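
The approach described in the abstract has two ingredients: semantic-preserving transformations that diversify the downstream training data, and curriculum learning that orders the transformed examples from easy to hard before fine-tuning. The snippet below is a minimal illustrative sketch of that idea, not the paper's implementation: it uses a single toy transformation (identifier renaming via Python's ast module), a made-up token-change ratio as the difficulty measure, and hypothetical helper names (rename_identifiers, difficulty, build_curriculum).

```python
# Minimal, illustrative sketch (not the paper's implementation): augment code
# snippets with one semantic-preserving transformation (identifier renaming)
# and order the resulting examples from easy to hard for curriculum-style
# fine-tuning. Helper names are hypothetical; requires Python 3.9+ (ast.unparse).
import ast
import builtins
import random


def rename_identifiers(source: str, rename_ratio: float = 0.5) -> str:
    """Return a semantically equivalent snippet with some identifiers renamed.

    Simplified: skips builtin names and ignores shadowing/collision corner cases.
    """
    tree = ast.parse(source)
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    names |= {n.arg for n in ast.walk(tree) if isinstance(n, ast.arg)}
    mapping = {
        name: f"var_{i}"
        for i, name in enumerate(sorted(names))
        if name not in vars(builtins) and random.random() < rename_ratio
    }

    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node):  # variable uses
            if node.id in mapping:
                node.id = mapping[node.id]
            return node

        def visit_arg(self, node):  # function parameters
            self.generic_visit(node)
            if node.arg in mapping:
                node.arg = mapping[node.arg]
            return node

    return ast.unparse(Renamer().visit(tree))


def difficulty(original: str, transformed: str) -> float:
    """Toy difficulty proxy: fraction of whitespace-separated tokens that changed."""
    a, b = original.split(), transformed.split()
    return sum(x != y for x, y in zip(a, b)) / max(len(a), 1)


def build_curriculum(snippets, variants_per_snippet=3):
    """Generate transformed variants and sort all examples from easy to hard."""
    scored = []
    for snippet in snippets:
        scored.append((0.0, snippet))  # the untouched snippet is the easiest case
        for _ in range(variants_per_snippet):
            variant = rename_identifiers(snippet)
            scored.append((difficulty(snippet, variant), variant))
    scored.sort(key=lambda pair: pair[0])
    return [code for _, code in scored]


if __name__ == "__main__":
    corpus = ["def add(x, y):\n    total = x + y\n    return total\n"]
    for example in build_curriculum(corpus):
        print(example)
        print("---")
```

In practice the ordered examples would be fed, easiest first, into an ordinary fine-tuning loop over a pre-trained encoder such as RoBERTa or CodeBERT; the paper's actual transformations and its easy-to-hard ordering are not necessarily the ones shown in this toy sketch.
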
Author Details
– Wang, Deze (wangdeze14@nudt.edu.cn), National University of Defense Technology, China
– Jia, Zhouyang (jiazhouyang@nudt.edu.cn), National University of Defense Technology, China
– Li, Shanshan (shanshanli@nudt.edu.cn), National University of Defense Technology, China
– Yu, Yue (yuyue@nudt.edu.cn), National University of Defense Technology, China
– Xiong, Yun (yunx@fudan.edu.cn), Fudan University, Shanghai, China
– Dong, Wei (wdong@nudt.edu.cn), National University of Defense Technology, China
– Liao, Xiangke (xkliao@nudt.edu.cn), National University of Defense Technology, China
CODEN IEEPAD
ContentType Conference Proceeding
Discipline Computer Science
EISBN 9781450392211
1450392210
EISSN 1558-1225
EndPage 298
ExternalDocumentID 9793959
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China
  grantid: 61690203, 61872373, 62032019, U1936213
  funderid: 10.13039/501100001809
PageCount 12
PublicationTitle 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)
PublicationTitleAbbrev ICSE
PublicationYear 2022
Publisher ACM
StartPage 287
SubjectTerms Adaptation models
Cloning
Codes
curriculum learning
data augmentation
Data models
fine-tuning
Natural languages
Semantics
test-time augmentation
Training
Title Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding
URI https://ieeexplore.ieee.org/document/9793959