Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding

Bibliographic Details
Published in: 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 287-298
Main Authors: Wang, Deze; Jia, Zhouyang; Li, Shanshan; Yu, Yue; Xiong, Yun; Dong, Wei; Liao, Xiangke
Format: Conference Proceeding
Language: English
Published: ACM, 01.05.2022
ISSN: 1558-1225
DOI: 10.1145/3510003.3510062

Summary: With the great success of pre-trained models, the pretrain-then-fine-tune paradigm has been widely adopted on downstream tasks for source code understanding. However, compared to the costly training of a large-scale model from scratch, how to effectively adapt pre-trained models to a new task has not been fully explored. In this paper, we propose an approach to bridge pre-trained models and code-related tasks. We exploit semantic-preserving transformations to enrich downstream data diversity and help pre-trained models learn semantic features that are invariant to these semantically equivalent transformations. Further, we introduce curriculum learning to organize the transformed data in an easy-to-hard manner to fine-tune existing pre-trained models. We apply our approach to a range of pre-trained models, and they significantly outperform the state-of-the-art models on tasks for source code understanding, such as algorithm classification, code clone detection, and code search. Our experiments even show that, without heavy pre-training on code data, the natural language pre-trained model RoBERTa fine-tuned with our lightweight approach can outperform or rival existing code pre-trained models, such as CodeBERT and GraphCodeBERT, fine-tuned on the above tasks. This finding suggests that there is still much room for improvement in code pre-trained models.
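
The summary describes two concrete ingredients: a semantic-preserving transformation that enriches the downstream data, and a curriculum that presents the transformed examples from easy to hard during fine-tuning. The sketch below is not the authors' implementation; it is a minimal Python illustration under the assumption that identifier renaming serves as the transformation and that the number of transformation rounds stands in for difficulty. The names Example, rename_identifiers, augment, and curriculum_order are hypothetical and introduced here for illustration only.

import keyword
import random
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    code: str        # source snippet fed to the model
    label: int       # downstream task label (e.g., algorithm class)
    difficulty: int  # curriculum key: number of transformation rounds applied


def rename_identifiers(code: str, rng: random.Random) -> str:
    # Semantic-preserving transform (deliberately simplified: ignores scoping
    # and builtins): consistently rename lower-case identifiers. Renaming keeps
    # the program's behavior and label while diversifying the surface form.
    names = sorted(set(re.findall(r"\b[a-z_][a-z0-9_]*\b", code)) - set(keyword.kwlist))
    mapping = {n: f"var_{i}" for i, n in enumerate(names) if rng.random() < 0.5}
    return re.sub(r"\b[a-z_][a-z0-9_]*\b",
                  lambda m: mapping.get(m.group(0), m.group(0)), code)


def augment(dataset: List[Example], rounds: int, seed: int = 0) -> List[Example]:
    # Apply the transform repeatedly; each round yields a new, "harder" example.
    rng = random.Random(seed)
    out = list(dataset)
    for ex in dataset:
        code = ex.code
        for k in range(1, rounds + 1):
            code = rename_identifiers(code, rng)
            out.append(Example(code=code, label=ex.label, difficulty=k))
    return out


def curriculum_order(dataset: List[Example]) -> List[Example]:
    # Easy-to-hard schedule: original code first, heavily transformed code last.
    return sorted(dataset, key=lambda ex: ex.difficulty)


if __name__ == "__main__":
    seed_data = [Example(code="def add(a, b):\n    return a + b", label=0, difficulty=0)]
    for ex in curriculum_order(augment(seed_data, rounds=2)):
        print(ex.difficulty, repr(ex.code))

In practice, the ordered examples would be fed batch by batch to a pre-trained encoder such as RoBERTa, CodeBERT, or GraphCodeBERT during fine-tuning, so that the model first sees untouched code and only later its transformed variants.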