CodeDiffuSe: A masked diffusion framework for structure-aware code completion and repair
| Published in | Journal of King Saud University. Computer and information sciences Vol. 37; no. 8; pp. 230 - 25 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published | Cham; Springer International Publishing; 01.10.2025; Springer Nature B.V; Springer |
| Subjects | |
| ISSN | 1319-1578, 2213-1248 |
| DOI | 10.1007/s44443-025-00237-6 |
| Summary: | Code completion and code repair have become fundamental tasks in software engineering and machine learning research. However, existing large language models (LLMs) for code generation, predominantly based on autoregressive modeling (ARM), exhibit limitations when dealing with incomplete or buggy code snippets, especially when the missing or erroneous spans are located arbitrarily within the sequence. In this paper, we propose CodeDiffuSe, a novel masked diffusion framework specifically designed for structure-aware code completion and repair. Unlike traditional ARM approaches, CodeDiffuSe learns to recover missing code spans by leveraging both left and right context, enabling bidirectional reasoning and flexible infilling. We introduce a syntax-aware masking strategy that randomly masks entire Abstract Syntax Tree (AST) subtrees during training, and a semantic consistency regularization that encourages type-correct and syntactically valid predictions. During inference, we propose an error-aware remasking mechanism that dynamically identifies uncertain or invalid tokens and selectively refines them across adaptive reverse diffusion steps. Extensive experiments on standard code completion and bug repair benchmarks, including CodeXGLUE, Defects4J, and QuixBugs, demonstrate that CodeDiffuSe consistently outperforms strong autoregressive baselines such as CodeGen, CodeEditorBench, and InCoder across multiple programming languages. Our work introduces a structure- and semantics-aware diffusion-based alternative for code completion and repair, offering consistent gains in both syntactic validity and functional correctness across diverse benchmarks. Rather than claiming to replace existing paradigms, we demonstrate how diffusion models can complement and improve upon autoregressive and retrieval-based approaches in structure-sensitive settings. |
|---|---|
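
The summary describes a syntax-aware masking strategy that masks entire AST subtrees during training. The record gives no implementation details, so the following is only a minimal sketch of the general idea, assuming Python source and the standard-library `ast` module. The `<MASK>` token, the function name `mask_random_subtree`, and uniform sampling over statement and expression nodes are illustrative assumptions, not details taken from the paper.

```python
import ast
import random
from typing import Tuple

MASK = "<MASK>"  # hypothetical mask token, chosen for illustration only


def mask_random_subtree(source: str, rng: random.Random) -> Tuple[str, str]:
    """Replace the source span of one randomly chosen AST node with MASK.

    Returns (masked_source, removed_span). Sampling is uniform over all
    statement and expression nodes, which is an assumption of this sketch.
    """
    tree = ast.parse(source)
    # Statements and expressions carry start/end positions, so each one
    # corresponds to a contiguous span of source covering its whole subtree.
    candidates = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.stmt, ast.expr))
        and getattr(node, "end_col_offset", None) is not None
    ]
    node = rng.choice(candidates)

    # ast column offsets are UTF-8 byte offsets, so slice the encoded text.
    data = source.encode("utf-8")
    lines = data.splitlines(keepends=True)
    start = sum(len(line) for line in lines[: node.lineno - 1]) + node.col_offset
    end = sum(len(line) for line in lines[: node.end_lineno - 1]) + node.end_col_offset

    removed = data[start:end].decode("utf-8")
    masked = (data[:start] + MASK.encode("utf-8") + data[end:]).decode("utf-8")
    return masked, removed


if __name__ == "__main__":
    src = "def add(a, b):\n    total = a + b\n    return total\n"
    masked, removed = mask_random_subtree(src, random.Random(0))
    print(masked)
    print("removed span:", repr(removed))
```

Because the masked span always covers a complete subtree, the model input never cuts a construct in half, and the removed span can serve as the infilling target given both left and right context.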