CodeDiffuSe: A masked diffusion framework for structure-aware code completion and repair

Bibliographic Details
Published in	Journal of King Saud University. Computer and information sciences Vol. 37; no. 8; pp. 230 - 25
Main Authors	Onan, Aytug, Alhumyani, Hesham A.
Format	Journal Article
Language	English
Published	Cham Springer International Publishing 01.10.2025 Springer Nature B.V Springer
Subjects	Ablation; Benchmarks; Code completion; Code repair; Computer Imaging; Computer Science; Database Management; Diffusion models; Error correction; Language; Large language models; Machine Learning; Masked diffusion models; Natural language; Original Paper; Pattern Recognition and Graphics; Programming languages; Regularization; Repair; Semantics; Software Engineering/Programming and Operating Systems; Software reliability; Structured code generation; Syntax; Syntax-aware masking; Systems and Data Security; Theory of Computation; Vision
ISSN	1319-1578, 2213-1248
DOI	10.1007/s44443-025-00237-6

Summary:	Code completion and code repair have become fundamental tasks in software engineering and machine learning research. However, existing large language models (LLMs) for code generation, predominantly based on autoregressive modeling (ARM), exhibit limitations when dealing with incomplete or buggy code snippets, especially when the missing or erroneous spans can occur anywhere in the sequence. In this paper, we propose CodeDiffuSe, a novel masked diffusion framework designed for structure-aware code completion and repair. Unlike traditional ARM approaches, CodeDiffuSe learns to recover missing code spans from both left and right context, enabling bidirectional reasoning and flexible infilling. We introduce a syntax-aware masking strategy that randomly masks entire Abstract Syntax Tree (AST) subtrees during training, and a semantic consistency regularization that encourages type-correct and syntactically valid predictions. During inference, we propose an error-aware remasking mechanism that dynamically identifies uncertain or invalid tokens and selectively refines them across adaptive reverse diffusion steps. Extensive experiments on standard code completion and bug repair benchmarks, including CodeXGLUE, Defects4J, and QuixBugs, demonstrate that CodeDiffuSe consistently outperforms strong autoregressive baselines such as CodeGen, CodeEditorBench, and InCoder across multiple programming languages. Our work introduces a structure- and semantics-aware diffusion-based alternative for code completion and repair, offering consistent gains in both syntactic validity and functional correctness across diverse benchmarks. Rather than claiming to replace existing paradigms, we demonstrate how diffusion models can complement and improve upon autoregressive and retrieval-based approaches in structure-sensitive settings.
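
The two mechanisms named in the summary, masking whole AST subtrees during training and remasking uncertain tokens during inference, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the mask token `<MASK>`, the function names, the ASCII-only source assumption, and the fixed confidence threshold are all hypothetical choices made for illustration, and the record does not describe how CodeDiffuSe selects subtrees or scores uncertainty.

```python
import ast
import random

MASK = "<MASK>"  # hypothetical mask token, not taken from the paper


def subtree_spans(source):
    """Return (start, end) character spans of every statement/expression subtree.

    Assumes ASCII source, so the byte-based column offsets reported by
    `ast` coincide with character offsets.
    """
    tree = ast.parse(source)
    # Character offset at which each line begins.
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    spans = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.expr, ast.stmt)):
            start = line_starts[node.lineno - 1] + node.col_offset
            end = line_starts[node.end_lineno - 1] + node.end_col_offset
            spans.append((start, end))
    return spans


def mask_random_subtree(source, rng):
    """Replace one randomly chosen AST subtree with MASK.

    Returns the masked source and the original text of the subtree,
    which would serve as the recovery target during training.
    """
    start, end = rng.choice(subtree_spans(source))
    return source[:start] + MASK + source[end:], source[start:end]


def remask_uncertain(tokens, confidences, threshold):
    """Remask every token whose confidence falls below the threshold.

    A stand-in for error-aware remasking: a real system would also
    flag tokens that make the program syntactically invalid.
    """
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(tokens, confidences)]


if __name__ == "__main__":
    code = "def add(a, b):\n    return a + b\n"
    masked, target = mask_random_subtree(code, random.Random(0))
    print(masked)
    print("target:", repr(target))
    print(remask_uncertain(["a", "+", "b"], [0.9, 0.2, 0.8], 0.5))
```

Masking a full subtree, rather than an arbitrary token run, keeps each training target a syntactically complete unit (an expression or statement), which is the property the summary refers to as structure-aware.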