CodeDiffuSe: A masked diffusion framework for structure-aware code completion and repair

Bibliographic Details
Published in Journal of King Saud University. Computer and information sciences Vol. 37; no. 8; pp. 230 - 25
Main Authors Onan, Aytug; Alhumyani, Hesham A.
Format Journal Article
Language English
Published Cham Springer International Publishing 01.10.2025
Springer Nature B.V
Springer
ISSN 1319-1578
2213-1248
DOI 10.1007/s44443-025-00237-6

Summary: Code completion and code repair have become fundamental tasks in software engineering and machine learning research. However, existing large language models (LLMs) for code generation, predominantly based on autoregressive modeling (ARM), exhibit limitations when dealing with incomplete or buggy code snippets, especially when the missing or erroneous spans are located arbitrarily within the sequence. In this paper, we propose CodeDiffuSe, a novel masked diffusion framework specifically designed for structure-aware code completion and repair. Unlike traditional ARM approaches, CodeDiffuSe learns to recover missing code spans by leveraging both left and right context, enabling bidirectional reasoning and flexible infilling. We introduce a syntax-aware masking strategy that randomly masks entire Abstract Syntax Tree (AST) subtrees during training, and a semantic consistency regularization that encourages type-correct and syntactically valid predictions. During inference, we propose an error-aware remasking mechanism that dynamically identifies uncertain or invalid tokens and selectively refines them across adaptive reverse diffusion steps. Extensive experiments on standard code completion and bug repair benchmarks, including CodeXGLUE, Defects4J, and QuixBugs, demonstrate that CodeDiffuSe consistently outperforms strong autoregressive baselines such as CodeGen, CodeEditorBench, and InCoder across multiple programming languages. Our work introduces a structure- and semantics-aware diffusion-based alternative for code completion and repair, offering consistent gains in both syntactic validity and functional correctness across diverse benchmarks. Rather than claiming to replace existing paradigms, we demonstrate how diffusion models can complement and improve upon autoregressive and retrieval-based approaches in structure-sensitive settings.
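The syntax-aware masking idea in the summary (masking whole AST subtrees rather than arbitrary token spans) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes Python source as input, uses the standard-library `ast` module to locate subtree spans, and the `MASK` token and the function name `mask_random_subtree` are hypothetical names introduced only for this example.

```python
import ast
import random

MASK = "<MASK>"  # hypothetical mask token, not taken from the paper


def mask_random_subtree(source: str, rng: random.Random) -> tuple[str, str]:
    """Replace the span of one randomly chosen AST subtree with MASK.

    Returns (masked_source, removed_span). Illustrative sketch only.
    """
    tree = ast.parse(source)
    # Candidate subtrees: statement and expression nodes with position info.
    candidates = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.expr, ast.stmt)) and hasattr(node, "end_lineno")
    ]
    node = rng.choice(candidates)

    # ast column offsets count UTF-8 bytes, so slice the encoded source.
    data = source.encode("utf-8")
    line_starts = [0]
    for line in data.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    start = line_starts[node.lineno - 1] + node.col_offset
    end = line_starts[node.end_lineno - 1] + node.end_col_offset

    masked = data[:start].decode("utf-8") + MASK + data[end:].decode("utf-8")
    removed = data[start:end].decode("utf-8")
    return masked, removed


if __name__ == "__main__":
    src = "def add(a, b):\n    return a + b\n"
    masked, removed = mask_random_subtree(src, random.Random(0))
    print(masked)
    print("removed:", repr(removed))
```

Because the masked span always coincides with a complete subtree, substituting the removed text back for the mask restores the original program exactly; a model trained on such pairs would see the surrounding left and right context and be asked to predict a syntactically complete unit.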