new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jan 15

Flexible Non-intrusive Dynamic Instrumentation for WebAssembly

A key strength of managed runtimes over hardware is the ability to gain detailed insight into the dynamic execution of programs with instrumentation. Analyses such as code coverage, execution frequency, tracing, and debugging, are all made easier in a virtual setting. As a portable, low-level bytecode, WebAssembly offers inexpensive in-process sandboxing with high performance. Yet to date, Wasm engines have not offered much insight into executing programs, supporting at best bytecode-level stepping and basic source maps, but no instrumentation capabilities. In this paper, we show the first non-intrusive dynamic instrumentation system for WebAssembly in the open-source Wizard Research Engine. Our innovative design offers a flexible, complete hierarchy of instrumentation primitives that support building high-level, complex analyses in terms of low-level, programmable probes. In contrast to emulation or machine code instrumentation, injecting probes at the bytecode level increases expressiveness and vastly simplifies the implementation by reusing the engine's JIT compiler, interpreter, and deoptimization mechanism rather than building new ones. Wizard supports both dynamic instrumentation insertion and removal while providing consistency guarantees, which is key to composing multiple analyses without interference. We detail a fully-featured implementation in a high-performance multi-tier Wasm engine, show novel optimizations specifically designed to minimize instrumentation overhead, and evaluate performance characteristics under load from various analyses. This design is well-suited for production engine adoption as probes can be implemented to have no impact on production performance when not in use.

  • 6 authors
·
Mar 12, 2024

STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning

In recent years, large language models (LLMs) have made significant progress in code intelligence, yet systematically evaluating their code understanding and reasoning abilities remains challenging. Mainstream benchmarks such as HumanEval and MBPP primarily assess functional correctness, while reasoning benchmarks like CRUXEVAL are limited to single-function, low-complexity scenarios. As a result, advanced models achieve nearly saturated scores, limiting their discriminative power. To address this, we present STEPWISE-CODEX-Bench (SX-Bench), a novel benchmark designed for complex multi-function understanding and fine-grained execution reasoning. SX-Bench features tasks involving collaboration among multiple sub-functions (e.g., chained calls, nested loops), shifting evaluation towards overall control and data flow modeling. It defines "computation steps" as the minimal execution unit and requires models to predict the total number of steps in reasoning tasks, thereby assessing a model's in-depth understanding of dynamic execution beyond simple I/O matching. Evaluation on over 20 mainstream models (including 14 reasoning-enhanced models) demonstrates that SX-Bench is highly discriminative: even the state-of-the-art OpenAI-O3 achieves only 78.37 percent accuracy on Hard-Reasoning tasks, much lower than its saturated scores on previous benchmarks, thereby revealing bottlenecks in complex and fine-grained reasoning. We also release an automated pipeline combining program synthesis, symbolic execution, and LLM-aided validation for efficient benchmark generation and quality assurance. SX-Bench advances code evaluation from "single-function verification" to "multi-function dynamic reasoning," providing a key tool for the in-depth assessment of advanced code intelligence models.

  • 6 authors
·
Aug 7, 2025

Lifecycle-Aware code generation: Leveraging Software Engineering Phases in LLMs

Recent progress in large language models (LLMs) has advanced automatic code generation, yet most approaches rely on direct, single-step translation from problem descriptions to code, disregarding structured software engineering practices. We introduce a lifecycle-aware framework that systematically incorporates intermediate artifacts such as requirements analysis, state machine modeling, and pseudocode into both the training and inference stages. This design aligns code generation with standard software development phases and enables more structured reasoning. Experiments show that lifecycle-level fine-tuning improves code correctness by up to 75% over the same model before fine-tuning, with performance gains compounding across intermediate stages. Multi-step inference consistently surpasses single-step generation, demonstrating the effectiveness of intermediate scaffolding. Notably, open-source LLMs, once fine-tuned under our framework, match or slightly outperform models pretrained on code. When applied to DeepSeek-Coder-1.3B, our framework yields relative CodeBLEU improvements of 34.3%, 20.0%, 11.2%, and 22.3% over ChatGPT-3.5, ChatGPT-4o-mini, DeepSeek-R1, and LLaMA-8B, respectively. Our pipeline also proves robust with up to 80\% less training data, confirming its resilience. Ablation studies further reveal that each intermediate artifact contributes distinctly to final code quality, with state machine modeling yielding the most substantial impact. Our source code and detailed experimental data are available at https://anonymous.4open.science/r/Lifecycle-Aware-3CCB.

  • 5 authors
·
Oct 27, 2025

UniCoder: Scaling Code Large Language Model via Universal Code

Intermediate reasoning or acting steps have successfully improved large language models (LLMs) for handling various downstream natural language processing (NLP) tasks. When applying LLMs for code generation, recent works mainly focus on directing the models to articulate intermediate natural-language reasoning steps, as in chain-of-thought (CoT) prompting, and then output code with the natural language or other structured intermediate steps. However, such output is not suitable for code translation or generation tasks since the standard CoT has different logical structures and forms of expression with the code. In this work, we introduce the universal code (UniCode) as the intermediate representation. It is a description of algorithm steps using a mix of conventions of programming languages, such as assignment operator, conditional operator, and loop. Hence, we collect an instruction dataset UniCoder-Instruct to train our model UniCoder on multi-task learning objectives. UniCoder-Instruct comprises natural-language questions, code solutions, and the corresponding universal code. The alignment between the intermediate universal code representation and the final code solution significantly improves the quality of the generated code. The experimental results demonstrate that UniCoder with the universal code significantly outperforms the previous prompting methods by a large margin, showcasing the effectiveness of the structural clues in pseudo-code.

  • 9 authors
·
Jun 24, 2024

CodePlan: Repository-level Coding using LLMs and Planning

Software engineering activities such as package migration, fixing errors reports from static analysis or testing, and adding type annotations or other specifications to a codebase, involve pervasively editing the entire repository of code. We formulate these activities as repository-level coding tasks. Recent tools like GitHub Copilot, which are powered by Large Language Models (LLMs), have succeeded in offering high-quality solutions to localized coding problems. Repository-level coding tasks are more involved and cannot be solved directly using LLMs, since code within a repository is inter-dependent and the entire repository may be too large to fit into the prompt. We frame repository-level coding as a planning problem and present a task-agnostic framework, called CodePlan to solve it. CodePlan synthesizes a multi-step chain of edits (plan), where each step results in a call to an LLM on a code location with context derived from the entire repository, previous code changes and task-specific instructions. CodePlan is based on a novel combination of an incremental dependency analysis, a change may-impact analysis and an adaptive planning algorithm. We evaluate the effectiveness of CodePlan on two repository-level tasks: package migration (C#) and temporal code edits (Python). Each task is evaluated on multiple code repositories, each of which requires inter-dependent changes to many files (between 2-97 files). Coding tasks of this level of complexity have not been automated using LLMs before. Our results show that CodePlan has better match with the ground truth compared to baselines. CodePlan is able to get 5/6 repositories to pass the validity checks (e.g., to build without errors and make correct code edits) whereas the baselines (without planning but with the same type of contextual information as CodePlan) cannot get any of the repositories to pass them.

  • 9 authors
·
Sep 21, 2023 14

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On self-invoking code generation task, the instruction-tuned models demonstrate only marginal improvements compared to the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks and provide a new direction for future research on enhancing LLMs' code reasoning capabilities.

  • 4 authors
·
Dec 30, 2024 3

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct the first class-level code generation benchmark ClassEval of 100 class-level Python code generation tasks with approximately 500 person-hours. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Based on our results, we have the following main findings. First, we find that all existing LLMs show much worse performance on class-level code generation compared to on standalone method-level code generation benchmarks like HumanEval; and the method-level coding ability cannot equivalently reflect the class-level coding ability among LLMs. Second, we find that GPT-4 and GPT-3.5 still exhibit dominate superior than other LLMs on class-level code generation, and the second-tier models includes Instruct-Starcoder, Instruct-Codegen, and Wizardcoder with very similar performance. Third, we find that generating the entire class all at once (i.e. holistic generation strategy) is the best generation strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e. incremental and compositional) is better strategies for the other models with limited ability of understanding long instructions and utilizing the middle information. Lastly, we find the limited model ability of generating method-dependent code and discuss the frequent error types in generated classes. Our benchmark is available at https://github.com/FudanSELab/ClassEval.

  • 10 authors
·
Aug 3, 2023

Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

Software engineers mainly write code by editing existing programs. In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of open-sourced edit data. While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs. It outputs edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction finetune a series of smaller LLMs ranging from 2.6B to 14B parameters on both the re-factored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks. We show that during repeated sampling, edit sequence finetuned models produce more diverse programs than baselines. This results in better inference-time scaling for benchmark coverage as a function of samples, i.e. the fraction of problems "pass@k" solved by any attempt given "k" tries. For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class. Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode.

  • 3 authors
·
Oct 3, 2024 3

EVOC2RUST: A Skeleton-guided Framework for Project-Level C-to-Rust Translation

Rust's compile-time safety guarantees make it ideal for safety-critical systems, creating demand for translating legacy C codebases to Rust. While various approaches have emerged for this task, they face inherent trade-offs: rule-based solutions face challenges in meeting code safety and idiomaticity requirements, while LLM-based solutions often fail to generate semantically equivalent Rust code, due to the heavy dependencies of modules across the entire codebase. Recent studies have revealed that both solutions are limited to small-scale programs. In this paper, we propose EvoC2Rust, an automated framework for converting entire C projects to equivalent Rust ones. EvoC2Rust employs a skeleton-guided translation strategy for project-level translation. The pipeline consists of three evolutionary stages: 1) it first decomposes the C project into functional modules, employs a feature-mapping-enhanced LLM to transform definitions and macros and generates type-checked function stubs, which form a compilable Rust skeleton; 2) it then incrementally translates the function, replacing the corresponding stub placeholder; 3) finally, it repairs compilation errors by integrating LLM and static analysis. Through evolutionary augmentation, EvoC2Rust combines the advantages of both rule-based and LLM-based solutions. Our evaluation on open-source benchmarks and six industrial projects demonstrates EvoC2Rust's superior performance in project-level C-to-Rust translation. On average, it achieves 17.24% and 14.32% improvements in syntax and semantic accuracy over the LLM-based approaches, along with a 96.79% higher code safety rate than the rule-based tools. At the module level, EvoC2Rust reaches 92.25% compilation and 89.53% test pass rates on industrial projects, even for complex codebases and long functions.

  • 8 authors
·
Aug 6, 2025 2

InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback

Humans write code in a fundamentally interactive manner and rely on constant execution feedback to correct errors, resolve ambiguities, and decompose tasks. While LLMs have recently exhibited promising coding capabilities, current coding benchmarks mostly consider a static instruction-to-code sequence transduction process, which has the potential for error propagation and a disconnect between the generated code and its final execution environment. To address this gap, we introduce InterCode, a lightweight, flexible, and easy-to-use framework of interactive coding as a standard reinforcement learning (RL) environment, with code as actions and execution feedback as observations. Our framework is language and platform agnostic, uses self-contained Docker environments to provide safe and reproducible execution, and is compatible out-of-the-box with traditional seq2seq coding methods, while enabling the development of new methods for interactive code generation. We use InterCode to create two interactive code environments with Bash and SQL as action spaces, leveraging data from the static Spider and NL2Bash datasets. We demonstrate InterCode's viability as a testbed by evaluating multiple state-of-the-art LLMs configured with different prompting strategies such as ReAct and Plan & Solve. Our results showcase the benefits of interactive code generation and demonstrate that InterCode can serve as a challenging benchmark for advancing code understanding and generation capabilities. InterCode is designed to be easily extensible and can even be used to incorporate new tasks such as Capture the Flag, a popular coding puzzle that is inherently multi-step and involves multiple programming languages. Project site with code and data: https://intercode-benchmark.github.io

  • 4 authors
·
Jun 26, 2023

Learning Type Inference for Enhanced Dataflow Analysis

Statically analyzing dynamically-typed code is a challenging endeavor, as even seemingly trivial tasks such as determining the targets of procedure calls are non-trivial without knowing the types of objects at compile time. Addressing this challenge, gradual typing is increasingly added to dynamically-typed languages, a prominent example being TypeScript that introduces static typing to JavaScript. Gradual typing improves the developer's ability to verify program behavior, contributing to robust, secure and debuggable programs. In practice, however, users only sparsely annotate types directly. At the same time, conventional type inference faces performance-related challenges as program size grows. Statistical techniques based on machine learning offer faster inference, but although recent approaches demonstrate overall improved accuracy, they still perform significantly worse on user-defined types than on the most common built-in types. Limiting their real-world usefulness even more, they rarely integrate with user-facing applications. We propose CodeTIDAL5, a Transformer-based model trained to reliably predict type annotations. For effective result retrieval and re-integration, we extract usage slices from a program's code property graph. Comparing our approach against recent neural type inference systems, our model outperforms the current state-of-the-art by 7.85% on the ManyTypes4TypeScript benchmark, achieving 71.27% accuracy overall. Furthermore, we present JoernTI, an integration of our approach into Joern, an open source static analysis tool, and demonstrate that the analysis benefits from the additional type information. As our model allows for fast inference times even on commodity CPUs, making our system available through Joern leads to high accessibility and facilitates security research.

  • 6 authors
·
Oct 1, 2023 1

Empirical Research on Utilizing LLM-based Agents for Automated Bug Fixing via LangGraph

This paper presents a novel framework for automated code generation and debugging, designed to improve accuracy, efficiency, and scalability in software development. The proposed system integrates three core components LangGraph, GLM4 Flash, and ChromaDB within a four step iterative workflow to deliver robust performance and seamless functionality. LangGraph serves as a graph-based library for orchestrating tasks, providing precise control and execution while maintaining a unified state object for dynamic updates and consistency. It supports multi-agent, hierarchical, and sequential processes, making it highly adaptable to complex software engineering workflows. GLM4 Flash, a large language model, leverages its advanced capabilities in natural language understanding, contextual reasoning, and multilingual support to generate accurate code snippets based on user prompts. ChromaDB acts as a vector database for semantic search and contextual memory storage, enabling the identification of patterns and the generation of context-aware bug fixes based on historical data. The system operates through a structured four-step process: (1) Code Generation, which translates natural language descriptions into executable code; (2) Code Execution, which validates the code by identifying runtime errors and inconsistencies; (3) Code Repair, which iteratively refines buggy code using ChromaDB's memory capabilities and LangGraph's state tracking; and (4) Code Update, which ensures the code meets functional and performance requirements through iterative modifications.

  • 2 authors
·
Jan 29, 2025

Sifting through the Chaff: On Utilizing Execution Feedback for Ranking the Generated Code Candidates

Large Language Models (LLMs), such as GPT-4, StarCoder, and CodeLlama, are transforming the way developers approach programming by automatically generating code based on given natural language descriptions. Despite advancements, generating syntactically and semantically correct code remains challenging, especially for complex programming tasks. Existing approaches typically generate multiple candidate solutions using LLMs to increase the likelihood of producing correct code. However, selecting the correct code from these candidates-a process known as code ranking-remains a major challenge. Current research on code ranking can be categorized into execution-based and non-execution-based methods. Execution-based methods, although effective, encounter notable limitations, such as scarcity of quality unit tests and security risks. Non-execution-based methods like CodeRanker, which rely solely on classification labels to train a code ranker, struggle to capture subtle errors and provide detailed error insights. Recognizing the strengths and limitations of both approaches, we propose a new method. The key insight of our work is that an effective code ranker is expected to truly comprehend the underlying causes of erroneous code, as relying solely on classification labels is insufficient. Inspired by this, this paper puts forward RankEF, an innovative approach for code ranking that leverages execution feedback. RankEF employs multi-task learning to integrate code classification with execution feedback generation. This approach enables the model to understand the reasons behind incorrect code, distinguishing between correct and incorrect solutions without the need to execute the code during the ranking phase. Experiments on three code generation benchmarks demonstrate that RankEF significantly outperforms the state-of-the-art CodeRanker.

  • 7 authors
·
Aug 25, 2024

Compiling C to Safe Rust, Formalized

The popularity of the Rust language continues to explode; yet, many critical codebases remain authored in C, and cannot be realistically rewritten by hand. Automatically translating C to Rust is thus an appealing course of action. Several works have gone down this path, handling an ever-increasing subset of C through a variety of Rust features, such as unsafe. While the prospect of automation is appealing, producing code that relies on unsafe negates the memory safety guarantees offered by Rust, and therefore the main advantages of porting existing codebases to memory-safe languages. We instead explore a different path, and explore what it would take to translate C to safe Rust; that is, to produce code that is trivially memory safe, because it abides by Rust's type system without caveats. Our work sports several original contributions: a type-directed translation from (a subset of) C to safe Rust; a novel static analysis based on "split trees" that allows expressing C's pointer arithmetic using Rust's slices and splitting operations; an analysis that infers exactly which borrows need to be mutable; and a compilation strategy for C's struct types that is compatible with Rust's distinction between non-owned and owned allocations. We apply our methodology to existing formally verified C codebases: the HACL* cryptographic library, and binary parsers and serializers from EverParse, and show that the subset of C we support is sufficient to translate both applications to safe Rust. Our evaluation shows that for the few places that do violate Rust's aliasing discipline, automated, surgical rewrites suffice; and that the few strategic copies we insert have a negligible performance impact. Of particular note, the application of our approach to HACL* results in a 80,000 line verified cryptographic library, written in pure Rust, that implements all modern algorithms - the first of its kind.

  • 2 authors
·
Dec 19, 2024

LDB: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step

Large language models (LLMs) are leading significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs. However, these works consider the generated programs as an indivisible entity, which falls short for LLMs in debugging the programs, especially when the programs contain complex logic flows and data operations. In contrast, when human developers debug programs, they typically set breakpoints and selectively examine runtime execution information. The execution flow and the intermediate variables play a crucial role in the debugging process, yet they are underutilized in the existing literature on code generation. In this study, we introduce Large Language Model Debugger (LDB), a novel debugging framework that enables LLMs to refine their generated programs with the runtime execution information. Specifically, LDB segments the programs into basic blocks and tracks the values of intermediate variables after each block throughout the runtime execution. This allows LLMs to concentrate on simpler code units within the overall execution flow, verify their correctness against the task description block by block, and efficiently pinpoint any potential errors. Experiments demonstrate that LDB consistently enhances the baseline performance by up to 9.8% across the HumanEval, MBPP, and TransCoder benchmarks, archiving new state-of-the-art performance in code debugging for various LLM selections.

  • 3 authors
·
Feb 24, 2024

RustMap: Towards Project-Scale C-to-Rust Migration via Program Analysis and LLM

Migrating existing C programs into Rust is increasingly desired, as Rust offers superior memory safety while maintaining C's high performance. However, vastly different features between C and Rust--e.g., distinct definitions and usages of pointers and references--pose significant challenges beyond mere syntactic translation. Existing automated translation tools, such as C2Rust, may rely too much on syntactic, template-based translation and generate unsafe Rust code that is hard for human developers to read, maintain, or even compile. More semantic-aware translation that produces safer, idiomatic, and runnable Rust code is much needed. This paper introduces a novel dependency-guided and large language model (LLM)-based C-to-Rust translation approach, RustMap, based on three key ideas: (1) Utilize LLM capabilities to produce idiomatic Rust code from given small pieces of C code, (2) Mitigate LLM limitations in handling large codebases by breaking project-scale C programs into smaller units for translation according to their usage dependencies and composing them into a runnable Rust program, and (3) Enhance the correctness of the translated Rust program by using test cases to check input/output equivalence, isolate faulty code when execution states deviate, and iteratively refine the translation using feedback from compilation and test errors. We empirically evaluate RustMap on 126 real-world programs, including 125 from Rosetta Code and a 7000+ line bzip2 implementation using GPT-4o as the LLM. RustMap shows promising results, guiding GPT-4o to produce idiomatic, readable, and functional Rust code with significantly less unsafe code than other tools, and revealing non-trivial translation patterns reusable for future research.

  • 9 authors
·
Mar 22, 2025

SuperCoder2.0: Technical Report on Exploring the feasibility of LLMs as Autonomous Programmer

We present SuperCoder2.0, an advanced autonomous system designed to enhance software development through artificial intelligence. The system combines an AI-native development approach with intelligent agents to enable fully autonomous coding. Key focus areas include a retry mechanism with error output traceback, comprehensive code rewriting and replacement using Abstract Syntax Tree (ast) parsing to minimize linting issues, code embedding technique for retrieval-augmented generation, and a focus on localizing methods for problem-solving rather than identifying specific line numbers. The methodology employs a three-step hierarchical search space reduction approach for code base navigation and bug localization:utilizing Retrieval Augmented Generation (RAG) and a Repository File Level Map to identify candidate files, (2) narrowing down to the most relevant files using a File Level Schematic Map, and (3) extracting 'relevant locations' within these files. Code editing is performed through a two-part module comprising CodeGeneration and CodeEditing, which generates multiple solutions at different temperature values and replaces entire methods or classes to maintain code integrity. A feedback loop executes repository-level test cases to validate and refine solutions. Experiments conducted on the SWE-bench Lite dataset demonstrate SuperCoder2.0's effectiveness, achieving correct file localization in 84.33% of cases within the top 5 candidates and successfully resolving 34% of test instances. This performance places SuperCoder2.0 fourth globally on the SWE-bench leaderboard. The system's ability to handle diverse repositories and problem types highlights its potential as a versatile tool for autonomous software development. Future work will focus on refining the code editing process and exploring advanced embedding models for improved natural language to code mapping.

  • 5 authors
·
Sep 17, 2024

WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation

Recent work demonstrates that, after being fine-tuned on a high-quality instruction dataset, the resulting model can obtain impressive capabilities to address a wide range of tasks. However, existing methods for instruction data generation often produce duplicate data and are not controllable enough on data quality. In this paper, we extend the generalization of instruction tuning by classifying the instruction data to 4 code-related tasks and propose a LLM-based Generator-Discriminator data process framework to generate diverse, high-quality instruction data from open source code. Hence, we introduce CodeOcean, a dataset comprising 20,000 instruction instances across 4 universal code-related tasks,which is aimed at augmenting the effectiveness of instruction tuning and improving the generalization ability of fine-tuned model. Subsequently, we present WaveCoder, a fine-tuned Code LLM with Widespread And Versatile Enhanced instruction tuning. This model is specifically designed for enhancing instruction tuning of Code Language Models (LLMs). Our experiments demonstrate that Wavecoder models outperform other open-source models in terms of generalization ability across different code-related tasks at the same level of fine-tuning scale. Moreover, Wavecoder exhibits high efficiency in previous code generation tasks. This paper thus offers a significant contribution to the field of instruction data generation and fine-tuning models, providing new insights and tools for enhancing performance in code-related tasks.

  • 8 authors
·
Dec 20, 2023 5

FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding

The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as "vibe coding," where users interact with coding agents through high-level natural language. However, existing evaluation benchmarks for code generation inadequately assess an agent's vibe coding capabilities. Existing benchmarks are misaligned, as they either require code-level specifications or focus narrowly on issue-solving, neglecting the critical scenario of feature implementation within the vibe coding paradiam. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features: 1. Pure Natural Language Prompts. Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints. 2. A Rigorous & Evolving Data Collection Process. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark, mitigating data contamination. 3. Comprehensive Test Cases. Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. 4. Diverse Application Domains. The benchmark includes repositories from diverse domains to ensure it reflects real-world scenarios. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%. Our analysis also reveals a tendency for "aggressive implementation," a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research.

  • 3 authors
·
Sep 26, 2025

CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios

In the evolving landscape of large language models (LLMs) tailored for software engineering, the need for benchmarks that accurately reflect real-world development scenarios is paramount. Current benchmarks are either too simplistic or fail to capture the multi-tasking nature of software development. To address this, we introduce CoderUJB, a new benchmark designed to evaluate LLMs across diverse Java programming tasks that are executable and reflective of actual development scenarios, acknowledging Java's prevalence in real-world software production. CoderUJB comprises 2,239 programming questions derived from 17 real open-source Java projects and spans five practical programming tasks. Our empirical study on this benchmark investigates the coding abilities of various open-source and closed-source LLMs, examining the effects of continued pre-training in specific programming languages code and instruction fine-tuning on their performance. The findings indicate that while LLMs exhibit strong potential, challenges remain, particularly in non-functional code generation (e.g., test generation and defect detection). Importantly, our results advise caution in the specific programming languages continued pre-training and instruction fine-tuning, as these techniques could hinder model performance on certain tasks, suggesting the need for more nuanced strategies. CoderUJB thus marks a significant step towards more realistic evaluations of programming capabilities in LLMs, and our study provides valuable insights for the future development of these models in software engineering.

  • 5 authors
·
Mar 28, 2024

LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale. We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined in a LoCoBench Score (LCBS). Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development represents a significant unsolved challenge that demands more attention. LoCoBench is released at: https://github.com/SalesforceAIResearch/LoCoBench.

  • 17 authors
·
Sep 11, 2025 2

A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback

Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. Existing benchmarks often prioritize functional correctness, overlooking the nuanced requirements found in real-world development. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluate instruction-following in code generation across multiple dimensions: constraint type, hierarchical levels, and iterative refinement. Built upon a structured taxonomy of 9 categories and 27 constraint types, MultiCodeIF enables granular assessment of both functional and non-functional instruction adherence. Using an automated pipeline, ConstraGen, we synthesize and evolve 2,021 code tasks sourced from 14 programming languages, supporting multi-turn evaluation through feedback-driven task variants. Empirical evaluation of six state-of-the-art LLMs uncovers substantial performance disparities. The top-performing model, Claude-3-7-Sonnet, achieves 63.0% average constraint satisfaction, while smaller models like Qwen3-1.7B fall to 44.8%. Models perform well on explicit constraints, but struggle with implicit or abstract constraints. Tasks with multiple hierarchical constraints significantly reduce model success rates, from 54.5% in single-level to just 18.8% in multi-level scenarios. However, structured feedback enables progressive improvement: average constraint satisfaction rises from 63.0% to 83.4% over four iterative refinement rounds. MultiCodeIF provides a scalable, constraint-aware, and feedback-sensitive framework to benchmark LLMs under realistic code generation scenarios, bridging the gap between synthetic evaluations and real-world instruction complexity. The full benchmark dataset, evaluation pipeline, and source code are available at https://github.com/SYSUSELab/MultiCodeIF.

  • 6 authors
·
Jul 1, 2025

Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees

The hardware ecosystem is rapidly evolving, with increasing interest in translating low-level programs across different instruction set architectures (ISAs) in a quick, flexible, and correct way to enhance the portability and longevity of existing code. A particularly challenging class of this transpilation problem is translating between complex- (CISC) and reduced- (RISC) hardware architectures, due to fundamental differences in instruction complexity, memory models, and execution paradigms. In this work, we introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models (LLMs) with the rigor of established software testing constructs. Our method generates candidate translations using an LLM from one ISA to another, and embeds such translations within a software-testing framework to build quantifiable confidence in the translation. We evaluate our GG approach over two diverse datasets, enforce high code coverage (>98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs, respectively. Further, we compare our approach to the state-of-the-art Rosetta 2 framework on Apple Silicon, showcasing 1.73x faster runtime performance, 1.47x better energy efficiency, and 2.41x better memory usage for our transpiled code, demonstrating the effectiveness of GG for real-world CISC-to-RISC translation tasks. We will open-source our codes, data, models, and benchmarks to establish a common foundation for ISA-level code translation research.

LLM-Driven Multi-step Translation from C to Rust using Static Analysis

Translating software written in legacy languages to modern languages, such as C to Rust, has significant benefits in improving memory safety while maintaining high performance. However, manual translation is cumbersome, error-prone, and produces unidiomatic code. Large language models (LLMs) have demonstrated promise in producing idiomatic translations, but offer no correctness guarantees as they lack the ability to capture all the semantics differences between the source and target languages. To resolve this issue, we propose SACTOR, an LLM-driven C-to-Rust zero-shot translation tool using a two-step translation methodology: an "unidiomatic" step to translate C into Rust while preserving semantics, and an "idiomatic" step to refine the code to follow Rust's semantic standards. SACTOR utilizes information provided by static analysis of the source C program to address challenges such as pointer semantics and dependency resolution. To validate the correctness of the translated result from each step, we use end-to-end testing via the foreign function interface to embed our translated code segment into the original code. We evaluate the translation of 200 programs from two datasets and two case studies, comparing the performance of GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Llama 3.3 70B and DeepSeek-R1 in SACTOR. Our results demonstrate that SACTOR achieves high correctness and improved idiomaticity, with the best-performing model (DeepSeek-R1) reaching 93% and (GPT-4o, Claude 3.5, DeepSeek-R1) reaching 84% correctness (on each dataset, respectively), while producing more natural and Rust-compliant translations compared to existing methods.

  • 6 authors
·
Mar 16, 2025

MIGRATION-BENCH: Repository-Level Code Migration Benchmark from Java 8

With the rapid advancement of powerful large language models (LLMs) in recent years, a wide range of software engineering tasks can now be addressed using LLMs, significantly enhancing productivity and scalability. Numerous benchmark datasets have been developed to evaluate the coding capabilities of these models, while they primarily focus on problem-solving and issue-resolution tasks. In contrast, we introduce a new coding benchmark MIGRATION-BENCH with a distinct focus: code migration. MIGRATION-BENCH aims to serve as a comprehensive benchmark for migration from Java 8 to the latest long-term support (LTS) versions (Java 17, 21), MIGRATION-BENCH includes a full dataset and its subset selected with 5,102 and 300 repositories respectively. Selected is a representative subset curated for complexity and difficulty, offering a versatile resource to support research in the field of code migration. Additionally, we provide a comprehensive evaluation framework to facilitate rigorous and standardized assessment of LLMs on this challenging task. We further propose SD-Feedback and demonstrate that LLMs can effectively tackle repository-level code migration to Java 17. For the selected subset with Claude-3.5-Sonnet-v2, SD-Feedback achieves 62.33% and 27.00% success rate (pass@1) for minimal and maximal migration respectively. The benchmark dataset and source code are available at: https://huggingface.co/collections/AmazonScience and https://github.com/amazon-science/self_debug respectively.

  • 11 authors
·
May 14, 2025 2

CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules

Large Language Models (LLMs) have already become quite proficient at solving simpler programming tasks like those in HumanEval or MBPP benchmarks. However, solving more complex and competitive programming tasks is still quite challenging for these models - possibly due to their tendency to generate solutions as monolithic code blocks instead of decomposing them into logical sub-tasks and sub-modules. On the other hand, experienced programmers instinctively write modularized code with abstraction for solving complex tasks, often reusing previously developed modules. To address this gap, we propose CodeChain, a novel framework for inference that elicits modularized code generation through a chain of self-revisions, each being guided by some representative sub-modules generated in previous iterations. Concretely, CodeChain first instructs the LLM to generate modularized codes through chain-of-thought prompting. Then it applies a chain of self-revisions by iterating the two steps: 1) extracting and clustering the generated sub-modules and selecting the cluster representatives as the more generic and re-usable implementations, and 2) augmenting the original chain-of-thought prompt with these selected module-implementations and instructing the LLM to re-generate new modularized solutions. We find that by naturally encouraging the LLM to reuse the previously developed and verified sub-modules, CodeChain can significantly boost both modularity as well as correctness of the generated solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on CodeContests. It is shown to be effective on both OpenAI LLMs as well as open-sourced LLMs like WizardCoder. We also conduct comprehensive ablation studies with different methods of prompting, number of clusters, model sizes, program qualities, etc., to provide useful insights that underpin CodeChain's success.

  • 6 authors
·
Oct 13, 2023 1

EpiCoder: Encompassing Diversity and Complexity in Code Generation

Effective instruction tuning is indispensable for optimizing code LLMs, aligning model behavior with user expectations and enhancing model performance in real-world applications. However, most existing methods focus on code snippets, which are limited to specific functionalities and rigid structures, restricting the complexity and diversity of the synthesized data. To address these limitations, we introduce a novel feature tree-based synthesis framework inspired by Abstract Syntax Trees (AST). Unlike AST, which captures syntactic structure of code, our framework models semantic relationships between code elements, enabling the generation of more nuanced and diverse data. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features. This process enables the identification of more complex patterns and relationships within the code. By sampling subtrees with controlled depth and breadth, our framework allows precise adjustments to the complexity of the generated code, supporting a wide range of tasks from simple function-level operations to intricate multi-file scenarios. We fine-tuned widely-used base models to create the EpiCoder series, achieving state-of-the-art performance at both the function and file levels across multiple benchmarks. Notably, empirical evidence indicates that our approach shows significant potential in synthesizing highly complex repository-level code data. Further analysis elucidates the merits of this approach by rigorously assessing data complexity and diversity through software engineering principles and LLM-as-a-judge method.

  • 13 authors
·
Jan 8, 2025 2

CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

Given a large and evolving codebase, the ability to automatically generate holistic, architecture-aware documentation that captures not only individual functions but also cross-file, cross-module, and system-level interactions remains an open challenge. Comprehensive documentation is essential for long-term software maintenance and collaboration, yet current automated approaches still fail to model the rich semantic dependencies and architectural structures that define real-world software systems. We present CodeWiki, a unified framework for automated repository-level documentation across seven programming languages. CodeWiki introduces three key innovations: (i) hierarchical decomposition that preserves architectural context across multiple levels of granularity, (ii) recursive multi-agent processing with dynamic task delegation for scalable generation, and (iii) multi-modal synthesis that integrates textual descriptions with visual artifacts such as architecture diagrams and data-flow representations. To enable rigorous evaluation, we introduce CodeWikiBench, a comprehensive benchmark featuring multi-dimensional rubrics and LLM-based assessment protocols. Experimental results show that CodeWiki achieves a 68.79\% quality score with proprietary models, outperforming the closed-source DeepWiki baseline (64.06\%) by 4.73\%, with particularly strong improvements on high-level scripting languages (+10.47\%). We open-source CodeWiki to foster future research and community adoption.

  • 4 authors
·
Oct 28, 2025

How Well Do LLMs Generate Code for Different Application Domains? Benchmark and Evaluation

Recently, an increasing number of AI-driven programming assistants powered by code LLMs have been integrated into various real-world software development environments, significantly boosting developer productivity. However, existing code generation benchmarks primarily focus on general-purpose scenarios, leaving the code generation performance of LLMs for specific application domains largely unknown. In this paper, we introduce a new benchmark, MultiCodeBench, to fill this gap. MultiCodeBench comprises 2,400 programming tasks, covering 12 popular software development domains and 15 programming languages. Specifically, we perform in-depth research to identify these 12 application domains. Given that each domain may involve multiple technical frameworks, and that different frameworks present distinct challenges in the coding process, we categorize the commonly used frameworks and platforms within each domain. We then sample programming problems from GitHub repositories related to these subdomains. To ensure the quality of the tasks and mitigate data leakage issues, we invite annotators to rewrite the docstrings for each task in MultiCodeBench. Additionally, we build a static analysis-based dependency parsing tool to extract the dependencies in the ground truth for each task, enabling deeper performance analysis. Through extensive experiments on MultiCodeBench with eleven representative mainstream LLMs, we reveal the code generation performance of the LLMs across different application domains, providing practical insights for developers in downstream fields when selecting LLMs. Furthermore, we analyze the reasons behind the models' failures in completing software application development tasks, offering guidance for model developers to enhance domain-specific code generation capabilities.

  • 5 authors
·
Dec 24, 2024

Beyond Autoregression: An Empirical Study of Diffusion Large Language Models for Code Generation

LLMs have become the mainstream approaches to code generation. Existing LLMs mainly employ autoregressive generation, i.e. generating code token-by-token from left to right. However, the underlying autoregressive generation has two limitations in code generation. First, autoregressive LLMs only generate a token at each step, showing low efficiency in practice. Second, programming is a non-sequential process involving back-and-forth editing, while autoregressive LLMs only employ the left-to-right generation order. These two intrinsic limitations hinder the further development of LLMs in code generation. Recently, diffusion LLMs have emerged as a promising alternative. Diffusion LLMs address the above limitations with two advances, including multi-token prediction (i.e. generating multiple tokens at each step) and flexible generation order (i.e. flexibly determining which positions to generate tokens). However, there is no systematic study exploring diffusion LLMs in code generation. To bridge the knowledge gap, we present the first empirical study of diffusion LLMs for code generation. Our study involves 9 representative diffusion LLMs and conduct experiments on 4 widely used benchmarks. Based on the results, we summarize the following findings. (1) Existing diffusion LLMs are competitive with autoregressive LLMs with similar sizes. (2) Diffusion LLMs have a stronger length extrapolation ability than autoregressive LLMs and perform better in long code understanding. (3) We explore factors impacting the effectiveness and efficiency of diffusion LLMs, and provide practical guidance. (4) We discuss several promising further directions to improve diffusion LLMs on code generation. We open-source all source code, data, and results to facilitate the following research. The code is publicly available at https://github.com/zhangyitonggg/dllm4code.

  • 5 authors
·
Sep 14, 2025

CoCoNUT: Structural Code Understanding does not fall out of a tree

Large Language Models (LLMs) have shown impressive performance across a wide array of tasks involving both structured and unstructured textual data. Recent results on various benchmarks for code generation, repair, or completion suggest that certain models have programming abilities comparable to or even surpass humans. In this work, we demonstrate that high performance on such benchmarks does not correlate to humans' innate ability to understand structural control flow in code. To this end, we extract solutions from the HumanEval benchmark, which the relevant models perform strongly on, and trace their execution path using function calls sampled from the respective test set. Using this dataset, we investigate the ability of seven state-of-the-art LLMs to match the execution trace and find that, despite their ability to generate semantically identical code, they possess limited ability to trace execution paths, especially for longer traces and specific control structures. We find that even the top-performing model, Gemini, can fully and correctly generate only 47% of HumanEval task traces. Additionally, we introduce a subset for three key structures not contained in HumanEval: Recursion, Parallel Processing, and Object-Oriented Programming, including concepts like Inheritance and Polymorphism. Besides OOP, we show that none of the investigated models achieve an accuracy over 5% on the relevant traces. Aggregating these specialized parts with HumanEval tasks, we present Benchmark CoCoNUT: Code Control Flow for Navigation Understanding and Testing, which measures a model's ability to trace execution of code upon relevant calls, including advanced structural components. We conclude that current LLMs need significant improvement to enhance code reasoning abilities. We hope our dataset helps researchers bridge this gap.

  • 2 authors
·
Jan 27, 2025

Modularization is Better: Effective Code Generation with Modular Prompting

Large Language Models are transforming software development by automatically generating code. Current prompting techniques such as Chain-of-Thought (CoT) suggest tasks step by step and the reasoning process follows a linear structure, which hampers the understanding of complex programming problems, particularly those requiring hierarchical solutions. Inspired by the principle of modularization in software development, in this work, we propose a novel prompting technique, called MoT, to enhance the code generation performance of LLMs. At first, MoT exploits modularization principles to decompose complex programming problems into smaller, independent reasoning steps, enabling a more structured and interpretable problem-solving process. This hierarchical structure improves the LLM's ability to comprehend complex programming problems. Then, it structures the reasoning process using an MLR Graph (Multi-Level Reasoning Graph), which hierarchically organizes reasoning steps. This approach enhances modular understanding and ensures better alignment between reasoning steps and the generated code, significantly improving code generation performance. Our experiments on two advanced LLMs (GPT-4o-mini and DeepSeek-R1), comparing MoT to six baseline prompting techniques across six widely used datasets, HumanEval, HumanEval-ET, HumanEval+, MBPP, MBPP-ET, and MBPP+, demonstrate that MoT significantly outperforms existing baselines (e.g., CoT and SCoT), achieving Pass@1 scores ranging from 58.1% to 95.1%. The experimental results confirm that MoT significantly enhances the performance of LLM-based code generation.

  • 2 authors
·
Mar 16, 2025

Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation

Recent advances in LLM-based decompilers have been shown effective to convert low-level binaries into human-readable source code. However, there still lacks a comprehensive benchmark that provides large-scale binary-source function pairs, which is critical for advancing the LLM decompilation technology. Creating accurate binary-source mappings incurs severe issues caused by complex compilation settings and widespread function inlining that obscure the correspondence between binaries and their original source code. Previous efforts have either relied on used contest-style benchmarks, synthetic binary-source mappings that diverge significantly from the mappings in real world, or partially matched binaries with only code lines or variable names, compromising the effectiveness of analyzing the binary functionality. To alleviate these issues, we introduce Decompile-Bench, the first open-source dataset comprising two million binary-source function pairs condensed from 100 million collected function pairs, i.e., 450GB of binaries compiled from permissively licensed GitHub projects. For the evaluation purposes, we also developed a benchmark Decompile-Bench-Eval including manually crafted binaries from the well-established HumanEval and MBPP, alongside the compiled GitHub repositories released after 2025 to mitigate data leakage issues. We further explore commonly-used evaluation metrics to provide a thorough assessment of the studied LLM decompilers and find that fine-tuning with Decompile-Bench causes a 20% improvement over previous benchmarks in terms of the re-executability rate. Our code and data has been released in HuggingFace and Github. https://github.com/albertan017/LLM4Decompile

  • 9 authors
·
May 18, 2025

Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

Recent advancements in software engineering agents have demonstrated promising capabilities in automating program improvements. However, their reliance on closed-source or resource-intensive models introduces significant deployment challenges in private environments, prompting a critical question: How can personally deployable open-source LLMs achieve comparable code reasoning performance? To this end, we propose a unified Test-Time Compute scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. Internally, we introduce a development-contextualized trajectory synthesis method leveraging real-world software repositories to bootstrap multi-stage reasoning processes, such as fault localization and patch generation. We further enhance trajectory quality through rejection sampling, rigorously evaluating trajectories along accuracy and complexity. Externally, we propose a novel development-process-based search strategy guided by reward models and execution verification. This approach enables targeted computational allocation at critical development decision points, overcoming limitations of existing "end-point only" verification methods. Evaluations on SWE-bench Verified demonstrate our 32B model achieves a 46\% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1. Additionally, we provide the empirical validation of the test-time scaling phenomenon within SWE agents, revealing that models dynamically allocate more tokens to increasingly challenging problems, effectively enhancing reasoning capabilities. We publicly release all training data, models, and code to facilitate future research. https://github.com/yingweima2022/SWE-Reasoner

  • 8 authors
·
Mar 31, 2025

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

Large Language Models (LLMs) have demonstrated remarkable performance on coding related tasks, particularly on assisting humans in programming and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capacities of LLMs suffer from severe limitations. First, most benchmarks are deficient as they focus on a narrow range of popular programming languages and specific tasks, whereas the real-world software development scenarios show dire need to implement systems with multilingual programming environments to satisfy diverse requirements. Practical programming practices also strongly expect multi-task settings for testing coding capabilities of LLMs comprehensively and robustly. Second, most benchmarks also fail to consider the actual executability and the consistency of execution results of the generated code. To bridge these gaps between existing benchmarks and expectations from practical applications, we introduce CodeScope, an execution-based, multilingual, multi-task, multi-dimensional evaluation benchmark for comprehensively gauging LLM capabilities on coding tasks. CodeScope covers 43 programming languages and 8 coding tasks. It evaluates the coding performance of LLMs from three dimensions (perspectives): difficulty, efficiency, and length. To facilitate execution-based evaluations of code generation, we develop MultiCodeEngine, an automated code execution engine that supports 14 programming languages. Finally, we systematically evaluate and analyze 8 mainstream LLMs on CodeScope tasks and demonstrate the superior breadth and challenges of CodeScope for evaluating LLMs on code understanding and generation tasks compared to other benchmarks. The CodeScope benchmark and datasets are publicly available at https://github.com/WeixiangYAN/CodeScope.

  • 11 authors
·
Nov 14, 2023

Structured Chain-of-Thought Prompting for Code Generation

Large Language Models (LLMs) (e.g., ChatGPT) have shown impressive performance in code generation. LLMs take prompts as inputs, and Chain-of-Thought (CoT) prompting is the state-of-the-art prompting technique. CoT prompting asks LLMs first to generate CoTs (i.e., intermediate natural language reasoning steps) and then output the code. However, CoT prompting is designed for natural language generation and has low accuracy in code generation. In this paper, we propose Structured CoTs (SCoTs) and present a novel prompting technique for code generation, named SCoT prompting. Our motivation is source code contains rich structural information and any code can be composed of three program structures (i.e., sequence, branch, and loop structures). Intuitively, structured intermediate reasoning steps make for structured source code. Thus, we ask LLMs to use program structures to build CoTs, obtaining SCoTs. Then, LLMs generate the final code based on SCoTs. Compared to CoT prompting, SCoT prompting explicitly constrains LLMs to think about how to solve requirements from the view of source code and further the performance of LLMs in code generation. We apply SCoT prompting to two LLMs (i.e., ChatGPT and Codex) and evaluate it on three benchmarks (i.e., HumanEval, MBPP, and MBCPP). (1) SCoT prompting outperforms the state-of-the-art baseline - CoT prompting by up to 13.79% in Pass@1. (2) Human evaluation shows human developers prefer programs from SCoT prompting. (3) SCoT prompting is robust to examples and achieves substantial improvements.

  • 4 authors
·
May 11, 2023

Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository

LLMs have demonstrated significant potential in code generation tasks, achieving promising results at the function or statement level across various benchmarks. However, the complexities associated with creating code artifacts like classes, particularly within the context of real-world software repositories, remain underexplored. Prior research treats class-level generation as an isolated task, neglecting the intricate dependencies & interactions that characterize real-world software environments. To address this gap, we introduce RepoClassBench, a comprehensive benchmark designed to rigorously evaluate LLMs in generating complex, class-level code within real-world repositories. RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories. We ensure that each class in our dataset not only has cross-file dependencies within the repository but also includes corresponding test cases to verify its functionality. We find that current models struggle with the realistic challenges posed by our benchmark, primarily due to their limited exposure to relevant repository contexts. To address this shortcoming, we introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context in an agent-based framework. Our experiments demonstrate that RRR significantly outperforms existing baselines on RepoClassBench, showcasing its effectiveness across programming languages & under various settings. Our findings emphasize the critical need for code-generation benchmarks to incorporate repo-level dependencies to more accurately reflect the complexities of software development. Our work shows the benefits of leveraging specialized tools to enhance LLMs' understanding of repository context. We plan to make our dataset & evaluation harness public.

  • 7 authors
·
Apr 21, 2024

CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models

Code generation models based on the pre-training and fine-tuning paradigm have been increasingly attempted by both academia and industry, resulting in well-known industrial models such as Codex, CodeGen, and PanGu-Coder. To evaluate the effectiveness of these models, multiple existing benchmarks are proposed, including only cases of generating a standalone function, i.e., a function that may invoke or access only built-in functions and standard libraries. However, non-standalone functions, which typically are not included in the existing benchmarks, constitute more than 70% of the functions in popular open-source projects, and evaluating models' effectiveness on standalone functions cannot reflect these models' effectiveness on pragmatic code generation scenarios. To help bridge the preceding gap, in this paper, we propose a benchmark named CoderEval, consisting of 230 Python and 230 Java code generation tasks carefully curated from popular real-world open-source projects and a self-contained execution platform to automatically assess the functional correctness of generated code. CoderEval supports code generation tasks from six levels of context dependency, where context refers to code elements such as types, APIs, variables, and consts defined outside the function under generation but within the dependent third-party libraries, current class, file, or project. CoderEval can be used to evaluate the effectiveness of models in generating code beyond only standalone functions. By evaluating three code generation models on CoderEval, we find that the effectiveness of these models in generating standalone functions is substantially higher than that in generating non-standalone functions. Our analysis highlights the current progress and pinpoints future directions to further improve a model's effectiveness by leveraging contextual information for pragmatic code generation.

  • 10 authors
·
Feb 1, 2023

CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation

With the rapid advancement of Large Language Models (LLMs), the demand for robust instruction-following capabilities in code generation tasks has grown significantly. Code generation not only facilitates faster prototyping and automated testing, but also augments developer efficiency through improved maintainability and reusability of code. In this paper, we introduce CodeIF, the first benchmark specifically designed to assess the abilities of LLMs to adhere to task-oriented instructions within diverse code generation scenarios. CodeIF encompasses a broad range of tasks, including function synthesis, error debugging, algorithmic refactoring, and code explanation, thereby providing a comprehensive suite to evaluate model performance across varying complexity levels and programming domains. We conduct extensive experiments with LLMs, analyzing their strengths and limitations in meeting the demands of these tasks. The experimental results offer valuable insights into how well current models align with human instructions, as well as the extent to which they can generate consistent, maintainable, and contextually relevant code. Our findings not only underscore the critical role that instruction-following LLMs can play in modern software development, but also illuminate pathways for future research aimed at enhancing their adaptability, reliability, and overall effectiveness in automated code generation.

  • 6 authors
·
Feb 26, 2025

If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents

The prominent large language models (LLMs) of today differ from past language models not only in size, but also in the fact that they are trained on a combination of natural language and formal language (code). As a medium between humans and computers, code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity. In this survey, we present an overview of the various benefits of integrating code into LLMs' training data. Specifically, beyond enhancing LLMs in code generation, we observe that these unique properties of code help (i) unlock the reasoning ability of LLMs, enabling their applications to a range of more complex natural language tasks; (ii) steer LLMs to produce structured and precise intermediate steps, which can then be connected to external execution ends through function calls; and (iii) take advantage of code compilation and execution environment, which also provides diverse feedback for model improvement. In addition, we trace how these profound capabilities of LLMs, brought by code, have led to their emergence as intelligent agents (IAs) in situations where the ability to understand instructions, decompose goals, plan and execute actions, and refine from feedback are crucial to their success on downstream tasks. Finally, we present several key challenges and future directions of empowering LLMs with code.

  • 12 authors
·
Jan 1, 2024 1

LocAgent: Graph-Guided LLM Agents for Code Localization

Code localization--identifying precisely where in a codebase changes need to be made--is a fundamental yet challenging task in software maintenance. Existing approaches struggle to efficiently navigate complex codebases when identifying relevant code sections. The challenge lies in bridging natural language problem descriptions with the appropriate code elements, often requiring reasoning across hierarchical structures and multiple dependencies. We introduce LocAgent, a framework that addresses code localization through graph-based representation. By parsing codebases into directed heterogeneous graphs, LocAgent creates a lightweight representation that captures code structures (files, classes, functions) and their dependencies (imports, invocations, inheritance), enabling LLM agents to effectively search and locate relevant entities through powerful multi-hop reasoning. Experimental results on real-world benchmarks demonstrate that our approach significantly enhances accuracy in code localization. Notably, our method with the fine-tuned Qwen-2.5-Coder-Instruct-32B model achieves comparable results to SOTA proprietary models at greatly reduced cost (approximately 86% reduction), reaching up to 92.7% accuracy on file-level localization while improving downstream GitHub issue resolution success rates by 12% for multiple attempts (Pass@10). Our code is available at https://github.com/gersteinlab/LocAgent.

  • 9 authors
·
Mar 12, 2025 2

DiffTester: Accelerating Unit Test Generation for Diffusion LLMs via Repetitive Pattern

Software development relies heavily on extensive unit testing, which makes the efficiency of automated Unit Test Generation (UTG) particularly important. However, most existing LLMs generate test cases one token at a time in each forward pass, which leads to inefficient UTG. Recently, diffusion LLMs (dLLMs) have emerged, offering promising parallel generation capabilities and showing strong potential for efficient UTG. Despite this advantage, their application to UTG is still constrained by a clear trade-off between efficiency and test quality, since increasing the number of tokens generated in each step often causes a sharp decline in the quality of test cases. To overcome this limitation, we present DiffTester, an acceleration framework specifically tailored for dLLMs in UTG. The key idea of DiffTester is that unit tests targeting the same focal method often share repetitive structural patterns. By dynamically identifying these common patterns through abstract syntax tree analysis during generation, DiffTester adaptively increases the number of tokens produced at each step without compromising the quality of the output. To enable comprehensive evaluation, we extend the original TestEval benchmark, which was limited to Python, by introducing additional programming languages including Java and C++. Extensive experiments on three benchmarks with two representative models show that DiffTester delivers significant acceleration while preserving test coverage. Moreover, DiffTester generalizes well across different dLLMs and programming languages, providing a practical and scalable solution for efficient UTG in software development. Code and data are publicly available at https://github.com/wellbeingyang/DLM4UTG-open .

  • 4 authors
·
Sep 29, 2025 2

CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning

Understanding and reasoning about code semantics is essential for enhancing code LLMs' abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or educational coding problems and focus on coarse-grained reasoning tasks such as input/output prediction, limiting their effectiveness in evaluating LLMs in practical SE contexts. To bridge this gap, we propose CodeSense, the first benchmark that makes available a spectrum of fine-grained code reasoning tasks concerned with the software engineering of real-world code. We collected Python, C and Java software projects from real-world repositories. We executed tests from these repositories, collected their execution traces, and constructed a ground truth dataset for fine-grained semantic reasoning tasks. We then performed comprehensive evaluations on state-of-the-art LLMs. Our results show a clear performance gap for the models to handle fine-grained reasoning tasks. Although prompting techniques such as chain-of-thought and in-context learning helped, the lack of code semantics in LLMs fundamentally limit models' capabilities of code reasoning. Besides dataset, benchmark and evaluation, our work produced an execution tracing framework and tool set that make it easy to collect ground truth for fine-grained SE reasoning tasks, offering a strong basis for future benchmark construction and model post training. Our code and data are located at https://codesense-bench.github.io/.

  • 7 authors
·
May 31, 2025

CodeHalu: Code Hallucinations in LLMs Driven by Execution-based Verification

Large Language Models (LLMs) have made significant advancements in the field of code generation, offering unprecedented support for automated programming and assisting developers. However, LLMs sometimes generate code that appears plausible but fails to meet the expected requirements or executes incorrectly. This phenomenon of hallucinations in the coding field has not been explored. To advance the community's understanding and research on code hallucinations in LLMs, we propose a definition method for these hallucinations based on execution verification and introduce the concept of code hallucinations for the first time. We categorize code hallucinations into four main types: mapping, naming, resource, and logic hallucinations, each further divided into different subcategories to better understand and address the unique challenges faced by LLMs during code generation. To systematically evaluate code hallucinations, we propose a dynamic detection algorithm for code hallucinations and construct the CodeHalu benchmark, which includes 8,883 samples from 699 tasks, to actively detect hallucination phenomena in LLMs during programming. We tested 16 popular LLMs on this benchmark to evaluate the frequency and nature of their hallucinations during code generation. The findings reveal significant variations in the accuracy and reliability of LLMs in generating code, highlighting the urgent need to improve models and training methods to ensure the functional correctness and safety of automatically generated code. This study not only classifies and quantifies code hallucinations but also provides insights for future improvements in LLM-based code generation research. The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu.

  • 7 authors
·
Apr 30, 2024

CodeLSI: Leveraging Foundation Models for Automated Code Generation with Low-Rank Optimization and Domain-Specific Instruction Tuning

Context: Automated code generation using Foundation Models (FMs) offers promising solutions for enhancing software development efficiency. However, challenges remain in ensuring domain specificity, cost-effectiveness, and security - especially when relying on third-party APIs. This paper introduces CodeLSI, a framework that combines low-rank optimization and domain-specific instruction tuning to address these challenges. Objectives: The aim of this study is to develop and evaluate CodeLSI, a novel approach for generating high-quality code tailored to specific domains, using FMs fine-tuned on company infrastructure without dependence on external APIs. Methods: CodeLSI applies low-rank adaptation techniques to reduce the computational cost of model pre-training and fine-tuning. Domain-specific instruction tuning is employed to align code generation with organizational needs. We implemented and tested the framework on real-world JavaScript coding tasks using datasets drawn from internal software projects. Results: Experimental evaluations show that CodeLSI produces high-quality, context aware code. It outperforms baseline models in terms of relevance, accuracy, and domain fit. The use of low-rank optimization significantly reduced resource requirements, enabling scalable training on company-owned infrastructure. Conclusion: CodeLSI demonstrates that combining low-rank optimization with domain specific tuning can enhance the practicality and performance of FMs for automated code generation. This approach provides a secure, cost-efficient alternative to commercial API based solutions and supports faster, more targeted innovation in software development.

  • 7 authors
·
Sep 17, 2025

LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning

The automation of code review activities, a long-standing pursuit in software engineering, has been primarily addressed by numerous domain-specific pre-trained models. Despite their success, these models frequently demand extensive resources for pre-training from scratch. In contrast, Large Language Models (LLMs) provide an intriguing alternative, given their remarkable capabilities when supplemented with domain-specific knowledge. However, their potential for automating code review tasks remains largely unexplored. In response to this research gap, we present LLaMA-Reviewer, an innovative framework that leverages the capabilities of LLaMA, a popular LLM, in the realm of code review. Mindful of resource constraints, this framework employs parameter-efficient fine-tuning (PEFT) methods, delivering high performance while using less than 1% of trainable parameters. An extensive evaluation of LLaMA-Reviewer is conducted on two diverse, publicly available datasets. Notably, even with the smallest LLaMA base model consisting of 6.7B parameters and a limited number of tuning epochs, LLaMA-Reviewer equals the performance of existing code-review-focused models. The ablation experiments provide insights into the influence of various fine-tuning process components, including input representation, instruction tuning, and different PEFT methods. To foster continuous progress in this field, the code and all PEFT-weight plugins have been made open-source.

  • 5 authors
·
Aug 21, 2023 4

R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models

Code completion models have made significant progress in recent years. Recently, repository-level code completion has drawn more attention in modern software development, and several baseline methods and benchmarks have been proposed. However, existing repository-level code completion methods often fall short of fully using the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies. Besides, the existing benchmarks usually focus on limited code completion scenarios, which cannot reflect the repository-level code completion abilities well of existing methods. To address these limitations, we propose the R2C2-Coder to enhance and benchmark the real-world repository-level code completion abilities of code Large Language Models, where the R2C2-Coder includes a code prompt construction method R2C2-Enhance and a well-designed benchmark R2C2-Bench. Specifically, first, in R2C2-Enhance, we first construct the candidate retrieval pool and then assemble the completion prompt by retrieving from the retrieval pool for each completion cursor position. Second, based on R2C2 -Enhance, we can construct a more challenging and diverse R2C2-Bench with training, validation and test splits, where a context perturbation strategy is proposed to simulate the real-world repository-level code completion well. Extensive results on multiple benchmarks demonstrate the effectiveness of our R2C2-Coder.

  • 15 authors
·
Jun 3, 2024

System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts

Chain-of-thought (CoT) reasoning enables large language models (LLMs) to move beyond fast System-1 responses and engage in deliberative System-2 reasoning. However, this comes at the cost of significant inefficiency due to verbose intermediate output. Recent latent-space reasoning methods improve efficiency by operating on hidden states without decoding into language, yet they treat all steps uniformly, failing to distinguish critical deductions from auxiliary steps and resulting in suboptimal use of computational resources. In this paper, we propose System-1.5 Reasoning, an adaptive reasoning framework that dynamically allocates computation across reasoning steps through shortcut paths in latent space. Specifically, System-1.5 Reasoning introduces two types of dynamic shortcuts. The model depth shortcut (DS) adaptively reasons along the vertical depth by early exiting non-critical tokens through lightweight adapter branches, while allowing critical tokens to continue through deeper Transformer layers. The step shortcut (SS) reuses hidden states across the decoding steps to skip trivial steps and reason horizontally in latent space. Training System-1.5 Reasoning involves a two-stage self-distillation process: first distilling natural language CoT into latent-space continuous thought, and then distilling full-path System-2 latent reasoning into adaptive shortcut paths (System-1.5 Reasoning). Experiments on reasoning tasks demonstrate the superior performance of our method. For example, on GSM8K, System-1.5 Reasoning achieves reasoning performance comparable to traditional CoT fine-tuning methods while accelerating inference by over 20x and reducing token generation by 92.31% on average.

  • 4 authors
·
May 24, 2025 2

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

Large language models (LLMs) have manifested strong ability to generate codes for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks on algorithm and data science, insufficiently satisfying challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services, covering 6 different domains. Noting the extraordinary difficulty in creating testing cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Comparing with manual solutions, it achieves an efficiency increase of more than 4 times. Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores could still be significant, indicating a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. On the other hand, even the best-performing GPT-4 is still far from satisfying on NCB. The evaluation toolkit and development set are available at https://github.com/THUDM/NaturalCodeBench.

  • 9 authors
·
May 7, 2024

CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs. This limitation, stemming from static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To this end, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution, which covers real-world updates for 220 APIs from six Python libraries. Our benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset consisting of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). We believe that our benchmark can offer a strong foundation for the development of more effective methods for real-time code knowledge updating in the future. The experimental code and dataset are publicly available at: https://github.com/Lucky-voyage/Code-Sync.

  • 9 authors
·
Feb 23, 2025 2

Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning

Reasoning capability is pivotal for Large Language Models (LLMs) to solve complex tasks, yet achieving reliable and scalable reasoning remains challenging. While Chain-of-Thought (CoT) prompting has become a mainstream approach, existing methods often suffer from uncontrolled generation, insufficient quality, and limited diversity in reasoning paths. Recent efforts leverage code to enhance CoT by grounding reasoning in executable steps, but such methods are typically constrained to predefined mathematical problems, hindering scalability and generalizability. In this work, we propose Caco (Code-Assisted Chain-of-ThOught), a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data through code-driven augmentation. Unlike prior work, Caco first fine-tunes a code-based CoT generator on existing math and programming solutions in a unified code format, then scales the data generation to a large amount of diverse reasoning traces. Crucially, we introduce automated validation via code execution and rule-based filtering to ensure logical correctness and structural diversity, followed by reverse-engineering filtered outputs into natural language instructions and language CoTs to enrich task adaptability. This closed-loop process enables fully automated, scalable synthesis of reasoning data with guaranteed executability. Experiments on our created Caco-1.3M dataset demonstrate that Caco-trained models achieve strong competitive performance on mathematical reasoning benchmarks, outperforming existing strong baselines. Further analysis reveals that Caco's code-anchored verification and instruction diversity contribute to superior generalization across unseen tasks. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.

  • 8 authors
·
Oct 5, 2025 2

Granite Code Models: A Family of Open Foundation Models for Code Intelligence

Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code models family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device memory-constrained use cases. Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reaches state-of-the-art performance among available open-source code LLMs. The Granite Code model family was optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g. code generation, fixing and explanation), making it a versatile all around code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use.

  • 46 authors
·
May 7, 2024 1

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

While large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked on subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger by isolating, identifying, and resolving bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.

  • 4 authors
·
Oct 1, 2024 9

Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice

Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in its automation. However, existing benchmarks for LLM-based code review face three major limitations. Lack of semantic context: most benchmarks provide only code diffs without textual information such as issue descriptions, which are crucial for understanding developer intent. Data quality issues: without rigorous validation, many samples are noisy-e.g., reviews on outdated or irrelevant code-reducing evaluation reliability. Coarse granularity: most benchmarks operate at the file or commit level, overlooking the fine-grained, line-level reasoning essential for precise review. We introduce ContextCRBench, a high-quality, context-rich benchmark for fine-grained LLM evaluation in code review. Our construction pipeline comprises: Raw Data Crawling, collecting 153.7K issues and pull requests from top-tier repositories; Comprehensive Context Extraction, linking issue-PR pairs for textual context and extracting the full surrounding function or class for code context; and Multi-stage Data Filtering, combining rule-based and LLM-based validation to remove outdated, malformed, or low-value samples, resulting in 67,910 context-enriched entries. ContextCRBench supports three evaluation scenarios aligned with the review workflow: hunk-level quality assessment, line-level defect localization, and line-level comment generation. Evaluating eight leading LLMs (four closed-source and four open-source) reveals that textual context yields greater performance gains than code context alone, while current LLMs remain far from human-level review ability. Deployed at ByteDance, ContextCRBench drives a self-evolving code review system, improving performance by 61.98% and demonstrating its robustness and industrial utility. https://github.com/kinesiatricssxilm14/ContextCRBench.

  • 8 authors
·
Nov 10, 2025

AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities

AI agentic programming is an emerging paradigm in which large language models (LLMs) autonomously plan, execute, and interact with external tools like compilers, debuggers, and version control systems to iteratively perform complex software development tasks. Unlike conventional code generation tools, agentic systems are capable of decomposing high-level goals, coordinating multi-step processes, and adapting their behavior based on intermediate feedback. These capabilities are transforming the software development practice. As this emerging field evolves rapidly, there is a need to define its scope, consolidate its technical foundations, and identify open research challenges. This survey provides a comprehensive and timely review of AI agentic programming. We introduce a taxonomy of agent behaviors and system architectures, and examine core techniques including planning, memory and context management, tool integration, and execution monitoring. We also analyze existing benchmarks and evaluation methodologies used to assess coding agent performance. Our study identifies several key challenges, including limitations in handling long context, a lack of persistent memory across tasks, and concerns around safety, alignment with user intent, and collaboration with human developers. We discuss emerging opportunities to improve the reliability, adaptability, and transparency of agentic systems. By synthesizing recent advances and outlining future directions, this survey aims to provide a foundation for research and development in building the next generation of intelligent and trustworthy AI coding agents.

  • 4 authors
·
Aug 14, 2025

DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation

Code benchmarks such as HumanEval are widely adopted to evaluate the capabilities of Large Language Models (LLMs), providing insights into their strengths and weaknesses. However, current benchmarks primarily exercise LLMs' capability on common coding tasks (e.g., bubble sort, greatest common divisor), leaving domain-specific coding tasks (e.g., computation, system, cryptography) unexplored. To fill this gap, we propose a multi-domain code benchmark, DOMAINEVAL, designed to evaluate LLMs' coding capabilities thoroughly. Our pipeline works in a fully automated manner, enabling a push-bottom construction from code repositories into formatted subjects under study. Interesting findings are observed by evaluating 12 representative LLMs against DOMAINEVAL. We notice that LLMs are generally good at computation tasks while falling short on cryptography and system coding tasks. The performance gap can be as much as 68.94% (80.94% - 12.0%) in some LLMs. We also observe that generating more samples can increase the overall performance of LLMs, while the domain bias may even increase. The contributions of this study include a code generation benchmark dataset DOMAINEVAL, encompassing six popular domains, a fully automated pipeline for constructing code benchmarks, and an identification of the limitations of LLMs in code generation tasks based on their performance on DOMAINEVAL, providing directions for future research improvements. The leaderboard is available at https://domaineval.github.io/.

  • 7 authors
·
Aug 23, 2024

From CISC to RISC: language-model guided assembly transpilation

The transition from x86 to ARM architecture is becoming increasingly common across various domains, primarily driven by ARM's energy efficiency and improved performance across traditional sectors. However, this ISA shift poses significant challenges, mainly due to the extensive legacy ecosystem of x86 software and lack of portability across proprietary ecosystems and software stacks. This paper introduces CRT, a lightweight LLM-based transpiler that automatically converts x86 assembly to ARM assembly. Our approach bridges the fundamental architectural gap between x86's CISC-based and ARM's RISC-based computing paradigms while preserving program semantics and optimizing performance. We evaluate CRT on diverse real-world applications, achieving 79.25% translation accuracy from x86 to ARMv5 on our comprehensive test suite, and an 88.68% accuracy from x86 to RISC-V. In practical deployments on Apple M2 hardware (ARMv8), our transpiled code achieves 1.73times speedup compared to Apple's Rosetta 2 virtualization engine, while delivering 2.41times memory efficiency and 1.47times better energy consumption. Through testing and analysis, we show that CRT successfully navigates the CISC/RISC divide and generates correctly executable RISC code despite machine ``language'' barriers. We release our code, models, training datasets, and benchmarks at: https://ahmedheakl.github.io/asm2asm/.

Reasoning Runtime Behavior of a Program with LLM: How Far Are We?

Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and ClassEval). Code reasoning is one of the most essential abilities of code LLMs, but existing benchmarks for code reasoning are not sufficient. Typically, they focus on predicting the input and output of a program, ignoring the evaluation of the intermediate behavior during program execution, as well as the logical consistency (e.g., the model should not give the correct output if the prediction of execution path is wrong) when performing the reasoning. To address these problems, in this paper, we propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution. We utilize existing code benchmarks and adapt them to new benchmarks within our framework. A large-scale empirical study is conducted and most LLMs show unsatisfactory performance on both Runtime Behavior Reasoning (i.e., an average accuracy of 44.4%) and Incremental Consistency Evaluation (i.e., an average IC score of 10.3). Evaluation results of current code LLMs reflect the urgent need for the community to strengthen the code reasoning capability of code LLMs. Our code, data, and \newname leaderboard are available at https://r-eval.github.io.

  • 6 authors
·
Mar 25, 2024

Effi-Code: Unleashing Code Efficiency in Language Models

As the use of large language models (LLMs) for code generation becomes more prevalent in software development, it is critical to enhance both the efficiency and correctness of the generated code. Existing methods and models primarily focus on the correctness of LLM-generated code, ignoring efficiency. In this work, we present Effi-Code, an approach to enhancing code generation in LLMs that can improve both efficiency and correctness. We introduce a Self-Optimization process based on Overhead Profiling that leverages open-source LLMs to generate a high-quality dataset of correct and efficient code samples. This dataset is then used to fine-tune various LLMs. Our method involves the iterative refinement of generated code, guided by runtime performance metrics and correctness checks. Extensive experiments demonstrate that models fine-tuned on the Effi-Code show significant improvements in both code correctness and efficiency across task types. For example, the pass@1 of DeepSeek-Coder-6.7B-Instruct generated code increases from 43.3\% to 76.8\%, and the average execution time for the same correct tasks decreases by 30.5\%. Effi-Code offers a scalable and generalizable approach to improving code generation in AI systems, with potential applications in software development, algorithm design, and computational problem-solving. The source code of Effi-Code was released in https://github.com/huangd1999/Effi-Code.

  • 9 authors
·
Oct 14, 2024

Stable Code Technical Report

We introduce Stable Code, the first in our new-generation of code language models series, which serves as a general-purpose base code language model targeting code completion, reasoning, math, and other software engineering-based tasks. Additionally, we introduce an instruction variant named Stable Code Instruct that allows conversing with the model in a natural chat interface for performing question-answering and instruction-based tasks. In this technical report, we detail the data and training procedure leading to both models. Their weights are available via Hugging Face for anyone to download and use at https://huggingface.co/stabilityai/stable-code-3b and https://huggingface.co/stabilityai/stable-code-instruct-3b. This report contains thorough evaluations of the models, including multilingual programming benchmarks, and the MT benchmark focusing on multi-turn dialogues. At the time of its release, Stable Code is the state-of-the-art open model under 3B parameters and even performs comparably to larger models of sizes 7 billion and 15 billion parameters on the popular Multi-PL benchmark. Stable Code Instruct also exhibits state-of-the-art performance on the MT-Bench coding tasks and on Multi-PL completion compared to other instruction tuned models. Given its appealing small size, we also provide throughput measurements on a number of edge devices. In addition, we open source several quantized checkpoints and provide their performance metrics compared to the original model.

  • 11 authors
·
Apr 1, 2024

GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models

The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent version updates while maintaining compatibility with previous versions. Existing code completion benchmarks often overlook this dynamic aspect, and the one that does consider it relies on static code prediction tasks without execution-based evaluation, offering a limited perspective on a model's practical usability. To address this gap, we introduce \GitChameleon{}, a novel, manually curated dataset comprising 116 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. is designed to rigorously assess the ability of modern large language models (LLMs) to generate version-specific code that is not only syntactically correct but also functionally accurate upon execution. Our comprehensive evaluations reveal that state-of-the-art LLMs struggle with this task; for instance, GPT-4o achieves a pass@10 of only 39.9\% (43.7\% when provided with error feedback), highlighting the complexity of the problem and the limitations of current models. By providing an execution-based benchmark that emphasizes the dynamic nature of code libraries, serves as a critical tool to advance the development of more adaptable and reliable code generation models. For facilitation for further exploration of version-conditioned code generation, we make our code repository publicly accessible at https://github.com/NizarIslah/GitChameleon.

  • 7 authors
·
Nov 5, 2024 2

Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM

LLMs have become the go-to choice for code generation tasks, with an exponential increase in the training, development, and usage of LLMs specifically for code generation. To evaluate the ability of LLMs on code, both academic and industry practitioners rely on popular handcrafted benchmarks. However, prior benchmarks contain only a very limited set of problems, both in quantity and variety. Further, due to popularity and age, many benchmarks are prone to data leakage where example solutions can be readily found on the web and thus potentially in training data. Such limitations inevitably lead us to inquire: Is the leaderboard performance on existing benchmarks reliable and comprehensive enough to measure the program synthesis ability of LLMs? To address this, we introduce EvoEval -- a program synthesis benchmark suite created by evolving existing benchmarks into different targeted domains for a comprehensive evaluation of LLM coding abilities. Our study on 51 LLMs shows that compared to the high performance obtained on standard benchmarks like HumanEval, there is a significant drop in performance (on average 39.4%) when using EvoEval. Additionally, the decrease in performance can range from 19.6% to 47.7%, leading to drastic ranking changes amongst LLMs and showing potential overfitting of existing benchmarks. Furthermore, we showcase various insights, including the brittleness of instruction-following models when encountering rewording or subtle changes as well as the importance of learning problem composition and decomposition. EvoEval not only provides comprehensive benchmarks, but can be used to further evolve arbitrary problems to keep up with advances and the ever-changing landscape of LLMs for code. We have open-sourced our benchmarks, tools, and complete LLM generations at https://github.com/evo-eval/evoeval

  • 3 authors
·
Mar 27, 2024

Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond

Recently, fine-tuning pre-trained code models such as CodeBERT on downstream tasks has achieved great success in many software testing and analysis tasks. While effective and prevalent, fine-tuning the pre-trained parameters incurs a large computational cost. In this paper, we conduct an extensive experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning. We then propose efficient alternatives to fine-tune the large pre-trained code model based on the above findings. Our experimental study shows that (1) lexical, syntactic and structural properties of source code are encoded in the lower, intermediate, and higher layers, respectively, while the semantic property spans across the entire model. (2) The process of fine-tuning preserves most of the code properties. Specifically, the basic code properties captured by lower and intermediate layers are still preserved during fine-tuning. Furthermore, we find that only the representations of the top two layers change most during fine-tuning for various downstream tasks. (3) Based on the above findings, we propose Telly to efficiently fine-tune pre-trained code models via layer freezing. The extensive experimental results on five various downstream tasks demonstrate that training parameters and the corresponding time cost are greatly reduced, while performances are similar or better. Replication package including source code, datasets, and online Appendix is available at: https://github.com/DeepSoftwareAnalytics/Telly.

  • 7 authors
·
Apr 11, 2023

Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal

Recently, Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in code reasoning by scaling up the length of Chain-of-Thought (CoT). However, excessively long reasoning traces introduce substantial challenges in terms of training cost, inference latency, and deployment feasibility. While various CoT compression approaches have emerged to address this challenge, they face inherent trade-offs: token-level methods often disrupt syntactic and logical coherence, while step-level methods based on perplexity fail to reliably capture the logically critical reasoning steps. In this paper, we propose ASAP (Anchor-guided, Surprisal-based Pruning), a novel coarse-to-fine framework for CoT compression. ASAP first performs anchor-guided pruning to preserve the core reasoning structure, which efficiently reduces the search space for subsequent processing. It then enables a logic-aware pruning by selecting logically essential reasoning steps based on a novel first-token surprisal metric. Finally, ASAP teaches models to autonomously generate and leverage these concise CoTs at inference time, enabling efficient reasoning in coding tasks. Experiments show that ASAP achieves state-of-the-art accuracy across multiple code generation benchmarks while substantially reducing training and inference costs. On the challenging LiveCodeBench v4_v5 benchmark, our approach reduces token generation by 23.5% and inference latency by 43.5% compared to the strongest baseline, while achieving a competitive accuracy of 36.19% in Pass@1. Our results highlight a promising direction for building powerful and efficient LRMs.

  • 7 authors
·
Aug 7, 2025 3

SelfCodeAlign: Self-Alignment for Code Generation

Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that the base models can benefit more from alignment with their own data distribution. We further validate each component's effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance.

  • 10 authors
·
Oct 31, 2024 2

AutoBaxBuilder: Bootstrapping Code Security Benchmarking

As LLMs see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work has demonstrated that security is often overlooked, exposing that LLMs are prone to generating code with security vulnerabilities. These insights were enabled by specialized benchmarks, crafted through significant manual effort by security experts. However, relying on manually-crafted benchmarks is insufficient in the long term, because benchmarks (i) naturally end up contaminating training data, (ii) must extend to new tasks to provide a more complete picture, and (iii) must increase in difficulty to challenge more capable LLMs. In this work, we address these challenges and present AutoBaxBuilder, a framework that generates tasks and tests for code security benchmarking from scratch. We introduce a robust pipeline with fine-grained plausibility checks, leveraging the code understanding capabilities of LLMs to construct functionality tests and end-to-end security-probing exploits. To confirm the quality of the generated benchmark, we conduct both a qualitative analysis and perform quantitative experiments, comparing it against tasks constructed by human experts. We use AutoBaxBuilder to construct entirely new tasks and release them to the public as AutoBaxBench, together with a thorough evaluation of the security capabilities of LLMs on these tasks. We find that a new task can be generated in under 2 hours, costing less than USD 10.

  • 5 authors
·
Dec 24, 2025

RLCoder: Reinforcement Learning for Repository-Level Code Completion

Repository-level code completion aims to generate code for unfinished code snippets within the context of a specified repository. Existing approaches mainly rely on retrieval-augmented generation strategies due to limitations in input sequence length. However, traditional lexical-based retrieval methods like BM25 struggle to capture code semantics, while model-based retrieval methods face challenges due to the lack of labeled data for training. Therefore, we propose RLCoder, a novel reinforcement learning framework, which can enable the retriever to learn to retrieve useful content for code completion without the need for labeled data. Specifically, we iteratively evaluate the usefulness of retrieved content based on the perplexity of the target code when provided with the retrieved content as additional context, and provide feedback to update the retriever parameters. This iterative process enables the retriever to learn from its successes and failures, gradually improving its ability to retrieve relevant and high-quality content. Considering that not all situations require information beyond code files and not all retrieved context is helpful for generation, we also introduce a stop signal mechanism, allowing the retriever to decide when to retrieve and which candidates to retain autonomously. Extensive experimental results demonstrate that RLCoder consistently outperforms state-of-the-art methods on CrossCodeEval and RepoEval, achieving 12.2% EM improvement over previous methods. Moreover, experiments show that our framework can generalize across different programming languages and further improve previous methods like RepoCoder. We provide the code and data at https://github.com/DeepSoftwareAnalytics/RLCoder.

  • 7 authors
·
Jul 28, 2024

AutoCodeRover: Autonomous Program Improvement

Researchers have made significant progress in automating the software development process in the past decades. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use LLM-based programming assistants to achieve automated coding. Nevertheless, software engineering involves the process of program improvement apart from coding, specifically to enable software maintenance (e.g. bug fixing) and software evolution (e.g. feature additions). In this paper, we propose an automated approach for solving GitHub issues to autonomously achieve program improvement. In our approach called AutoCodeRover, LLMs are combined with sophisticated code search capabilities, ultimately leading to a program modification or patch. In contrast to recent LLM agent approaches from AI researchers and practitioners, our outlook is more software engineering oriented. We work on a program representation (abstract syntax tree) as opposed to viewing a software project as a mere collection of files. Our code search exploits the program structure in the form of classes/methods to enhance LLM's understanding of the issue's root cause, and effectively retrieve a context via iterative search. The use of spectrum-based fault localization using tests, further sharpens the context, as long as a test-suite is available. Experiments on SWE-bench-lite (300 real-life GitHub issues) show increased efficacy in solving GitHub issues (19% on SWE-bench-lite), which is higher than the efficacy of the recently reported SWE-agent. In addition, AutoCodeRover achieved this efficacy with significantly lower cost (on average, $0.43 USD), compared to other baselines. We posit that our workflow enables autonomous software engineering, where, in future, auto-generated code from LLMs can be autonomously improved.

  • 4 authors
·
Apr 8, 2024

Idioms: Neural Decompilation With Joint Code and Type Prediction

Decompilers are important tools for reverse engineers that help them analyze software at a higher level of abstraction than assembly. Unfortunately, because compilation is lossy, deterministic decompilers produce code that is missing many of the details that make source code readable in the first place, like variable names and types. Neural decompilers, on the other hand, offer the ability to statistically fill in these details. Existing work in neural decompilation, however, suffers from substantial drawbacks that limits its ability to handle real code: it is unable to handle user-defined composite types, which are essential to fully specifying many functions' semantics, or require test cases. In this work, we introduce a new training process to finetune any LLM into a neural decompiler capable of generating the appropriate user-defined types alongside the decompilation. We introduce a new dataset, Realtype, that includes substantially more complicated and realistic types than existing neural decompilation benchmarks. Motivated by the intuition that different parts of data structures can be operated upon by different parts of the program, we show that interprocedural context can help improve neural decompilers' ability to handle user-defined types. We show that our training process yields state-of-the-art results in neural decompilation. We also publicly release the Idioms series of finetuned neural decompilation models in support of open science. In summary, we identify the need for joint code and type prediction, show that it is a hard problem, and take the first steps towards solving it.

  • 3 authors
·
Feb 6, 2025

CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. To bridge this gap, we introduce CodeElo, a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time. CodeElo benchmark is mainly based on the official CodeForces platform and tries to align with the platform as much as possible. We compile the recent six months of contest problems on CodeForces with detailed information such as contest divisions, problem difficulty ratings, and problem algorithm tags. We introduce a unique judging method in which problems are submitted directly to the platform and develop a reliable Elo rating calculation system that aligns with the platform and is comparable with human participants but has lower variance. By testing on our CodeElo, we provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time. The results show that o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems, placing in the lowest 20 percent among all human participants. Detailed analysis experiments are also conducted to provide insights into performance across algorithms and comparisons between using C++ and Python, which can suggest directions for future studies.

  • 17 authors
·
Jan 2, 2025 6

Evaluating and Aligning CodeLLMs on Human Preference

Code large language models (codeLLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises along with the corresponding test cases, are used as a common measure to evaluate the performance and capabilities of code LLMs. However, the current code LLMs focus on synthesizing the correct code snippet, ignoring the alignment with human preferences, where the query should be sampled from the practical application scenarios and the model-generated responses should satisfy the human preference. To bridge the gap between the model-generated response and human preference, we present a rigorous human-curated benchmark CodeArena to emulate the complexity and diversity of real-world coding tasks, where 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries. Further, we propose a diverse synthetic instruction corpus SynCode-Instruct (nearly 20B tokens) by scaling instructions from the website to verify the effectiveness of the large-scale synthetic instruction fine-tuning, where Qwen2.5-SynCoder totally trained on synthetic instruction data can achieve top-tier performance of open-source code LLMs. The results find performance differences between execution-based benchmarks and CodeArena. Our systematic experiments of CodeArena on 40+ LLMs reveal a notable performance gap between open SOTA code LLMs (e.g. Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring the importance of the human preference alignment.\url{https://codearenaeval.github.io/ }

  • 10 authors
·
Dec 6, 2024 2