Build-bench: Can Language Models Go Beyond Coding?

Main results

Leaderboard

Build success rates and efficiency under the Build-bench protocol. The table reports the main full-file repair setting and includes Qwen2.5-3B-Instruct as a lightweight, reproducible small language model (SLM) baseline.

Build success rates and efficiency across two migration directions.
Rank	Model / Agent	x86_64 → aarch64 (%)	aarch64 → x86_64 (%)	Avg. Success (%)	Avg. Time (min)	Avg. Tokens (K)
1	GPT-5	63.19	29.52	46.36	24.87	1674.79
2	GPT-5-mini	28.83	26.67	27.75	14.09	1789.28
3	GPT-4o	13.50	12.38	12.94	5.88	577.89
4	Qwen3-max	17.18	5.71	11.45	31.58	432.28
5	Claude Sonnet 4.5	9.82	5.71	7.77	5.40	330.88
6	DeepSeek V3	7.98	3.81	5.90	15.32	340.28
7	Qwen2.5-3B-Instruct	4.91	1.90	3.41	25.99	482.86

Avg. Success is the unweighted mean of the two directional success rates. The main protocol uses N_max = 3 repair iterations and T_max = 20 tool calls per iteration.
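The averaging rule can be checked directly against the leaderboard. The following is a minimal sketch: the rates are copied from the table, the package counts (163 and 105) come from the corpus statistics on this page, and `pooled_rate` is a hypothetical helper added only to show what "unweighted" excludes. The pooled figure is not a number reported by the source, and it assumes each directional rate is computed over every package in that direction.

```python
# Directional success rates (%) and Avg. Success (%) copied from the leaderboard.
LEADERBOARD = {
    "GPT-5": (63.19, 29.52, 46.36),
    "GPT-5-mini": (28.83, 26.67, 27.75),
    "GPT-4o": (13.50, 12.38, 12.94),
    "Qwen3-max": (17.18, 5.71, 11.45),
    "Claude Sonnet 4.5": (9.82, 5.71, 7.77),
    "DeepSeek V3": (7.98, 3.81, 5.90),
    "Qwen2.5-3B-Instruct": (4.91, 1.90, 3.41),
}

# Package counts per direction, taken from the corpus statistics.
N_FORWARD, N_BACKWARD = 163, 105


def unweighted_mean(forward, backward):
    """Avg. Success as defined in the table note."""
    return (forward + backward) / 2


def pooled_rate(forward, backward):
    """Package-weighted rate. NOT reported by the source; shown for contrast only."""
    total = N_FORWARD + N_BACKWARD
    return (forward * N_FORWARD + backward * N_BACKWARD) / total


for model, (fwd, bwd, reported) in LEADERBOARD.items():
    mean = unweighted_mean(fwd, bwd)
    # Tolerance covers the two-decimal rounding of the published values.
    assert abs(mean - reported) < 0.006, model
    print(f"{model:20s} mean={mean:.3f} pooled={pooled_rate(fwd, bwd):.3f}")
```

Because the forward direction contains more packages, a pooled rate would weight it more heavily; the unweighted mean treats both migration directions equally.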

Benchmark

Overview of Build-bench

Most software-engineering benchmarks evaluate source-level coding, issue resolution, or test-passing behavior in homogeneous environments. Build-bench targets a different capability: repairing software packages whose build breaks when migrating across instruction set architectures.

Each task starts from a package that builds successfully on a source architecture but fails on the target architecture. The agent must inspect the package, localize the build failure, modify code or configuration files, upload the repaired package to Open Build Service, and use the executable build result as feedback.

A repair is counted as successful only when the package rebuilds successfully on the target architecture.

268 reproducible failed packages

163 x86_64 → aarch64 failures

105 aarch64 → x86_64 failures

95.90% LLM-assisted labels matching human consensus

Corpus composition

Task categories

The 268 failures are grouped into five expert-validated classes that cover the build process from environment setup to runtime validation.

78 packages

Build Preparation Error

Missing macros, incompatible toolchains, unresolved dependencies, invalid compiler flags, or duplicate arguments that prevent configuration.

Repair signal: dependency declarations, architecture conditions, package metadata, and compiler setup.

116 packages

Compilation Error

Build-system incompatibilities, linker-level failures, type mismatches, missing headers, language-standard conflicts, or warnings promoted to errors.

Repair signal: source files, build scripts, toolchain behavior, and compiler diagnostics.

24 packages

Packaging Error

Missing binaries, unreferenced documentation, failing RPM scripts, failed install targets, duplicate installs, or policy violations.

Repair signal: install paths, spec sections, generated artifacts, and packaging rules.

42 packages

Test Failure

Functional assertion mismatches, restricted runtime environments, missing test dependencies, crashes, timeouts, or resource-related failures.

Repair signal: test logs, runtime behavior, dependency setup, and architecture-specific assumptions.

8 packages

Environment / Infrastructure Error

Host, VM, or virtualization interruptions that terminate the build outside ordinary source-level repair.

Repair signal: external build-service state and reproducibility checks.

Evaluation design

Inputs, tools, and verification loop

Step 1

Input & diagnosis context

Complete source package, source archive, specification and metadata files, build scripts, and architecture-specific failed build logs.

Step 2

Tool-augmented repair

Structure extraction, file-content extraction, decompression, compression, content modification, OBS upload, and build-result checking through a unified MCP-style interface.

Step 3

Executable validation

Rebuild the updated package on OBS and return the latest build result, build log, and previous repair content for the next iteration when needed.

Step 4

Metrics

Build success rate, average repair time, average token consumption, and tool invocation behavior under a fixed iteration and tool-call budget.
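The four steps above can be sketched as a budgeted loop. This is an illustrative outline, not the benchmark's implementation: `propose_tool_call`, `rebuild`, `BuildResult`, and `RepairOutcome` are hypothetical names, and the sketch treats the rebuild as a separate step for simplicity, whereas the page lists OBS upload and build-result checking among the tools. Only the budget values (3 iterations, 20 tool calls per iteration) come from the source.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

N_MAX = 3   # repair iterations (main protocol)
T_MAX = 20  # tool calls allowed per iteration (main protocol)


@dataclass
class BuildResult:
    succeeded: bool
    log: str


@dataclass
class RepairOutcome:
    succeeded: bool
    iterations_used: int
    tool_calls: List[int] = field(default_factory=list)


def run_repair(
    propose_tool_call: Callable[[str, Optional[str]], Optional[str]],
    rebuild: Callable[[], BuildResult],
    initial_log: str,
) -> RepairOutcome:
    """Run up to N_MAX iterations, each capped at T_MAX tool calls."""
    log = initial_log
    previous_repair: Optional[str] = None
    outcome = RepairOutcome(succeeded=False, iterations_used=0)
    for iteration in range(1, N_MAX + 1):
        calls = 0
        while calls < T_MAX:
            action = propose_tool_call(log, previous_repair)
            if action is None:  # agent signals that it has finished editing
                break
            calls += 1
            previous_repair = action
        outcome.tool_calls.append(calls)
        outcome.iterations_used = iteration
        result = rebuild()  # executable validation on the target architecture
        if result.succeeded:
            outcome.succeeded = True
            break
        log = result.log  # the latest build log feeds the next iteration
    return outcome


# Toy stand-ins: an agent that never stops by itself and a build that
# passes on the second rebuild.
attempts = {"count": 0}


def demo_rebuild() -> BuildResult:
    attempts["count"] += 1
    return BuildResult(attempts["count"] >= 2, f"build log #{attempts['count']}")


def demo_agent(log: str, previous_repair: Optional[str]) -> Optional[str]:
    return "edit"


demo = run_repair(demo_agent, demo_rebuild, "initial failed build log")
print(demo)
```

In the toy run the agent exhausts the tool-call cap in both iterations and the loop stops as soon as a rebuild succeeds, which mirrors the rule that success is decided only by the executable build result.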

Repair modes

Repair strategies

Full File Generation

The model regenerates the complete faulty file while preserving structure, style, and necessary context.

Better for broad contextual changes.
More robust when cross-file consistency matters.
Higher latency and token cost.

Patch Generation

The model emits line-level unified diffs that are automatically applied to the source tree.

Faster and cheaper across most models.
Useful for localized edits.
Can fail when patch formatting or context is incomplete.
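To make the patch format concrete, the sketch below produces a line-level unified diff with Python's standard `difflib` module. The spec fragment, the `--enable-sse2` flag, and the file name `example.spec` are invented for illustration and do not come from the benchmark; the example only shows the kind of output a model is expected to emit in Patch Generation mode.

```python
import difflib

# Hypothetical spec fragment that hard-codes an x86-only configure option.
original = """%build
%configure --enable-sse2
make %{?_smp_mflags}
""".splitlines(keepends=True)

# Hypothetical repair: guard the option with an architecture condition.
repaired = """%build
%ifarch x86_64
%configure --enable-sse2
%else
%configure
%endif
make %{?_smp_mflags}
""".splitlines(keepends=True)

patch = "".join(
    difflib.unified_diff(
        original,
        repaired,
        fromfile="a/example.spec",
        tofile="b/example.spec",
    )
)
print(patch)
```

The hunk header and the unchanged context lines are what a patch tool uses to locate the edit, which is why incomplete context or malformed headers can make an otherwise correct repair fail to apply.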

Findings

Key results

Current LLMs can repair real cross-ISA failures, but the task remains difficult.

GPT-5 reaches the best full-file success rate, repairing 63.19% of x86_64 → aarch64 failures and 29.52% of aarch64 → x86_64 failures. All models still leave substantial headroom.

Agentic orchestration is essential.

Bare single-shot repair performs poorly: without tool use and iterative feedback, GPT-5 reaches only 6.13% in the forward direction (x86_64 → aarch64), compared with 63.19% under the full Build-bench protocol.

Iterative feedback produces cumulative gains.

GPT-5 improves from 36.81% after the first iteration to 63.19% after the third in x86_64 → aarch64 migration.
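These percentages can be translated into approximate package counts. The sketch assumes that each rate is computed over all 163 x86_64 → aarch64 packages; that denominator is an inference from the corpus statistics, not something the page states, and `implied_count` is a hypothetical helper.

```python
FORWARD_TOTAL = 163  # x86_64 → aarch64 failures in the corpus


def implied_count(rate_percent, total):
    """Nearest whole number of packages matching a reported success rate."""
    return round(rate_percent / 100 * total)


after_first = implied_count(36.81, FORWARD_TOTAL)
after_third = implied_count(63.19, FORWARD_TOTAL)

# Converting back should reproduce the reported two-decimal percentages.
assert round(100 * after_first / FORWARD_TOTAL, 2) == 36.81
assert round(100 * after_third / FORWARD_TOTAL, 2) == 63.19

print(after_first, after_third, after_third - after_first)
```

Under this assumption the rates correspond to 60 repaired packages after the first iteration and 103 after the third, so the second and third iterations together recover 43 additional packages.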

Repair granularity changes the cost-completeness balance.

Patch Generation reduces GPT-5 forward repair time from 31.18 to 8.93 minutes and token usage from 1830.91K to 761.88K, while Full File Generation remains more robust for broader consistency.
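The size of these savings is easier to compare as relative reductions. The short calculation below uses only the four figures quoted above; `relative_reduction` is a hypothetical helper.

```python
def relative_reduction(before, after):
    """Percentage decrease when moving from `before` to `after`."""
    return (before - after) / before * 100


# GPT-5, forward direction: Full File Generation vs. Patch Generation.
time_saving = relative_reduction(31.18, 8.93)       # minutes
token_saving = relative_reduction(1830.91, 761.88)  # thousands of tokens

print(f"time: {time_saving:.1f}% lower, tokens: {token_saving:.1f}% lower")
```

By this calculation, repair time falls by about 71% and token usage by about 58%, so the time saving is proportionally larger than the token saving.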

Failure analysis

Where models still fail

Long and interleaved build logs

Models can miss the causal error among noisy compiler output, repeated dependency traces, and downstream packaging messages.
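One simple mitigation is to surface the earliest error-like line before passing the log to a model. The sketch below is a heuristic of that kind, not a technique described by the benchmark: the regular expression, the `first_error_with_context` helper, and the sample log are all invented for illustration.

```python
import re

# Heuristic patterns for lines that usually mark the first real failure.
ERROR_PATTERN = re.compile(
    r"\berror\b[:\s]|undefined reference|No such file or directory",
    re.IGNORECASE,
)


def first_error_with_context(log_text, before=1, after=1):
    """Return the first error-like line plus a few neighbouring lines."""
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, index - before)
            return lines[start:index + after + 1]
    return []


# Invented log excerpt: the causal error precedes the downstream messages.
SAMPLE_LOG = """[  12s] checking for gcc... gcc
[  40s] cc -c simd.c -o simd.o
[  41s] simd.c:3:10: fatal error: emmintrin.h: No such file or directory
[  41s] compilation terminated.
[  41s] make: *** [Makefile:12: simd.o] Error 1
[  42s] error: Bad exit status from /var/tmp/rpm-tmp.abc (%build)
"""

excerpt = first_error_with_context(SAMPLE_LOG)
print("\n".join(excerpt))
```

In the sample, the missing x86-specific header is the cause, while the later `make` and RPM messages are consequences; picking the earliest match keeps the excerpt focused on the former. Real logs with interleaved parallel output would need more than a first-match rule.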

Incomplete or premature output

Some repairs are syntactically plausible but incomplete, especially when coordinating several files or regenerating long configuration content.

Incorrect tool sequences

Agents sometimes modify decompressed files but upload stale or unrecompressed packages, leading to invalid OBS submissions.
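A harness could guard against this failure by tracking whether extracted files have changed since the archive was last compressed. The class below is a hypothetical sketch of such a guard; `PackageWorkspace` and its methods are invented names that loosely mirror the content-modification, compression, and upload tools listed earlier, and nothing here reflects the benchmark's actual tool interface.

```python
class PackageWorkspace:
    """Tracks whether extracted files are newer than the packed archive."""

    def __init__(self):
        self.dirty = False  # True when edits have not been recompressed
        self.uploads = 0

    def modify(self, path, new_content):
        # A real tool would write `new_content` to `path`; here only the
        # state change matters.
        self.dirty = True

    def compress(self):
        # A real tool would rebuild the source archive from the edited tree.
        self.dirty = False

    def upload(self):
        if self.dirty:
            raise RuntimeError(
                "extracted files changed since the last compression; "
                "recompress before uploading"
            )
        self.uploads += 1
        return "uploaded"


workspace = PackageWorkspace()
workspace.modify("src/simd.c", "/* patched */")
try:
    workspace.upload()  # rejected: the archive is stale
except RuntimeError as error:
    print(f"blocked: {error}")
workspace.compress()
print(workspace.upload())
```

Rejecting the stale upload locally turns a silent invalid submission into an explicit error the agent can react to within its tool-call budget.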

Redundant verification loops

Some models repeatedly check build results without substantive changes, increasing runtime without improving success.

Cross-file consistency failures

Architecture migration often requires synchronized changes across spec files, source code, compiler flags, and generated artifacts.

Appendix material

Full experimental tables

Paper Table 2. Performance of LLMs on cross-ISA build failures in both migration directions.

Paper Table 4. Iteration-wise improvement in build success rate across three repair iterations.

Paper Table 5. GPT-5 repair success rates by failure category under the two repair strategies.

Background

Why it matters and what we contribute

01

Architecture-aware system repair

Tasks involve package-level failures caused by heterogeneous ISAs, toolchains, dependencies, build scripts, packaging rules, and long build logs.

02

Executable verification

A task succeeds only when the modified package is rebuilt successfully on the target architecture through Open Build Service.

03

End-to-end evaluation framework

Repair attempts are validated by executable rebuilds, not by textual similarity, static prediction, or single-file tests.

04

Empirical baselines and insights

Results expose model gaps in log comprehension, procedural tool use, cross-file reasoning, and architecture-specific adaptation.

Reference

Citation

@misc{zhao2025languagemodelscodingassessing,
  title={Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems},
  author={Chenyu Zhao and Shenglin Zhang and Zeshun Huang and Weilin Jin and Yongqian Sun and Dan Pei and Chaoyun Zhang and Qingwei Lin and Chetan Bansal and Saravan Rajmohan and Minghua Ma},
  year={2025},
  eprint={2511.00780},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2511.00780}
}