Executable · Architecture-aware · Tool-augmented

Build-bench

Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

Chenyu Zhao¹, Shenglin Zhang¹*, Zeshun Huang¹, Weilin Jin², Yongqian Sun¹, Dan Pei³, Chaoyun Zhang⁴, Qingwei Lin⁴, Chetan Bansal⁴, Saravan Rajmohan⁴, Minghua Ma⁴

¹Nankai University · ²Peking University · ³Tsinghua University · ⁴Microsoft

*Corresponding author

Build-bench evaluates whether language-model agents can diagnose, repair, and verify real-world software packages that fail during migration between x86_64 and aarch64. It combines reproducible cross-ISA failures, standardized MCP-style tool orchestration, iterative build feedback, and executable verification on Open Build Service.

268 failed packages
2 migration directions
5 failure classes
3 repair iterations

Main results

Leaderboard

Build success rates and efficiency under the Build-bench protocol. The table reports the main full-file repair setting and includes Qwen2.5-3B-Instruct as a lightweight, reproducible small language model (SLM) baseline.

Build success rates and efficiency across two migration directions.
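| Rank | Model / Agent | x86_64 → aarch64 (%) | aarch64 → x86_64 (%) | Avg. Success (%) | Avg. Time (min) | Avg. Tokens (K) |
|------|---------------|----------------------|----------------------|------------------|-----------------|-----------------|
| 1 | GPT-5 | 63.19 | 29.52 | 46.36 | 24.87 | 1674.79 |
| 2 | GPT-5-mini | 28.83 | 26.67 | 27.75 | 14.09 | 1789.28 |
| 3 | GPT-4o | 13.50 | 12.38 | 12.94 | 5.88 | 577.89 |
| 4 | Qwen3-max | 17.18 | 5.71 | 11.45 | 31.58 | 432.28 |
| 5 | Claude Sonnet 4.5 | 9.82 | 5.71 | 7.77 | 5.40 | 330.88 |
| 6 | DeepSeek V3 | 7.98 | 3.81 | 5.90 | 15.32 | 340.28 |
| 7 | Qwen2.5-3B-Instruct | 4.91 | 1.90 | 3.41 | 25.99 | 482.86 |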

Avg. Success is the unweighted mean of the two directional success rates. The main protocol uses N_max = 3 repair iterations and T_max = 20 tool calls per iteration.
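The averaging rule can be checked directly against the leaderboard. The sketch below recomputes Avg. Success from the two directional rates; the package-weighted variant is shown only for contrast and assumes the direction sizes of 163 and 105 packages reported elsewhere on this page. Function names are illustrative.

```python
# Directional success rates (%) copied from the leaderboard:
# (x86_64 -> aarch64, aarch64 -> x86_64).
RATES = {
    "GPT-5": (63.19, 29.52),
    "GPT-5-mini": (28.83, 26.67),
    "GPT-4o": (13.50, 12.38),
    "Qwen3-max": (17.18, 5.71),
    "Claude Sonnet 4.5": (9.82, 5.71),
    "DeepSeek V3": (7.98, 3.81),
    "Qwen2.5-3B-Instruct": (4.91, 1.90),
}


def avg_success(forward: float, backward: float) -> float:
    """Unweighted mean of the two directional success rates (the reported metric)."""
    return (forward + backward) / 2


def weighted_success(forward: float, backward: float,
                     n_forward: int = 163, n_backward: int = 105) -> float:
    """Package-weighted mean, shown for contrast only (not the reported metric)."""
    total = n_forward + n_backward
    return (forward * n_forward + backward * n_backward) / total


for model, (fwd, bwd) in RATES.items():
    print(f"{model}: unweighted={avg_success(fwd, bwd):.3f}, "
          f"weighted={weighted_success(fwd, bwd):.3f}")
```

Because the forward direction has more packages, the two averages differ: for GPT-5 the unweighted mean matches the table's 46.36, while weighting by package count gives roughly 50.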

Benchmark

Overview of Build-bench

Most software-engineering benchmarks evaluate source-level coding, issue resolution, or test-passing behavior in homogeneous environments. Build-bench targets a different capability: repairing software packages whose build breaks when migrating across instruction set architectures.

Each task starts from a package that builds successfully on a source architecture but fails on the target architecture. The agent must inspect the package, localize the build failure, modify code or configuration files, upload the repaired package to Open Build Service, and use the executable build result as feedback.

A repair is counted as successful only when the package rebuilds successfully on the target architecture.

268 reproducible failed packages
163 x86_64 → aarch64 failures
105 aarch64 → x86_64 failures
95.90% LLM-assisted labels matching human consensus

Figure: Build-bench repair and verification workflow
Workflow. Failed packages are analyzed by an LLM-driven repair module, modified through external tools, rebuilt on Open Build Service, and iteratively refined using updated logs and previous repair content.

Corpus composition

Task categories

The 268 failures are grouped into five expert-validated classes that cover the build process from environment setup to runtime validation.

78 packages

Build Preparation Error

Missing macros, incompatible toolchains, unresolved dependencies, invalid compiler flags, or duplicate arguments that prevent configuration.

Repair signal: dependency declarations, architecture conditions, package metadata, and compiler setup.

116 packages

Compilation Error

Build-system incompatibilities, linker-level failures, type mismatches, missing headers, language-standard conflicts, or warnings promoted to errors.

Repair signal: source files, build scripts, toolchain behavior, and compiler diagnostics.

24 packages

Packaging Error

Missing binaries, unreferenced documentation, failing RPM scripts, failed install targets, duplicate installs, or policy violations.

Repair signal: install paths, spec sections, generated artifacts, and packaging rules.

42 packages

Test Failure

Functional assertion mismatches, restricted runtime environments, missing test dependencies, crashes, timeouts, or resource-related failures.

Repair signal: test logs, runtime behavior, dependency setup, and architecture-specific assumptions.

8 packages

Environment / Infrastructure Error

Host, VM, or virtualization interruptions that terminate the build outside ordinary source-level repair.

Repair signal: external build-service state and reproducibility checks.
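As a quick consistency check on the taxonomy above, the per-class counts can be tabulated and summed. This is a minimal sketch; the dictionary layout is an illustrative choice, and only the class names and counts come from this page.

```python
# Package counts per failure class, as listed above.
CLASS_COUNTS = {
    "Build Preparation Error": 78,
    "Compilation Error": 116,
    "Packaging Error": 24,
    "Test Failure": 42,
    "Environment / Infrastructure Error": 8,
}

total = sum(CLASS_COUNTS.values())

# Share of the corpus per class, in percent.
shares = {name: 100 * count / total for name, count in CLASS_COUNTS.items()}

for name, share in shares.items():
    print(f"{name}: {CLASS_COUNTS[name]} packages ({share:.2f}%)")
print(f"Total: {total}")
```

The counts sum to the 268 packages in the corpus, with Compilation Error as the largest class.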

Evaluation design

Inputs, tools, and verification loop

Step 1

Input & diagnosis context

Complete source package, source archive, specification and metadata files, build scripts, and architecture-specific failed build logs.

Step 2

Tool-augmented repair

Structure extraction, file-content extraction, decompression, compression, content modification, OBS upload, and build-result checking through a unified MCP-style interface.
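One way to picture a unified tool interface with a per-iteration call budget is a small registry plus a session object. The sketch below is an assumption-laden toy: the tool names, the in-memory package (a dict of path to text), and the `ToolSession` class are all hypothetical and are not Build-bench's actual API or the MCP wire protocol.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry: tool names are illustrative only.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register a function under a tool name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return register


@tool("extract_structure")
def extract_structure(package: Dict[str, str]) -> List[str]:
    """List file paths in a toy in-memory package."""
    return sorted(package)


@tool("extract_file_content")
def extract_file_content(package: Dict[str, str], path: str) -> str:
    """Return the text of one file."""
    return package[path]


@tool("modify_content")
def modify_content(package: Dict[str, str], path: str, text: str) -> str:
    """Overwrite one file with new text."""
    package[path] = text
    return "modified"


class ToolSession:
    """Dispatch tool calls by name and enforce a call budget (T_max = 20 here)."""

    def __init__(self, max_calls: int = 20) -> None:
        self.max_calls = max_calls
        self.calls = 0

    def call(self, name: str, **kwargs: Any) -> Any:
        if self.calls >= self.max_calls:
            raise RuntimeError("tool-call budget exhausted for this iteration")
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        self.calls += 1
        return TOOLS[name](**kwargs)
```

Routing every call through one dispatcher makes the budget easy to enforce and gives a single place to log tool usage for later analysis.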

Step 3

Executable validation

Rebuild the updated package on OBS and return the latest build result, build log, and previous repair content for the next iteration when needed.
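The feedback loop described in Steps 1 to 3 can be sketched as a bounded iteration. Here `propose_repair` and `rebuild` are hypothetical callables standing in for the LLM repair module and the OBS rebuild; only the iteration limit of three and the feedback contents (latest log plus previous repair content) come from this page.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

N_MAX = 3  # repair iterations in the main protocol


@dataclass
class BuildResult:
    succeeded: bool
    log: str


def repair_loop(
    initial_log: str,
    propose_repair: Callable[[str, List[str]], str],  # hypothetical LLM step
    rebuild: Callable[[str], BuildResult],            # hypothetical OBS rebuild
    n_max: int = N_MAX,
) -> Optional[int]:
    """Return the iteration at which the build succeeded, or None if it never did."""
    log = initial_log
    previous_repairs: List[str] = []
    for iteration in range(1, n_max + 1):
        repair = propose_repair(log, previous_repairs)
        result = rebuild(repair)
        if result.succeeded:
            return iteration
        # Feed the new log and the failed repair into the next attempt.
        previous_repairs.append(repair)
        log = result.log
    return None
```

Returning the iteration number rather than a boolean makes it straightforward to report cumulative success after each iteration.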

Step 4

Metrics

Build success rate, average repair time, average token consumption, and tool invocation behavior under a fixed iteration and tool-call budget.
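The metrics above can be aggregated from per-package run records. The following is a sketch under assumptions: the `RunRecord` fields are hypothetical, and averaging over all attempted packages (rather than only successful ones) is an assumption this page does not confirm.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RunRecord:
    """One repair attempt on one package (hypothetical schema)."""
    package: str
    succeeded: bool
    minutes: float
    tokens_k: float
    tool_calls: int


def summarize(records: List[RunRecord]) -> Dict[str, float]:
    """Aggregate success rate and efficiency over all attempted packages."""
    if not records:
        raise ValueError("no records to summarize")
    return {
        "success_rate_pct": 100 * sum(r.succeeded for r in records) / len(records),
        "avg_minutes": mean(r.minutes for r in records),
        "avg_tokens_k": mean(r.tokens_k for r in records),
        "avg_tool_calls": mean(r.tool_calls for r in records),
    }
```

For example, two records with one success, 10 and 20 minutes, and 100K and 300K tokens summarize to a 50% success rate, 15 minutes, and 200K tokens on average.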

Repair modes

Repair strategies

Full File Generation

The model regenerates the complete faulty file while preserving structure, style, and necessary context.

  • Better for broad contextual changes.
  • More robust when cross-file consistency matters.
  • Higher latency and token cost.

Patch Generation

The model emits line-level unified diffs that are automatically applied to the source tree.

  • Faster and cheaper across most models.
  • Useful for localized edits.
  • Can fail when patch formatting or context is incomplete.
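To make the patch format concrete, the sketch below produces a unified diff for a one-line change using Python's standard `difflib`. The spec-file contents and the `ExclusiveArch` edit are a made-up illustration, not a repair taken from the benchmark.

```python
import difflib

# Hypothetical spec file that restricts the build to x86_64.
original = (
    "Name: demo\n"
    "Version: 1.0\n"
    "ExclusiveArch: x86_64\n"
    "BuildRequires: gcc\n"
).splitlines(keepends=True)

# Illustrative repair: also allow aarch64.
repaired = [
    line.replace("ExclusiveArch: x86_64", "ExclusiveArch: x86_64 aarch64")
    for line in original
]

# Line-level unified diff, the format Patch Generation emits.
patch = "".join(
    difflib.unified_diff(
        original, repaired, fromfile="a/demo.spec", tofile="b/demo.spec"
    )
)
print(patch)
```

The hunk carries unchanged context lines around the edit. A patch tool can only apply the hunk when those context lines match the target file, which is one reason a patch with incomplete or mistyped context fails to apply.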

Figure: Comparison of full file generation and patch generation
Strategy comparison. Patch Generation substantially lowers latency and token cost, while Full File Generation often provides stronger robustness for complex structural changes.

Findings

Key results

Current LLMs can repair real cross-ISA failures, but the task remains difficult.

GPT-5 reaches the best full-file success rate, repairing 63.19% of x86_64 → aarch64 failures and 29.52% of aarch64 → x86_64 failures. All models still leave substantial headroom.

Agentic orchestration is essential.

Bare single-shot repair performs poorly: without tool use and iterative feedback, GPT-5 repairs only 6.13% of x86_64 → aarch64 failures, compared with 63.19% under the full Build-bench protocol.

Iterative feedback produces cumulative gains.

GPT-5 improves from 36.81% after the first iteration to 63.19% after the third in x86_64 → aarch64 migration.

Repair granularity changes the cost-completeness balance.

Patch Generation reduces GPT-5 repair time in the x86_64 → aarch64 direction from 31.18 to 8.93 minutes and token usage from 1830.91K to 761.88K, while Full File Generation remains more robust when a repair needs broader consistency.
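The relative sizes of these effects follow from simple arithmetic on the figures quoted in the key results. The sketch below derives them for GPT-5 in the x86_64 → aarch64 direction; the helper name is illustrative.

```python
def pct_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100 * (before - after) / before


# Patch Generation vs. Full File Generation.
time_cut = pct_reduction(31.18, 8.93)        # repair time in minutes
token_cut = pct_reduction(1830.91, 761.88)   # token usage in thousands

# Gain from iteration 1 to iteration 3, in percentage points.
iteration_gain = 63.19 - 36.81

# Full Build-bench protocol vs. bare single-shot repair.
orchestration_ratio = 63.19 / 6.13

print(f"time reduction:   {time_cut:.1f}%")
print(f"token reduction:  {token_cut:.1f}%")
print(f"iteration gain:   {iteration_gain:.2f} points")
print(f"agentic vs. bare: {orchestration_ratio:.1f}x")
```

In relative terms, Patch Generation cuts time by roughly 71% and tokens by roughly 58%, iteration adds about 26 percentage points, and the full protocol repairs about ten times as many forward failures as single-shot repair.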

Figure: Build success rates of evaluated LLMs
Model comparison. Build success rates across two migration directions and two repair modes. GPT-5 is strongest overall, but current models still struggle with large-scale heterogeneous repair.

Failure analysis

Where models still fail

Long and interleaved build logs

Models can miss the causal error among noisy compiler output, repeated dependency traces, and downstream packaging messages.
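A common mitigation for noisy logs is to surface the first error line with a little surrounding context, since later errors are often downstream consequences. The heuristic below is an illustration only, not a technique described on this page; the regular expression and the sample log are assumptions.

```python
import re
from typing import List

# Heuristic markers of a failing line. `\berror\b` deliberately does not
# match compiler flags such as -Werror, where no word boundary precedes "error".
ERROR_PATTERN = re.compile(
    r"\berror\b|undefined reference|No such file or directory",
    re.IGNORECASE,
)


def first_error_window(log: str, before: int = 2, after: int = 2) -> List[str]:
    """Return the first matching line plus nearby context, or [] if none match."""
    lines = log.splitlines()
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            return lines[max(0, i - before): i + after + 1]
    return []


# Made-up log excerpt for illustration.
SAMPLE_LOG = """gcc -O2 -Werror -c foo.c
foo.c: In function 'bar':
foo.c:12:5: error: unknown type name '__m128i'
make: *** [Makefile:10: foo.o] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.abc (%build)"""

window = first_error_window(SAMPLE_LOG, before=1, after=0)
```

In the sample, the window stops at the compiler diagnostic and skips the `make` and RPM messages that merely report the same failure again.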

Incomplete or premature output

Some repairs are syntactically plausible but incomplete, especially when coordinating several files or regenerating long configuration content.

Incorrect tool sequences

Agents sometimes modify decompressed files but upload stale or unrecompressed packages, leading to invalid OBS submissions.
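This failure mode can be detected mechanically from a trace of tool calls. The sketch below flags uploads that happen while edits have not yet been recompressed; the tool names are hypothetical placeholders for the corresponding Build-bench tools.

```python
from typing import List


def find_stale_uploads(calls: List[str]) -> List[int]:
    """Return indices of uploads made while edits are not yet recompressed."""
    stale: List[int] = []
    dirty = False  # True when decompressed files differ from the archive
    for i, name in enumerate(calls):
        if name == "modify_content":
            dirty = True
        elif name == "compress":
            dirty = False
        elif name == "upload_to_obs" and dirty:
            stale.append(i)
    return stale


good_trace = ["decompress", "modify_content", "compress", "upload_to_obs"]
bad_trace = ["decompress", "modify_content", "upload_to_obs"]

print(find_stale_uploads(good_trace))  # no stale uploads
print(find_stale_uploads(bad_trace))   # upload at index 2 is stale
```

A harness could run a check like this before submitting to OBS and return an explicit error to the agent instead of spending a build on an unchanged archive.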

Redundant verification loops

Some models repeatedly check build results without substantive changes, increasing runtime without improving success.
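Redundant verification can likewise be counted from a tool-call trace. The sketch below counts build-result checks that follow an earlier check with no intervening edit or upload. The tool names are hypothetical, and the heuristic is deliberately strict: some repeated polling may be legitimate while a build is still running.

```python
from typing import List, Sequence


def count_redundant_checks(
    calls: List[str],
    check: str = "check_build_result",
    changing: Sequence[str] = ("modify_content", "upload_to_obs"),
) -> int:
    """Count checks issued with no state-changing call since the previous check."""
    redundant = 0
    changed_since_check = True  # the first check is never counted as redundant
    for name in calls:
        if name == check:
            if not changed_since_check:
                redundant += 1
            changed_since_check = False
        elif name in changing:
            changed_since_check = True
    return redundant


trace = [
    "upload_to_obs", "check_build_result", "check_build_result",
    "check_build_result", "modify_content", "upload_to_obs",
    "check_build_result",
]
print(count_redundant_checks(trace))
```

In this trace the second and third checks are counted as redundant, while the final check follows a new edit and upload and is not.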

Cross-file consistency failures

Architecture migration often requires synchronized changes across spec files, source code, compiler flags, and generated artifacts.

Figure: Tool invocation behavior across LLMs
Tool usage. Stronger agents inspect, edit, upload, and verify more effectively, while redundant checks can lead to costly non-convergent loops.

Appendix material

Full experimental tables

Paper Table 2. Performance of LLMs on cross-ISA build failures in both migration directions.

Paper Table 4. Iteration-wise improvement in build success rate across three repair iterations.

Paper Table 5. GPT-5 repair success rates by failure category under the two repair strategies.

This breakdown shows that the corpus taxonomy is not only descriptive: repair performance varies by failure cause and by editing granularity. Patch Generation is often stronger for localized preparation or packaging fixes, while Full File Generation is more reliable for broader contextual regeneration in several forward-migration categories.

Background

Why it matters and what we contribute

01

Architecture-aware system repair

Tasks involve package-level failures caused by heterogeneous ISAs, toolchains, dependencies, build scripts, packaging rules, and long build logs.

02

Executable verification

A task succeeds only when the modified package is rebuilt successfully on the target architecture through Open Build Service.

03

End-to-end evaluation framework

Repair attempts are validated by executable rebuilds, not by textual similarity, static prediction, or single-file tests.

04

Empirical baselines and insights

Results expose model gaps in log comprehension, procedural tool use, cross-file reasoning, and architecture-specific adaptation.

Reference

Citation

@misc{zhao2025languagemodelscodingassessing,
  title={Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems},
  author={Chenyu Zhao and Shenglin Zhang and Zeshun Huang and Weilin Jin and Yongqian Sun and Dan Pei and Chaoyun Zhang and Qingwei Lin and Chetan Bansal and Saravan Rajmohan and Minghua Ma},
  year={2025},
  eprint={2511.00780},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2511.00780}
}