Foundational Execution Capabilities: Software Engineering and Verifiable Domains

Code data and model capabilities in 2026

Over the past year, frontier AI labs have invested heavily in improving foundation model performance on software engineering tasks. While models have made remarkable strides in code generation, the gap between generating syntactically correct code and producing reliable, production-grade software remains significant. In this report, we examine the foundational execution capabilities that define state-of-the-art performance across software engineering and verifiable domains.

Execution-first evaluation methodology

Traditional benchmarks for code generation focus on surface-level metrics such as pass@k on isolated function completions. Our methodology differs in three ways:

End-to-end execution. Every generated solution is compiled, executed, and validated against a test suite that covers edge cases, performance constraints, and integration boundaries.
Multi-file context. Tasks require reasoning across multiple files, modules, and dependency graphs rather than isolated function-level completions.
Verifiable correctness. Domains like theorem proving, constraint satisfaction, and type-checked transformations provide formal verification of solution correctness.

This approach yields a more reliable signal about real-world model capabilities, bridging the gap between benchmark performance and practical deployment.
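The end-to-end execution step can be illustrated with a small harness. This is a minimal sketch under stated assumptions, not the harness used for this report: the helper name `run_candidate`, the file names, and the 10-second timeout are all hypothetical. It writes a candidate solution and its tests to a scratch directory, runs the tests in a separate process, and counts the candidate as passing only if the process exits cleanly.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(solution_src: str, test_src: str, timeout_s: float = 10.0) -> bool:
    """Run test_src against solution_src in a fresh subprocess.

    Returns True only if the test script exits with status 0 within the
    timeout. Hypothetical helper: names and defaults are assumptions.
    """
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "solution.py").write_text(solution_src)
        Path(workdir, "test_solution.py").write_text(test_src)
        try:
            result = subprocess.run(
                [sys.executable, "test_solution.py"],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang counts as a failure
        return result.returncode == 0


# Toy inputs: one correct candidate and one with a deliberate bug.
SOLUTION = "def add(a, b):\n    return a + b\n"
BROKEN = "def add(a, b):\n    return a - b\n"
TESTS = "from solution import add\nassert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

passed_ok = run_candidate(SOLUTION, TESTS)
passed_broken = run_candidate(BROKEN, TESTS)
```

Running the tests in a separate process keeps a crashing or hanging candidate from taking the harness down with it. A production harness would likely also need sandboxing and resource limits, which this sketch omits.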

Figure 1: Pass rates across execution domains. Function completion 94%, multi-file edits 71%, end-to-end test pass 58%, verified domains 43%.

Core benchmarks and methodology

We evaluate across four primary benchmark suites, each targeting a distinct dimension of software engineering capability. The suites were constructed in collaboration with domain experts from frontier AI labs and top-tier engineering organisations.

Each task is annotated with difficulty ratings, expected time-to-complete for senior engineers, and fine-grained rubrics that decompose correctness into functional, structural, and performance dimensions. The rubrics were validated through an inter-annotator agreement study with a Cohen's kappa exceeding 0.85.
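For readers unfamiliar with the agreement statistic, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two annotators. The function name and the toy "pass"/"fail" labels are illustrative assumptions; the report's actual rubric labels and annotation data are not shown here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (made up for illustration): the raters disagree on one of ten items.
annotator_1 = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
annotator_2 = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Note that 90% raw agreement in this toy example corresponds to a kappa well below 0.9, because part of that agreement is expected by chance.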

Example — multi-file task specification

{
  "task_id": "mfe-0042",
  "domain": "web_application",
  "files": [
    "src/api/routes.ts",
    "src/middleware/auth.ts",
    "src/models/user.ts",
    "tests/integration/auth.test.ts"
  ],
  "objective": "Implement JWT refresh token rotation with automatic revocation on reuse detection",
  "constraints": {
    "max_latency_ms": 50,
    "must_pass_tests": true,
    "security_audit": "owasp_top_10"
  }
}
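A specification like the one above can be loaded into a typed structure before a task is dispatched. The following is a minimal sketch: the `TaskSpec` class, the `parse_task_spec` function, and the two sanity checks are hypothetical and are not part of any published schema. The objective string is kept on a single line because JSON strings cannot contain raw line breaks.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaskSpec:
    """Typed view of a task specification (field names mirror the example)."""
    task_id: str
    domain: str
    files: List[str]
    objective: str
    max_latency_ms: int
    must_pass_tests: bool
    security_audit: str


def parse_task_spec(raw: str) -> TaskSpec:
    """Parse a JSON task specification and apply basic sanity checks."""
    data = json.loads(raw)
    constraints = data["constraints"]
    if not data["files"]:
        raise ValueError("a multi-file task needs at least one file")
    if constraints["max_latency_ms"] <= 0:
        raise ValueError("max_latency_ms must be positive")
    return TaskSpec(
        task_id=data["task_id"],
        domain=data["domain"],
        files=list(data["files"]),
        objective=data["objective"],
        max_latency_ms=constraints["max_latency_ms"],
        must_pass_tests=constraints["must_pass_tests"],
        security_audit=constraints["security_audit"],
    )


RAW_SPEC = """
{
  "task_id": "mfe-0042",
  "domain": "web_application",
  "files": [
    "src/api/routes.ts",
    "src/middleware/auth.ts",
    "src/models/user.ts",
    "tests/integration/auth.test.ts"
  ],
  "objective": "Implement JWT refresh token rotation with automatic revocation on reuse detection",
  "constraints": {
    "max_latency_ms": 50,
    "must_pass_tests": true,
    "security_audit": "owasp_top_10"
  }
}
"""

spec = parse_task_spec(RAW_SPEC)
```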

Results across execution domains

Our evaluation reveals a consistent pattern across all tested models: while isolated code generation capabilities have improved dramatically, the ability to reason about system-level constraints, manage state across files, and produce code that integrates cleanly into existing codebases remains the primary bottleneck.

Models trained on higher-quality, execution-verified data consistently outperform those trained on larger but unverified corpora. This finding underscores the importance of data quality over data quantity in advancing frontier model capabilities.

Domain	Baseline	+ Verified Data	Delta (pp)
Function synthesis	87.2%	91.4%	+4.2
Bug localisation	62.1%	74.8%	+12.7
Multi-file refactor	41.6%	58.3%	+16.7
Theorem proving	28.4%	39.1%	+10.7
System design	33.9%	47.2%	+13.3
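The Delta column is the difference between the two pass rates in percentage points. The short sketch below recomputes it from the table and also derives the relative improvement over baseline. The relative figure is not reported in the table and is included here only as an illustrative second view; the names `RESULTS`, `gains`, and `SUMMARY` are hypothetical.

```python
from typing import Dict, Tuple

# (baseline %, with verified data %) as listed in the results table.
RESULTS: Dict[str, Tuple[float, float]] = {
    "Function synthesis": (87.2, 91.4),
    "Bug localisation": (62.1, 74.8),
    "Multi-file refactor": (41.6, 58.3),
    "Theorem proving": (28.4, 39.1),
    "System design": (33.9, 47.2),
}


def gains(baseline: float, verified: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = round(verified - baseline, 1)
    relative_pct = round(100.0 * (verified - baseline) / baseline, 1)
    return absolute_pp, relative_pct


SUMMARY = {domain: gains(*scores) for domain, scores in RESULTS.items()}

# Domain with the largest gain under each measure.
largest_absolute = max(SUMMARY, key=lambda d: SUMMARY[d][0])
largest_relative = max(SUMMARY, key=lambda d: SUMMARY[d][1])

for domain, (pp, rel) in SUMMARY.items():
    print(f"{domain}: +{pp} pp, +{rel}% relative to baseline")
```

On these figures, multi-file refactor shows the largest gain on both measures and function synthesis the smallest, which is consistent with the observation that the biggest improvements come where baseline performance is weakest.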

Practical implications for engineering

These findings have direct implications for how AI systems are integrated into software engineering workflows. The key takeaways for practitioners:

Execution-verified training data produces measurably more reliable model outputs across all tested domains.
Multi-file reasoning remains the critical frontier. Investments in data that captures cross-file dependencies yield the highest marginal returns.
Formal verification domains provide an unambiguous signal for training. Models that learn from verified proofs generalise better to software engineering tasks.

Datacurve's role in the data pipeline

At Datacurve, we have built the infrastructure to generate, verify, and deliver the highest-quality execution data at scale. Our platform connects world-class software engineers with frontier AI labs, producing annotated datasets that meet the rigorous standards outlined in this report.

Every task on our platform goes through automated execution checks, human expert review, and continuous feedback loops that ensure data quality improves with each iteration. We believe this approach represents the most promising path toward AI systems that can reliably assist in real-world software engineering.
