Developers increasingly trust coding agents to independently complete engineering tasks that take hours, but popular public benchmarks like SWE-Bench Pro still evaluate them on single-file changes averaging just 120 lines of code, graded by verifiers that, in our audit, produced 8% false positives and 25% false negatives.
We introduce Deep SWE, a benchmark for long-horizon software engineering. Deep SWE tasks are substantially larger, with reference solutions averaging 668 lines of code.
The result is clearer separation between frontier models: models that existing public benchmarks often place close together, but that developers experience as meaningfully different.