Research | Datacurve

Introducing DeepSWE

May 18, 2026 | Benchmark

Developers increasingly trust coding agents to independently complete engineering tasks that take hours, but popular public benchmarks like SWE-Bench Pro still evaluate them on single-file changes averaging just 120 lines of code, graded by verifiers that in our audit produced 8% false positives and 25% false negatives.

We introduce DeepSWE, a benchmark for long-horizon software engineering. DeepSWE tasks are substantially larger, with reference solutions averaging 668 lines of code.

The result is clearer separation between frontier models: models that existing public benchmarks often place close together, but that developers experience as meaningfully different.
