The AI Productivity Index for Software Engineers (APEX-SWE) measures whether frontier AI systems can execute economically valuable software engineering work. It covers Integration and Observability tasks.
Unlike unit-level or single-repository bug-fix benchmarks, APEX-SWE is designed to evaluate the real day-to-day work of software engineers. It comprises n=200 cases spanning two complementary settings that mirror professional SWE work: (1) Integration tasks, which require end-to-end system construction and deployment across heterogeneous services, and (2) Observability tasks, which require debugging with production-style telemetry.
Each task includes a human-authored rubric that grades agent outputs for functional requirements, robustness, and code style, alongside unit tests.
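As a rough illustration of how rubric-based grading can complement unit tests, the sketch below scores an agent's output as a weighted fraction of satisfied criteria across the three categories named above. The `Criterion` class, the example criteria, and the weights are all hypothetical, not APEX-SWE's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    category: str   # "functional", "robustness", or "style"
    weight: float
    passed: bool    # judgment for one agent output

def rubric_score(criteria):
    """Weighted fraction of rubric criteria the agent's output satisfies."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.passed)
    return earned / total if total else 0.0

# Hypothetical rubric for an Integration task.
rubric = [
    Criterion("Service deploys and responds to health checks", "functional", 2.0, True),
    Criterion("Handles upstream timeouts gracefully", "robustness", 1.0, False),
    Criterion("Follows the repository's lint configuration", "style", 0.5, True),
]
print(round(rubric_score(rubric), 3))  # 2.5 / 3.5 ≈ 0.714
```

A graded score like this captures partial credit for robustness and style, which binary unit-test outcomes alone would miss.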
To support open research, we have open-sourced n=50 in-distribution APEX-SWE cases on Hugging Face, with all metadata labels, and released our evaluation harness for reproducibility.
- Opus 4.5 (High): 50.7%
- GPT 5.4 (High): 50.7%
- GPT 5.3 Codex (High): 49.7%
- GPT 5.3 Codex (High): 33.3%
- Opus 4.6 (High): 31.7%
- Opus 4.5 (High): 26.7%