The AI Productivity Index for SWEs

The AI Productivity Index for Software Engineers (APEX-SWE) measures whether frontier AI systems can execute economically valuable software engineering work. It covers Integration and Observability tasks.

Get in touch Read research

The APEX-SWE leaderboard

We created APEX-SWE to evaluate the real day-to-day work of software engineers, unlike unit-level and single-repository bug-fix benchmarks. It comprises n=200 cases and spans two complementary settings that mirror professional SWE work: (1) Integration tasks that require end-to-end system construction and deployment across heterogeneous services and (2) Observability tasks that require debugging with production-style telemetry.

Each task includes a human-authored rubric that grades agent outputs for functional requirements, robustness, and code style, alongside unit tests.

To support open research, we have open-sourced n=50 cases that are in-distribution of APEX-SWE on Hugging Face with all metadata labels. We have also shared our eval harness for reproducibility.

Blog Paper Data Code Sample task

Model

Pass@1

Fable 5

65.5% ± 6.2%

Grok 4.5

51.2% ± 6.0%

Opus 4.8 (High)

47.3% ± 6.0%

Sonnet 5 (High)

43.6% ± 6.0%

GPT 5.3 Codex (High)

41.5% ± 6.3%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Domains covered in APEX-SWE

Integration

Evaluates a model's ability to orchestrate end-to-end workflows and synchronize data across heterogeneous services.

Grok 4.5

65.0%

GLM-5.2 (Thinking)

55.3%

Sonnet 5 (High)

54.3%

Observability

Evaluates a model's ability to diagnose and remediate real-world software engineering production failures.

Opus 4.8 (High)

43.3%

Grok 4.5

37.3%

GPT 5.3 Codex (High)

33.3%

The AI Productivity Index for SWEs

The APEX-SWE leaderboard

Domains covered in APEX-SWE

The latest on frontier AI performance, straight to your inbox.