The AI Productivity Index for SWEs

The AI Productivity Index for Software Engineers (APEX-SWE) measures whether frontier AI systems can execute economically valuable software engineering work. It covers Integration and Observability tasks.

The APEX-SWE leaderboard

We created APEX-SWE to evaluate the real day-to-day work of software engineers, in contrast to benchmarks built on unit-level tasks or single-repository bug fixes. It comprises n=200 cases across two complementary settings that mirror professional SWE work: (1) Integration tasks, which require end-to-end system construction and deployment across heterogeneous services, and (2) Observability tasks, which require debugging with production-style telemetry.

Each task includes unit tests and a human-authored rubric that grades agent outputs on functional requirements, robustness, and code style.
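As a rough illustration of how rubric grades and unit-test results could be reported together for one task, here is a minimal Python sketch. It is not the APEX-SWE harness: the `RubricItem` structure, the `rubric_score` and `task_result` helpers, and the choice to score the rubric as the fraction of criteria met are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One human-authored criterion (hypothetical structure, not the real schema)."""
    category: str  # e.g. "functional", "robustness", "style"
    met: bool      # the grader's judgment for this criterion


def rubric_score(items: list[RubricItem]) -> float:
    """Fraction of rubric criteria met; 0.0 for an empty rubric (assumed convention)."""
    if not items:
        return 0.0
    return sum(item.met for item in items) / len(items)


def task_result(tests_passed: int, tests_total: int, items: list[RubricItem]) -> dict:
    """Report unit-test status and rubric score side by side for one task."""
    return {
        # A task passes its tests only if there is at least one test and all pass.
        "tests_pass": tests_total > 0 and tests_passed == tests_total,
        "rubric": rubric_score(items),
    }


if __name__ == "__main__":
    graded = [
        RubricItem("functional", True),
        RubricItem("robustness", False),
        RubricItem("style", True),
        RubricItem("functional", True),
    ]
    print(task_result(tests_passed=5, tests_total=5, items=graded))
```

Keeping the two signals separate rather than folding them into one number is a design choice of this sketch; the benchmark may aggregate them differently.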

To support open research, we have open-sourced n=50 cases on Hugging Face, drawn from the same distribution as APEX-SWE and including all metadata labels. We have also shared our eval harness for reproducibility.

Domains covered in APEX-SWE

Integration

Evaluates a model's ability to orchestrate end-to-end workflows and synchronize data across heterogeneous services.

GPT 5.5 (xHigh) (gpt-5.5-xhigh): 52.7%
Opus 4.7 (Max) (claude-opus-4-7): 52.0%
Opus 4.5 (High) (claude-opus-4-5-high): 50.7%

Observability

Evaluates a model's ability to diagnose and remediate real-world software engineering production failures.

Opus 4.8 (High) (claude-opus-4-8-high): 43.3%
GPT 5.3 Codex (High) (gpt-5.3-codex-high): 33.3%
Opus 4.6 (High) (claude-opus-4-6-high): 31.7%
