APEX Benchmarks

The APEX family of benchmarks assesses whether frontier AI models can perform economically valuable tasks across professional services, medicine, software engineering, and consumer activities.

APEX-Agents

The AI Productivity Index for Agents (APEX-Agents) measures whether frontier AI agents can execute long-horizon, cross-application tasks across three jobs in professional services.

GPT 5.5 (xHigh): 38.4% ± 3.9%
GPT 5.4 (xHigh): 36.0% ± 3.8%
GPT 5.2 (xHigh): 34.4% ± 3.8%
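The leaderboards report each score with a ± margin. Assuming, purely as an illustration (the source does not say how the margins are computed), that these are roughly 95% confidence intervals on a binomial task pass rate, such a margin can be sketched as:

```python
import math

def wald_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wald confidence interval for a pass rate.

    p_hat ± z * sqrt(p_hat * (1 - p_hat) / n), with z = 1.96 for ~95%.
    """
    p = successes / trials
    margin = z * math.sqrt(p * (1 - p) / trials)
    return p - margin, p + margin

# Hypothetical numbers (not from APEX): 384 passes out of 1000 attempts.
lo, hi = wald_ci(384, 1000)
```

The actual APEX margins may come from a different estimator or a different number of trials; the sketch only shows why a benchmark score carries an uncertainty band at all.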

APEX-SWE

The AI Productivity Index for Software Engineers (APEX-SWE) measures whether frontier AI agents can execute high-value engineering work, split across Observability and Integration tasks. Created by Mercor in collaboration with Cognition.

GPT 5.3 Codex (High): 41.5% ± 6.3%
Opus 4.7 (Max): 41.3% ± 6.3%
Opus 4.6 (High): 40.5% ± 6.3%

APEX

The AI Productivity Index (APEX) assesses whether frontier models are capable of performing economically valuable tasks across four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD).

GPT 5.4 (High): 67.2% ± 2.4%
Opus 4.6 (Max): 65.7% ± 2.6%
Opus 4.6 (High): 65.3% ± 2.7%
