APEX Benchmarks

The APEX family of benchmarks assesses whether frontier AI models can perform economically valuable tasks across professional services, medicine, software engineering, and consumer activities.

APEX-Agents

The AI Productivity Index for Agents (APEX-Agents) measures whether frontier AI agents can execute long-horizon, cross-application tasks across three jobs in professional services.

GPT 5.5 (xHigh): 38.4% ± 3.9%
GPT 5.4 (xHigh): 36.0% ± 3.8%
GPT 5.2 (xHigh): 34.4% ± 3.8%
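The leaderboards report each score with a ± margin. Assuming, purely as an illustration (the source does not say how the margins are computed), that these are roughly 95% confidence intervals on a binomial task pass rate, such a margin can be sketched as:

```python
import math

def wald_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wald confidence interval for a pass rate.

    p_hat ± z * sqrt(p_hat * (1 - p_hat) / n), with z = 1.96 for ~95%.
    """
    p = successes / trials
    margin = z * math.sqrt(p * (1 - p) / trials)
    return p - margin, p + margin

# Hypothetical numbers (not from APEX): 384 passes out of 1000 attempts.
lo, hi = wald_ci(384, 1000)
```

The actual APEX margins may come from a different estimator or a different number of trials; the sketch only shows why a benchmark score carries an uncertainty band at all.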

APEX-SWE

The AI Productivity Index for Software Engineers (APEX-SWE) measures whether frontier AI agents can execute high-value engineering work, split across Observability and Integration tasks. Created by Mercor in collaboration with Cognition.

GPT 5.3 Codex (High): 41.5% ± 6.3%
Opus 4.7 (Max): 41.3% ± 6.3%
Opus 4.6 (High): 40.5% ± 6.3%

APEX

The AI Productivity Index (APEX) assesses whether frontier models are capable of performing economically valuable tasks across four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD).

GPT 5.4 (High): 67.2% ± 2.4%
Opus 4.6 (Max): 65.7% ± 2.6%
Opus 4.6 (High): 65.3% ± 2.7%
