We develop benchmarks, evaluation environments, and large-scale human datasets to fuel AI breakthroughs at the frontier, all through our marketplace of top-tier experts.
Mercor is used by the top 5 AI labs and 6 of the Magnificent Seven tech companies.
When model capabilities reach their limits, progress depends on data quality. Mercor's talent platform mobilizes deep subject-matter experts across professional and consumer domains to produce specialized data at scale.
Frontier-grade data unlocks advanced reasoning, long-horizon planning, tool use, and safe behavior under uncertainty. We power meaningful gains with novel datasets that are realistic, challenging, and diverse.
We build reinforcement learning (RL) environments in three steps: creating realistic, data-rich worlds that capture real behavior, implementing the tools and applications that agents need to interact with those worlds, and writing rigorous tasks and verifiers.
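As a rough illustration of how a world, its tools, and a task with a verifier fit together, here is a minimal Python sketch. It is not Mercor's implementation: the names `World`, `archive_message`, `Task`, and `run_episode` are hypothetical, the "world" is reduced to a toy inbox, and a scripted list of tool calls stands in for an agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class World:
    """Step 1 (hypothetical): a data-rich world, reduced here to a toy inbox."""
    inbox: List[str] = field(default_factory=lambda: ["Invoice #1 due", "Lunch?"])
    archived: List[str] = field(default_factory=list)


def archive_message(world: World, index: int) -> str:
    """Step 2 (hypothetical): a tool the agent can call to change the world."""
    message = world.inbox.pop(index)
    world.archived.append(message)
    return f"archived: {message}"


@dataclass
class Task:
    """Step 3 (hypothetical): a prompt paired with a verifier over the final state."""
    prompt: str
    verifier: Callable[[World], bool]


def run_episode(task: Task, actions: List[Callable[[World], str]]) -> float:
    """Apply the tool calls to a fresh world, then score the result with the verifier."""
    world = World()
    for act in actions:
        act(world)
    return 1.0 if task.verifier(world) else 0.0


task = Task(
    prompt="Archive every message that mentions an invoice.",
    verifier=lambda w: any("Invoice" in m for m in w.archived)
    and not any("Invoice" in m for m in w.inbox),
)

# A scripted stand-in for an agent policy: archive the first message.
reward = run_episode(task, [lambda w: archive_message(w, 0)])
```

The verifier checks the final world state rather than the sequence of tool calls, so any route to the correct outcome earns the reward. That is one common design choice, assumed here for simplicity.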



Benchmarks for evaluating the strengths and weaknesses of frontier models on high-value tasks
The AI Productivity Index for Agents (APEX-Agents) measures whether frontier AI agents can execute long-horizon, cross-application tasks across three jobs in professional services.
GPT 5.2 (xHigh): 48.2% ± 3.5%
Gemini 3.1 Pro (High): 48.1% ± 3.4%
Opus 4.6 (Max): 47.7% ± 3.4%
The AI Productivity Index (APEX) assesses whether frontier models are capable of performing economically valuable tasks across four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD).
GPT 5 (High): 67% ± 2.4%
GPT 5.2 Pro (High): 66.8% ± 2.6%
Gemini 3 Pro (High): 64.3% ± 2.3%
The AI Consumer Index (ACE) assesses whether frontier AI models can perform everyday consumer tasks in shopping, food, gaming, and DIY.
GPT 5 (High): 56.1% ± 3.3%
o3 Pro (High): 55.2% ± 3.2%
GPT 5.1 (High): 55.1% ± 3.2%
Read our latest insights on frontier data and AI research.
We're looking for exceptional people to join our Research and Engineering team.