Welcome to The Era of Evals

Brendan Foody
Co-founder / CEO
4 min read


Reinforcement Learning (RL) is driving the most exciting advances in AI. RL is becoming so effective that models will be able to saturate any evaluation. This means the primary barrier to applying agents across the entire economy is building evals for everything. However, AI labs are facing a dire shortage of relevant evaluations. The academic evaluations that labs optimize for don’t reflect what consumers and enterprises demand in the economy.

Evals are the new PRD. Progress in accelerating knowledge work will converge on building environments and evaluations that map real workspaces and deliverables. This new RL-centric paradigm of human data is vastly more data-efficient than pretraining, SFT, or RLHF. Most knowledge work treats recurring workflows as a variable cost; creating an environment or evaluation transforms them into a one-time fixed cost.

Training on Verifiable Rewards

RL environments allow for rewarding outcomes and intermediate steps in an evaluation. Models take many attempts at a problem, using test-time compute to "think" before they answer. Human-created autograders reward the attempts that were "good". Reinforcing those "good" trajectories upweights the chains of thought that were used to reach the answer. This teaches models to think correctly about different types of problems as researchers iteratively hill-climb evals.
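
As a concrete illustration, here is a minimal sketch of that loop. The `sample_attempt` callable is a hypothetical stand-in for drawing one full attempt (chain of thought plus final answer) from the model, and the autograder is a toy exact-match check, far simpler than production graders:

```python
import re

def autograde(attempt: str, reference_answer: str) -> float:
    """Toy autograder: reward 1.0 if the attempt's final answer matches the reference."""
    match = re.search(r"ANSWER:\s*(.+)", attempt)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def collect_good_trajectories(problem: str, reference_answer: str, sample_attempt, k: int = 16):
    """Sample k attempts and keep only the trajectories the autograder marks as "good".

    The RL update then upweights the chains of thought behind the kept attempts.
    """
    attempts = [sample_attempt(problem) for _ in range(k)]            # test-time "thinking"
    graded = [(attempt, autograde(attempt, reference_answer)) for attempt in attempts]
    return [(attempt, reward) for attempt, reward in graded if reward > 0.0]
```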

These environments can be thought of as existing on a spectrum of rigidity between two categories:

  • Objective domains: Games, like Pac-Man, chess, and Go, have clear state spaces, action spaces, and desired outcomes. Math, code, and even some tasks in biology can often be formulated with near game-like verifiability. This is where RL has already achieved massive early success, notably AlphaProof, AlphaFold, DeepSeek R1, and the many code generation models on the market today. (A minimal verifier sketch follows this list.)

  • Subjective domains: It’s more difficult to measure accuracy in many real-world tasks such as generating investment memos, drafting legal briefs, or providing therapy. This makes it difficult to verify that a model achieved the desired outcomes. Additionally, experts often hold multiple valid opinions about desired processes and outcomes. Rubric-based rewards serve as a way to learn from the messiness of expert human opinion (see the rubric-scoring sketch after this list). How to evaluate and train with rubrics as environments is an exciting area of research, with roots laid as early as the constitutional AI and RLAIF work from Anthropic.


  • Computer-use agents sit somewhere in the middle. For most of the tasks humans do on computers, goals become ambiguous and multi-faceted. Once defined, though, the actions and outcomes are programmatic and verifiable. These could include planning trips, responding to emails, shopping, or posting on social media. In all of these cases, containerized environments allow for horizontal scaling to learn online from thousands of interactions in parallel (a parallel-rollout sketch also follows this list).
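
To make the objective end of the spectrum concrete, here is a minimal sketch of a code verifier in the spirit of the autograder above. It assumes the generated solution and its hidden unit tests are plain Python files and that pytest is available; real harnesses add sandboxing and resource limits:

```python
import os
import subprocess
import tempfile

def verify_code_attempt(solution_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Objective-domain reward: run hidden unit tests against a generated solution.

    Returns 1.0 if every test passes, 0.0 otherwise. The file layout is illustrative.
    """
    with tempfile.TemporaryDirectory() as workdir:
        with open(os.path.join(workdir, "solution.py"), "w") as f:
            f.write(solution_code)
        with open(os.path.join(workdir, "test_solution.py"), "w") as f:
            f.write(test_code)
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=workdir, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```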
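
On the subjective end, a rubric turns expert judgment into a reward that can still be hill-climbed. The sketch below is one possible shape, assuming a hypothetical `judge` callable (a human expert or an LLM judge) that decides whether a response satisfies a single criterion:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    criterion: str   # e.g. "Identifies the key risks in the investment memo"
    weight: float    # importance assigned by a domain expert

def rubric_reward(response: str, rubric: List[RubricItem],
                  judge: Callable[[str, str], bool]) -> float:
    """Subjective-domain reward: weighted share of rubric criteria the response satisfies."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if judge(response, item.criterion))
    return earned / total
```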
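
For computer-use tasks, the same ideas apply per episode, and containerization makes it cheap to run many episodes at once. Below is a rough sketch; `make_env` and `policy` are hypothetical stand-ins for whatever container orchestration and agent stack is actually in use:

```python
from concurrent.futures import ThreadPoolExecutor

def rollout(task, make_env, policy, max_steps: int = 50):
    """Run one agent episode in an isolated (e.g. containerized) environment."""
    env = make_env(task)                 # e.g. boots a fresh container with a desktop/browser
    observation = env.reset()
    trajectory, reward = [], 0.0
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, done = env.step(action)   # programmatic, verifiable transitions
        trajectory.append((observation, action, reward))
        if done:
            break
    env.close()
    return trajectory, reward

def parallel_rollouts(tasks, make_env, policy, workers: int = 64):
    """Horizontal scaling: run many episodes concurrently across isolated environments."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda task: rollout(task, make_env, policy), tasks))
```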

Environments Create Experience

Eventually, our AI systems will learn automatically from signals in the real world like pupils’ test scores increasing, sales closing, maybe even bridges being built. However, intermediate rewards will always remain critical. Similar to how humans learn from other people, models will need guidance on which styles of teaching and sales techniques are most effective. Humans will remain an integral part of the environments models learn from.

We will never escape the era of data; it must follow us to the frontier. That frontier is human-created environments that provide durable sources of experiential data. These environments can serve to train and evaluate models.

The Path Forward

Meeting today’s data demand requires rethinking the way we generate signal from human effort. Creating evals and RL environments is the highest-leverage and most durable use of people’s time. Mercor has helped pioneer environment generation using autograders and continues to push the boundaries of RL data with simulated workspaces, multi-turn support, and multi-modality.

Knowledge work will quickly converge on building RL environments and evaluations for agents to learn from. As AI enters the workforce and operates over proprietary information and under unique professional contexts, these environments codify knowledge and goals for agents. Once individual steps of agentic workflows reach sufficient reliability, all that will be left will be RL training on the goals laid out by humankind.