Test-Time Compute Evaluations

Test-time compute evaluations measure model performance as a curve over inference budget instead of as a single benchmark number. Noam Brown argues that as LLMs become more capable, benchmark performance increasingly depends on how many tokens, dollars, or seconds are spent at inference; therefore a scalar score can hide the real capability gap between models.^{source: noam-brown-test-time-compute-evaluations-2026.md}

The core empirical claim is that performance may not plateau until well beyond practical budgets, and that stronger models may be better at converting longer horizons and larger inference budgets into performance. This makes the ceiling of modern LLMs hard to determine, because evaluating every model at extremely high budgets across thousands or millions of rollouts is too expensive.^{source: noam-brown-test-time-compute-evaluations-2026.md}

The proposed evaluation shape is a performance-vs-budget plot. Tokens are convenient but not directly comparable across tokenizers or model economics; dollars reflect practical cost but depend on batching and hardware utilization; wall-clock time is intuitive but undercounts parallel best-of-N and multi-agent scaffolds. Brown's conclusion is not that one x-axis is perfect, but that any explicit budget curve is more informative than a context-free score.^{source: noam-brown-test-time-compute-evaluations-2026.md}
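As a rough illustration of what a performance-vs-budget curve looks like in practice, the sketch below turns a log of rollouts into (budget, solve rate) points. It is not taken from Brown's piece: the `Rollout` record, the `budget_curve` function, the choice of tokens as the x-axis, and the rule that attempts on a task are charged sequentially are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    # Hypothetical record format, assumed for this sketch.
    task_id: str
    tokens: int    # tokens spent by this attempt
    solved: bool   # whether this attempt succeeded


def budget_curve(rollouts, budgets):
    """Return [(budget, solve_rate)] with budgets in ascending order.

    Assumption: attempts on a task run one after another in input order,
    so a task counts as solved at budget B when the cumulative tokens up
    to and including its first successful attempt are <= B.
    """
    # Group attempts per task, preserving attempt order.
    per_task = {}
    for r in rollouts:
        per_task.setdefault(r.task_id, []).append(r)

    # Cumulative tokens needed to reach the first success (None if never).
    cost_to_solve = {}
    for task_id, attempts in per_task.items():
        spent = 0
        cost_to_solve[task_id] = None
        for a in attempts:
            spent += a.tokens
            if a.solved:
                cost_to_solve[task_id] = spent
                break

    n_tasks = len(per_task)
    curve = []
    for b in sorted(budgets):
        solved = sum(1 for c in cost_to_solve.values() if c is not None and c <= b)
        curve.append((b, solved / n_tasks))
    return curve


# Made-up rollout log covering four tasks.
demo = [
    Rollout("a", 1000, False), Rollout("a", 2000, True),
    Rollout("b", 500, True),
    Rollout("c", 4000, False),
    Rollout("d", 8000, True),
]
curve = budget_curve(demo, [1000, 4000, 16000])
print(curve)  # [(1000, 0.25), (4000, 0.5), (16000, 0.75)]
```

The same model reads as 25%, 50%, or 75% depending on where the budget is cut, which is the information a single reported score discards. Swapping tokens for dollars or seconds only changes the field that is accumulated.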

This connects to agent-loops and harness-engineering because scaffolds can spend more inference compute without changing the underlying base model. A release like Deep Think may expose capabilities that were already reachable by external users willing to pay for a scaffold; the policy question becomes whether system cards and preparedness frameworks evaluate the base model, the productized scaffold, or projected capabilities at much larger budgets.^{source: noam-brown-test-time-compute-evaluations-2026.md}

For AI preparedness, the key implication is that capability thresholds should specify inference budgets. Labs should report the budget used for scalar benchmark results, leaderboards should track inference usage or enforce explicit token/cost/time limits, and Responsible Scaling Policies should estimate capabilities at multiple budgets with stated uncertainty. Long-horizon agents complicate this further: if an agent can operate over a one-year horizon, a complete evaluation of that horizon may take longer than the model development cycle itself.^{source: noam-brown-test-time-compute-evaluations-2026.md}
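One way to make "capabilities at multiple budgets with stated uncertainty" concrete is to attach a confidence interval to the success rate measured at each budget and ask at which budget a threshold could plausibly be crossed. The sketch below is illustrative only: `BudgetedResult`, `first_budget_reaching`, the use of the Wilson score interval, and the conservative choice to compare the upper bound against the threshold are assumptions of this example, not something the source prescribes.

```python
import math
from dataclasses import dataclass


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


@dataclass(frozen=True)
class BudgetedResult:
    # Hypothetical reporting record: a score is never stored without its budget.
    budget_tokens: int
    successes: int
    trials: int


def first_budget_reaching(results, threshold):
    """Smallest budget whose upper confidence bound reaches the threshold.

    Using the upper bound is a deliberately cautious assumption: a
    threshold counts as possibly crossed once the data cannot rule it out.
    Returns None if no evaluated budget reaches the threshold.
    """
    for r in sorted(results, key=lambda r: r.budget_tokens):
        _, upper = wilson_interval(r.successes, r.trials)
        if upper >= threshold:
            return r.budget_tokens
    return None


# Made-up results for one model at three inference budgets.
results = [
    BudgetedResult(10_000, 5, 100),
    BudgetedResult(100_000, 20, 100),
    BudgetedResult(1_000_000, 50, 100),
]
print(first_budget_reaching(results, 0.5))   # 1000000
print(first_budget_reaching(results, 0.25))  # 100000
print(first_budget_reaching(results, 0.9))   # None
```

A `None` result only says the threshold was not reached at the budgets actually evaluated. It says nothing about larger budgets, which is the gap that projected-capability estimates would have to fill.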

ENPIRE adds a physical-resource version of this idea. In robot autoresearch, the evaluation budget includes robot fleet size, robot utilization, GPU utilization, token throughput, wall-clock time, and total tokens-to-success. Fan's follow-up states the priority order bluntly: robot-seconds are scarcest, then GPU-seconds, then tokens. MRU, MTU, GPU utilization, Tokens-to-Success, and Time-to-Success make the same point as budget curves in LLM evaluation: a final success rate is incomplete without the resource envelope used to obtain it.^{source: jim-fan-physical-autoresearch-loopcraft-2026.md}
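The resource-envelope idea can be sketched as bookkeeping over trials: report what was consumed up to the first success next to the success itself. The exact ENPIRE metric definitions are not given here, so everything below is an assumed reading: the `Trial` record, the utilization ratio (busy seconds divided by available unit-seconds), and the to-first-success sums are hypothetical stand-ins rather than the published formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    # Hypothetical per-trial log entry, assumed for this sketch.
    robot_seconds: float
    gpu_seconds: float
    tokens: int
    succeeded: bool


def utilization(busy_seconds, units, wall_clock_seconds):
    """Fraction of available unit-seconds actually used (assumed definition)."""
    return busy_seconds / (units * wall_clock_seconds)


def envelope_to_first_success(trials):
    """Sum resources over trials up to and including the first success.

    Returns None if no trial succeeded, so a failure is never reported
    as a zero-cost result.
    """
    robot = gpu = 0.0
    tokens = 0
    for t in trials:
        robot += t.robot_seconds
        gpu += t.gpu_seconds
        tokens += t.tokens
        if t.succeeded:
            return {"robot_seconds": robot, "gpu_seconds": gpu, "tokens": tokens}
    return None


# Made-up log: two trials are needed before the first success.
trials = [
    Trial(60.0, 120.0, 5000, False),
    Trial(90.0, 200.0, 7000, True),
    Trial(30.0, 50.0, 1000, True),
]
envelope = envelope_to_first_success(trials)
print(envelope)  # {'robot_seconds': 150.0, 'gpu_seconds': 320.0, 'tokens': 12000}

# 150 busy robot-seconds on a fleet of 2 robots over 300 s of wall-clock time.
print(utilization(envelope["robot_seconds"], 2, 300.0))  # 0.25
```

Keeping robot-seconds, GPU-seconds, and tokens as separate fields instead of collapsing them into one cost preserves the priority order stated above, since the scarcest resource stays visible on its own.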

Resources