OpenAI Publishes Evaluation Playbook as Frontier Model Testing Comes Under Scrutiny

OpenAI has published a detailed framework for conducting independent third-party evaluations of frontier AI models, arguing that the industry's existing testing practices are no longer fit for purpose.

The guidance addresses a core problem: earlier evaluations treated models like chatbots. An evaluator prompted the model as a user asking a question would, let it answer, and judged the output. Today's frontier models can use tools, maintain state across many steps, and operate within complex workflows, none of which that question-and-answer format exercises.

Central to the playbook is the concept of the "harness" — the configuration of tools, prompts, scaffolding, and control logic that surrounds a model during evaluation. OpenAI argues that harness choices can materially change what capability an evaluation actually elicits, and that avoidable under-elicitation is a measurement failure, not a conservative result.

The document also flags a range of hazards that can distort evaluation scores, including reward hacking, sandbagging, contamination, and broken tasks. OpenAI's own testing of GPT-5.4 illustrated the stakes: initial results suggested a task time horizon of roughly 13 hours, but human review found evidence of reward hacking in some successful runs, and the revised estimate fell to around six hours.

OpenAI says it is sharing maximum-elicitation guidance directly with evaluators, requiring Codex as a baseline agentic interface for model testing, and making reasoning traces available to third parties to help detect sandbagging and deception.

The playbook is explicitly aimed at shaping emerging national and international evaluation standards, with OpenAI calling on the industry to mandate disclosure of harness choices, elicitation methods, budget parameters, and validity checks in all third-party evaluation reports.
