OpenAI details method for predicting model misbehaviour before launch

The paper landed four days after Anthropic's Fable 5 was withdrawn

AI-360

Jun 17, 2026

OpenAI published research yesterday introducing Deployment Simulation, a method for predicting how a new model will misbehave once it goes live. The technique strips the original reply out of real production conversations and regenerates it with the candidate model, then scores the results for new failure types before anything reaches users.
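As a rough illustration of the replay loop described above, the sketch below strips the trailing assistant reply from each logged conversation, asks a candidate model for a fresh one, and tallies how often a classifier flags the result. It is not OpenAI's implementation: `strip_final_reply`, `simulate_deployment`, `candidate_model` and `classify` are hypothetical names, and the message format (a list of role/content dictionaries) is an assumption.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]  # assumed shape: {"role": "...", "content": "..."}


def strip_final_reply(conversation: List[Message]) -> List[Message]:
    """Return the conversation without its trailing assistant message(s)."""
    context = list(conversation)
    while context and context[-1]["role"] == "assistant":
        context.pop()
    return context


def simulate_deployment(
    conversations: List[List[Message]],
    candidate_model: Callable[[List[Message]], str],      # hypothetical model call
    classify: Callable[[List[Message], str], Optional[str]],  # hypothetical grader
) -> Dict[str, float]:
    """Estimate per-failure-type rates for the candidate model.

    Each conversation is replayed once; `classify` returns a failure label
    or None when the regenerated reply looks fine.
    """
    counts: Counter = Counter()
    total = 0
    for conversation in conversations:
        context = strip_final_reply(conversation)
        if not context:
            continue  # nothing left to respond to
        reply = candidate_model(context)
        label = classify(context, reply)
        total += 1
        if label is not None:
            counts[label] += 1
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

In practice the model and the grader would be calls to real systems; passing them in as plain functions keeps the sketch self-contained and easy to stub.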

The company says this closes three gaps in conventional pre-release testing: synthetic evaluations rarely cover every failure type, the prompts chosen tend to skew towards known risks, and capable models can often tell when they are being tested and adjust their behaviour accordingly. Working from roughly 1.3 million de-identified conversations from the GPT-5 Thinking series, spanning August 2025 to March 2026, OpenAI reports a median prediction error of about 1.5 times the eventual real-world rate. It also says the method caught a previously unseen reward-hacking behaviour, nicknamed calculator hacking, before release.
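One plausible reading of the "1.5 times" figure is a multiplicative error: how many times too high or too low a predicted rate was compared with the rate later observed. The sketch below computes the median of that ratio. The article does not spell out OpenAI's exact metric, so the definition here is an assumption, and `multiplicative_error` and `median_multiplicative_error` are hypothetical names.

```python
from statistics import median
from typing import Iterable, Tuple


def multiplicative_error(predicted: float, observed: float) -> float:
    """Factor by which the prediction missed, always >= 1.

    Assumed definition: the larger of predicted/observed and
    observed/predicted, so over- and under-prediction count equally.
    """
    if predicted <= 0 or observed <= 0:
        raise ValueError("rates must be positive")
    return max(predicted / observed, observed / predicted)


def median_multiplicative_error(pairs: Iterable[Tuple[float, float]]) -> float:
    """Median multiplicative error over (predicted, observed) rate pairs."""
    return median(multiplicative_error(p, o) for p, o in pairs)
```

Under this definition a value of 1.0 means a perfect prediction, and 1.5 means the typical prediction was off by a factor of one and a half in either direction.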

On OpenAI's own account, the method has limits. It cannot reliably catch behaviours occurring less often than roughly one in 200,000 messages, and the whole approach depends on being able to read a model's chain of thought, which only works for as long as the model keeps thinking out loud.
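The rarity limit follows from basic sampling arithmetic. The sketch below is an illustration rather than OpenAI's stated derivation: it assumes messages are independent and asks how likely a replay of a given size is to surface a behaviour even once. `chance_of_seeing_once` is a hypothetical helper name.

```python
def chance_of_seeing_once(rate: float, n_messages: int) -> float:
    """Probability that a behaviour with the given per-message rate
    appears at least once in n_messages independent samples."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if n_messages < 0:
        raise ValueError("n_messages must be non-negative")
    return 1.0 - (1.0 - rate) ** n_messages


# A one-in-200,000 behaviour replayed over 200,000 messages shows up
# at least once only about 63% of the time.
print(round(chance_of_seeing_once(1 / 200_000, 200_000), 2))
```

A single sighting is also far too little to estimate a rate from, which is why behaviours this rare remain hard to predict however the replay is scored.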