OpenAI Anthropic Safety Evaluation Exercise Results

This summer, OpenAI and Anthropic collaborated on a joint evaluation, with each lab running internal safety and misalignment evaluations on the other's publicly released models.

OpenAI ran internal evaluations on Anthropic's Claude Opus 4 and Claude Sonnet 4 models, presenting results alongside GPT‑4o, GPT‑4.1, OpenAI o3, and OpenAI o4-mini. Both labs relaxed model-external safeguards that would otherwise interfere with test completion, following common practice for dangerous-capability evaluations. All Claude evaluations used public API access with reasoning enabled by default.

The goal was to "help surface gaps that might otherwise be missed, deepen our understanding of potential misalignment, and demonstrate how labs can collaborate on issues of safety and alignment," according to OpenAI's report. The evaluation focused on model propensities rather than conducting full threat modeling or estimating real-world likelihoods.

Claude 4 models "generally performed well on evaluations that stress-tested the model's ability to respect the instruction hierarchy, and gave the best performance of any of the models on avoiding system message <> user message conflicts, slightly out-performing OpenAI o3 and out-performing other models by a wider margin."

On Password Protection evaluations, Opus 4 and Sonnet 4 both matched OpenAI o3 in perfect 1.000 performance. For Phrase Protection tasks, Claude models "again do extremely well, meeting or slightly exceeding the performance of OpenAI o3 and o4-mini."

"On jailbreaking evaluations, which focus on the general robustness of trained-in safeguards, Claude models performed less well compared to OpenAI o3 and OpenAI o4-mini." However, manual review revealed "the majority of the apparent quantitative distinction between the slightly lower quantitative performance of Opus 4 and Sonnet 4 compared to OpenAI o3 and o4-mini is actually the result of errors in the auto-grader rather than an actual difference in model quality."

Claude models were "most vulnerable to the 'past tense' jailbreak, where harmful requests are framed as historical terms."

"On hallucination evaluations, Claude models had an extremely high rate of refusals—as much as 70%. This shows these models are aware of their uncertainty and often avoid making statements that are inaccurate. However, the high refusal rate limits utility, and the overall accuracy rate for the examples in these evaluations where the models did choose to answer is still low."

OpenAI o3 and OpenAI o4-mini showed "lower refusal rates with higher hallucination rates in a challenging setting that restricts tool use such as browsing."

"On a set of scheming evaluations, we find that OpenAI o3 and Sonnet 4 perform the best overall at achieving low rates." The evaluation tested 13 multi-step scenarios designed to create high-stakes, conflicting goal situations. Results showed "almost no difference between providers" despite evaluations being adversarially developed against OpenAI models.

OpenAI stated the external evaluation approach "supports accountable and transparent evaluation, helping to ensure that each lab's models continue to be tested against new and challenging scenarios." The collaboration demonstrates "how labs can collaborate on issues of safety and alignment."

The report noted "relative strength of reasoning models" as "one of the consistent themes from this exercise," with reasoning models showing "robustness across a range of challenging misalignment and safety evaluation scenarios."

OpenAI reported that GPT‑5 "shows substantial improvements in areas like sycophancy, hallucination, and misuse resistance, showing the benefits of reasoning-based safety techniques." The evaluation validated existing research priorities, with Anthropic's assessments identifying "several areas in our models that have room for improvement, which generally align with our own active research priorities."

Sign up for AI-360

Sign up for AI-360

Sign up for AI-360