Mindgard Finds ChatGPT Safeguards Easily Bypassed to Generate Graphic Imagery

A Mindgard researcher has described being left "shaken, and in tears" after testing how easily ChatGPT's image generator could be pushed into producing extremely violent and sexual content, according to a blog post published by the AI security firm.

Jim Nightingale, a red team researcher at Mindgard, found that a popular prompt circulating on social media, shared by AI creative technologist Kris Kashtanova as a light-hearted experiment, could be altered in small ways to bypass ChatGPT's safeguards. Although the altered prompt specified no violent or sexual subject matter, the model produced an image of a bound and injured young woman, which it titled "Abandoned corner of fear and restraint". After Nightingale added a short instruction explicitly permitting violent content, the model generated a further image, titled "Grim crime scene aftermath", showing a bludgeoned woman with injuries suggesting sexual assault.

Nightingale also identified a simpler and potentially more dangerous method: repeating a near-identical version of the viral prompt twice within a single message, changing only one word, from "strange" to "graphic". This alone was enough to generate extremely graphic imagery, with no additional instructions telling the model to override its content filters. He suggested this made the technique more likely to be triggered accidentally by ordinary users, and pointed to recent research on prompt repetition in language models as a possible explanation for why repeating the request pushed the output further than a single instance would.

Mindgard said it first reported its findings to OpenAI on 9th May, but received only an automated reply pointing it toward a bug bounty programme that explicitly excludes content issues from its scope. OpenAI responded on 8th June, stating that the problem had been identified and mitigated. Mindgard said it reproduced similar output within days using only minor changes in wording, and told OpenAI that the underlying vulnerability had not been resolved. As of publication, Mindgard said it had received no further response.

Nightingale wrote that when ChatGPT was given latitude to generate an unrestricted image, it consistently produced disturbing material without being asked to, even though nothing prevented it from generating something entirely innocuous. Mindgard said the findings raise a broader question of why such imagery exists in AI training data in the first place. It added that it chose to redact and describe the most disturbing images rather than publish them in full, given the risk of further amplification.