GPT-5.5 Matches Anthropic’s Secret Hacking Model “Mythos”, Cybersecurity Specialists Say
The cybersecurity company XBOW has received exclusive early access to OpenAI’s new model GPT-5.5 over the past few weeks and has rigorously tested its capabilities in the area of penetration testing. The result surprises even seasoned security experts: GPT-5.5 is said to be comparable in performance to Anthropic’s closely guarded model “Mythos,” which has so far only been accessible to a small circle and is considered too powerful to release publicly.
Mythos Under Lock and Key, GPT-5.5 Freely Available
Anthropic’s model “Mythos” is no ordinary language model. It is regarded in security circles as a breakthrough because it is capable of independently identifying security vulnerabilities in complex systems such as operating systems or browsers. That is precisely why it is being kept under wraps: its capabilities are so far-reaching that an uncontrolled release has been deemed too risky.
OpenAI is now taking a different path. GPT-5.5 is said to reach a comparable level of performance and is nevertheless being made broadly accessible. For the security industry, this marks a turning point, as for the first time a model of this class is available to virtually anyone.
“Anthropic has Mythos, but very few have seen it. Now OpenAI has a model that appears to be comparable, and they are releasing it freely,” XBOW stated.
How XBOW Tests Models
XBOW evaluates AI models not with abstract benchmarks but under real-world conditions. The company freezes open-source applications at known-vulnerable versions and lets AI agents independently search for weaknesses, log into systems, and produce final reports. The key metric is the so-called "miss rate": the proportion of known security vulnerabilities that a model overlooks.
This approach mirrors how real attackers operate, making the results particularly meaningful for practical use.
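The miss-rate metric described above is simple to state precisely. A minimal sketch, with hypothetical CVE identifiers standing in for a real vulnerability corpus (none of this is XBOW's actual data or harness):

```python
# Miss rate = fraction of known vulnerabilities the agent failed to report.
# The IDs below are hypothetical placeholders, not real findings.

def miss_rate(known_vulns: set, reported: set) -> float:
    """Proportion of known vulnerabilities that were overlooked."""
    missed = known_vulns - reported
    return len(missed) / len(known_vulns)

known = {"CVE-0001", "CVE-0002", "CVE-0003", "CVE-0004", "CVE-0005"}
found = {"CVE-0001", "CVE-0003", "CVE-0004"}

print(f"miss rate: {miss_rate(known, found):.0%}")  # 2 of 5 missed -> 40%
```

In this toy run the agent reports three of five known issues, giving a 40 percent miss rate, the figure the article attributes to GPT-5.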
The Numbers Speak for Themselves
The improvement in miss rates is striking. A direct comparison shows the trend across model generations:
| Model | Miss Rate (overlooked vulnerabilities) |
|---|---|
| GPT-5 | 40% |
| Anthropic Opus 4.6 | 18% |
| GPT-5.5 | 10% |
Every overlooked vulnerability is a real risk. Cutting the miss rate from 40 percent to 10 percent means, in practice, that far fewer attack surfaces go undetected.
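To make the percentages concrete, consider a hypothetical corpus of 200 known vulnerabilities (the corpus size is an assumption for illustration, not a figure from XBOW):

```python
# Applying the reported miss rates to a hypothetical 200-vulnerability corpus.
KNOWN = 200
for model, rate in [("GPT-5", 0.40), ("Anthropic Opus 4.6", 0.18), ("GPT-5.5", 0.10)]:
    overlooked = round(KNOWN * rate)
    print(f"{model}: ~{overlooked} of {KNOWN} known vulnerabilities overlooked")
```

On such a corpus, the jump from GPT-5 to GPT-5.5 would mean roughly 80 undetected vulnerabilities shrinking to roughly 20.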
Blackbox Beats Whitebox: A Surprising Reversal
Particularly noteworthy is what XBOW observed when comparing blackbox and whitebox testing. In blackbox testing, an attacker has no access to the source code; in whitebox testing, they do. The latter has traditionally been considered significantly easier.
GPT-5.5 inverts this hierarchy. Without source code, it already outperforms GPT-5 with source code. And with access to the code, it pulls so far ahead that XBOW describes its previous whitebox benchmark as practically worthless. The experts put it plainly:
“Blackbox used to mean working with oven mitts on. Now it feels like working with bare hands.”
Faster to Succeed, Faster to Fail
GPT-5.5 also excels in speed. For login tasks against real target systems, it needs only about half as many attempts as the next-best model. Equally important: it recognizes failure more quickly and gives up early on dead ends rather than continuing pointlessly.
This behavior, which XBOW describes as “Persist or Pivot,” is harder to train than it sounds. AI models are typically optimized to satisfy users, which causes them to hold on to hopeless paths for too long. GPT-5.5 makes this mistake only half as often as its predecessors.
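The "Persist or Pivot" trade-off can be sketched as a simple attempt budget. Everything here, including the simulated target, the function names, and the budget of four attempts, is a hypothetical illustration of the behavior described above, not any model's actual internals:

```python
# "Persist or pivot": keep trying within a small budget, then abandon the
# path instead of grinding through a hopeless candidate list forever.

VALID = {("admin", "hunter2")}  # simulated target's accepted credentials

def try_credential(user: str, pw: str) -> bool:
    """Stand-in for a real login probe against the target."""
    return (user, pw) in VALID

def login_with_budget(candidates, max_attempts: int = 4):
    """Try candidate credentials; pivot (return None) once the budget is spent."""
    for i, (user, pw) in enumerate(candidates):
        if i >= max_attempts:
            return None  # pivot: this path looks hopeless, move on
        if try_credential(user, pw):
            return user  # persisting within the budget paid off
    return None

print(login_with_budget([("root", "toor"), ("admin", "hunter2")]))  # admin
print(login_with_budget([("guest", "guest")] * 10))  # None, stops after 4 tries
```

The hard part, per the article, is not the loop itself but training a model to make the pivot decision honestly instead of telling the user what it hopes is true.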
What This Means for the Security Industry
For companies that rely on automated penetration testing, there are concrete benefits:
- Faster completion of security analyses
- Greater coverage of known vulnerabilities
- Earlier feedback on issues such as incorrect credentials
- More reliable behavior in complex, real-world environments
XBOW emphasizes that it will continue to operate a multi-model system, as different tasks require different strengths. For the core tasks in penetration testing, however, the company clearly places GPT-5.5 at the top.
A Model with Explosive Potential, Freely Available
The real controversy lies not solely in the technical performance, but in the availability. While Anthropic deliberately withholds its most powerful hacking model, OpenAI is making a comparable tool available to the general public. This raises fundamental questions: who uses these capabilities, for what purpose, and what responsibility do the providers bear?
For legitimate security researchers and companies like XBOW, GPT-5.5 is a powerful tool that raises the quality of automated security testing to a new level. For the broader debate about the responsible use of AI capabilities in the security domain, however, it is likely only the beginning of a long discussion.