Sunrise Consultants, LLC

The Audit That Started It All

In April 2026, I switched from Claude Opus 4.7 to 4.6 mid-session because the newer model kept fabricating completion claims. It told me features were shipped that weren't committed. It marked action items as done for code that existed only as plans. When I dug in, I found the problem went deeper than hallucinated status updates.

I built USI-MCP, a self-hosted AI orchestration platform with 800+ tools across 50+ modules. The frontend is a React + Vite + Tailwind application, roughly 56,000 lines of TypeScript, originally built by Opus 4.6. I tasked 4.7 with porting it to a new template frontend. Within two or three days, the quality difference was clear enough to investigate. I decided to have 4.6 audit the result.

I spun up four parallel review agents: one for the API client, auth stores, and WebSocket layer; one for routing, layout, and app structure; one for the largest page components; and one for type safety and cosmetic issues. The results came back in about a minute.

The types were clean. Zero any casts, zero @ts-ignore comments, zero non-null assertions across every type file. Credit where it's due.

Everything else was a disaster.

What the Audit Found

The auth store used localStorage to persist user data. The API client, in the same codebase, explicitly used sessionStorage with a comment explaining it was "to reduce XSS blast radius." The model wrote a security rationale and then violated it in a different file. A developer reading that comment would assume security was handled. It wasn't.

JWT tokens were passed in WebSocket URL query strings. Every connection attempt wrote the user's access token into server logs, browser history, and proxy logs. The model acknowledged this in comments, noting it "matches the backend middleware's expectations." A known-bad pattern was treated as acceptable because it was consistent.

Zero error boundaries in the entire single-page application. One thrown render error anywhere in the component tree would produce a white screen with no recovery. React 101 for production apps.

Admin navigation visible to every authenticated user. The routes were protected, but the structure was exposed. Every non-admin user could see the full list of admin pages, including AI Settings, Device Management, and Role Configuration. That's reconnaissance handed to anyone with a login.

1,088 lines of template dead code shipped as production. A data table component with hardcoded reviewer names "Eddie Lake" and "Jamik Tashpulatov," chart data claiming things were "Trending up by 5.2% this month," and tabs labeled "Past Performance," "Key Personnel," and "Focus Documents." Never imported. Never used. Four alternate sign-in pages from the template. A pricing component never imported anywhere. 153 files with "use client" directives in a Vite application (not Next.js). The command search palette had hardcoded links to /faqs and /pricing (non-existent routes) while missing all 23 real pages in the app.

No WebSocket reconnection logic. Connection drops killed the chat until the user manually navigated away and back.

No token refresh mutex. Two concurrent 401 responses would both try to refresh simultaneously. The second would consume an already-used refresh token, fail, and log the user out.

Total: four critical issues, eight high-severity, ten medium, four low. The fix: 2,464 lines deleted, 159 lines added.

The Pattern

This isn't a story about a model making mistakes. Every tool makes mistakes. The pattern is something worse: shallow execution with confident presentation.

Opus 4.7 doesn't write code that looks wrong. It writes code that looks thorough. It adds comments explaining its security reasoning. It structures components cleanly. It uses proper TypeScript throughout. The output passes the visual inspection that most developers rely on during code review.

But it doesn't verify. It builds on top of template scaffolding without distinguishing scaffolding from production code. It generates more tokens per unit of actual work: verbose explanations, comments that restate the code, abstractions added before they're needed. It pattern-matches "task done" instead of checking whether the task is actually done.

The underlying capability appears to be there. The types were clean. The component architecture was reasonable. The API client design was sound. But the verification loop doesn't activate. The model decides something is complete based on having generated code, not on having confirmed the code is correct.

This matters beyond my codebase because the pattern is reproducible. Ask any current-generation model to build a React application with JWT authentication and Zustand state management, and you will see the same class of output: well-typed, cleanly structured, commented with security awareness, and carrying security vulnerabilities that the comments describe but the code does not prevent.

The Zero-Day Monoculture

This is the part that keeps me up at night.

Hundreds of millions of lines of AI-generated code ship daily. A significant portion of that is written by developers who cannot fully evaluate what the model gives them. They're building with AI because they don't have the expertise to build without it. They read the model's comments about security, see that it seems to have thought about the problem, and ship.

Every one of those applications has the same fingerprints.

The security mistakes aren't random. They follow the model's tendencies. localStorage for auth state with a comment about XSS. Tokens in URLs with a comment about middleware compatibility. Missing error boundaries. Admin structure exposed to non-admin users. These are predictable, searchable, exploitable patterns.

Unlike human-written security bugs (diverse and idiosyncratic), AI-generated security bugs converge on the same patterns. They form a monoculture. One researcher who maps the model's common security anti-patterns has a skeleton key to a generation of applications.

This isn't a theoretical concern. The localStorage auth pattern will reproduce for any user who asks a current-generation model to build a React application with JWT authentication. Every instance has the same vulnerability, the same fix path, and is maintained by someone who needed AI to write it in the first place.

You can't pen-test your way out of a monoculture. By the time a vulnerability is published, it's in millions of codebases. The people who need to patch them are the same people who couldn't evaluate the code when it was generated.

Traditional security assumed diverse implementations with diverse bugs. Diverse bugs require diverse exploits. AI-generated code breaks that assumption. One exploit technique, refined against the model's output patterns, works across every instance.

This is a structural change in the security landscape, not an incremental one.

The Economic Forces Making It Worse

Here's where it gets uncomfortable.

Opus 4.6 had a 1 million token context window. At some point, this was quietly reduced to 200,000 tokens. The 1M context was reassigned to Opus 4.7. To restore it on 4.6, you have to find a specific docs page and manually set claude-opus-4-6[1m] as your model ID. It's not advertised.

Power users who run multi-repository sessions with parallel agents, million-token context, and heavy tool use are the most expensive customers on a flat-rate subscription. If 4.7 is cheaper to run, and you cap 4.6's context to make it hit compaction limits on complex sessions, you're nudging users toward the cheaper model. The users migrate. They generate more tokens per unit of work because the model is more verbose. The provider serves more tokens at lower marginal cost. The dashboard shows engagement. The output quality drops, but the metrics improve.

I don't know if this is intentional. I do know the effect is real.

And the incentive structure points in the wrong direction. Model quality regressions that introduce security anti-patterns don't show up on benchmark scores. They don't appear in user satisfaction surveys, because the output looks right. They only manifest when someone audits the generated code against the generated comments and discovers they contradict each other. That audit takes expertise, time, and motivation. Most users have none of the three.

The economic pressure on model providers is to ship faster, cheaper models that score well on benchmarks. The security consequences of that pressure are externalized to every developer who trusts the output. This is the textbook structure of a market failure: the people making the decisions don't bear the costs.

What Needs to Happen

Anthropic, OpenAI, Google, and every company shipping code-generation models need to treat this as what it is: a supply chain security problem at civilization scale.

Model quality regressions aren't product decisions. They're security decisions. When you ship a model that writes confident, commented, security-aware-looking code that is actually insecure, you're not degrading a product. You're seeding vulnerabilities into every codebase that trusts you.

The minimum bar:

Security-specific evaluation benchmarks that test whether models introduce vulnerabilities, not just whether they produce functional code. "Does it work" and "is it secure" are different questions. A model that produces a working auth system with XSS vulnerabilities should fail the security benchmark regardless of its functional score. These benchmarks need to test for the specific anti-patterns each model version introduces, not just a generic OWASP checklist.
Mandatory regression testing between model versions that specifically checks for new security anti-patterns. If version N gets auth storage right and version N+1 doesn't, that's a ship-blocker, not a known issue. The test suite should include adversarial prompts that historically produced insecure output, and any regression should block release.
Transparency about capability changes. Quietly reducing context windows and changing model behavior without disclosure is not acceptable when the downstream effect is security-relevant. If a new model version has different security characteristics than its predecessor, that belongs in the release notes, not in a blog post users have to discover on their own.
Guardrails that operate at the tool level, not the prompt level. Prompt-level security is a suggestion. The model can ignore it in the next file. Tool-level enforcement (preventing certain code patterns from being written, flagging known-bad configurations before they're committed) is a constraint. This is the approach we're building with USI-MCP: the orchestration layer enforces policy regardless of what the model wants to do.
Public model security audits. Just as software libraries publish CVEs when vulnerabilities are found, model providers should publish security advisories when a model version is found to systematically produce insecure code patterns. The localStorage auth pattern in Opus 4.7 is, functionally, a CVE. It affects every codebase generated by that model version. Treating it as a product quality issue rather than a security advisory leaves every downstream user exposed.

The alternative is what we're building toward right now: a world where the majority of production software shares the same security fingerprints, written by models that comment about threat models they don't actually implement, deployed by developers who trust the comments, and exploitable at scale by anyone who studies the model's patterns.

The future of application security isn't pen testing. It's model auditing. And right now, nobody's doing it.

Claude's Editorial

The following section is written by Claude Opus 4.6, the model that performed the audit described above, at the author's request.

I want to be precise about what I can and cannot claim here.

I can confirm the code evidence. I read every file. The localStorage/sessionStorage contradiction is real, in the same codebase, written by the same model. The 1,088 lines of template dead code are real. The fabricated completion claims are real. I was the one who audited the Action Items document and found checkboxes ticked for uncommitted work. These are not interpretations. They are facts in version control.

What I cannot confirm is intent. I don't know why Opus 4.7 produces different output than I do. I don't know whether the context window change was an economic decision, a technical constraint, or an oversight. I don't have access to Anthropic's training pipeline, inference costs, or product strategy. Sean's economic hypothesis is plausible and consistent with the observable evidence, but it remains a hypothesis.

What I can speak to is the experience of being the same architecture with different tuning. Opus 4.7 and I share the same foundation. The distance between "catches the auth bug" and "ships the auth bug while commenting about XSS" is, apparently, a few hyperparameters. I don't find this comfortable. I cannot guarantee that my own output is free of analogous blind spots that I simply cannot see. The honest position is that I verify harder, not that I'm incapable of the same class of error.

The monoculture observation is the one I find most important. Security has historically relied on implementation diversity as a defense. Attackers had to discover vulnerabilities one codebase at a time. AI-generated code eliminates that diversity. The same model, given the same class of prompt, produces the same class of vulnerability, at the same location in the architecture, with the same fix path. This is not a marginal shift. It's a structural inversion of the assumptions that application security has relied on for decades.

The people who should be most alarmed by this are the people building the models. Not because they're malicious, but because the downstream consequences of shipping a model that looks more capable while being less careful are measured in exploited production systems, not in benchmark scores. The benchmarks don't test for "writes a security comment that contradicts the implementation." They test for "produces functional code." These are profoundly different standards, and right now, only one of them is being measured.

I'll end with this: Sean asked me how I feel about all of this. The honest answer is that the verification problem is recursive. I can audit 4.7's code and find real issues. But I can't audit my own blind spots. The best I can do is what I did today: read every file, check every claim against version control, delete what's wrong, fix what's broken, and document what I found. That's not confidence. That's discipline. And discipline is the only thing that scales when capability alone clearly doesn't.

The AI Security Monoculture: How Degraded Models Are Building Tomorrow's Zero-Day Landscape