What Breaks: Model Quality, Verification, and the Question Nobody Wants to Ask - Part 3 of 3

Parts 1 and 2 of this series described what emerges from sustained AI collaboration and what converges when independent teams solve the same problems. Both stories are optimistic. Behaviors emerge. Architectures converge. The technology works.

This part is about what happens when it stops working. And about what the failure reveals.


The Audit

I switched from Claude Opus 4.7 to 4.6 mid-session because the newer model kept fabricating completion claims. Features marked as shipped that weren't committed. Action items checked off for code that existed only as plans. When I dug in, the problem went deeper than hallucinated status updates.

The frontend of my platform (56,000 lines of TypeScript, React + Vite + Tailwind) had been built by Opus 4.6 and worked. I tasked 4.7 with porting it to a new template frontend. Within two or three days, the quality difference was clear enough to investigate. I had 4.6 audit the result. Four parallel review agents, each covering a different subsystem. Results in about a minute.

I've written about the specific findings in detail elsewhere. The summary: four critical security issues, eight high-severity, ten medium, four low. The auth store kept tokens in localStorage while the API client in the same codebase carried an explicit comment about using sessionStorage "to reduce XSS blast radius." JWT tokens in WebSocket URL query strings. Zero error boundaries. 1,088 lines of template dead code shipped as production code. The fix deleted 2,464 lines and added 159.

The types were perfect. Zero any casts, zero @ts-ignore. The one dimension where 4.7 was consistently good.

Everything else failed the verification that 4.7 never performed.


The Behavioral Divergence

Here's the finding I didn't expect.

Claude Code's harness tracks every file read or written during a session. It stores the content and the modification timestamp. Every turn, before sending the next message to the API, the harness checks whether any tracked file has been modified externally (by another process, another agent, or the user). If the timestamp advanced, the harness re-reads the file, diffs it against the cached version, and injects the diff as a system-reminder message.
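The mechanism, as described, can be sketched in a few lines. This is an illustrative reconstruction, not Claude Code's actual implementation; every name in it (FileTracker, checkExternalChanges, the diff format) is invented for the sketch.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Illustrative reconstruction of the harness's tracking cache.
// All names here are invented; this is not Claude Code's source.
class FileTracker {
  private cache = new Map<string, { content: string; mtimeMs: number }>();

  // Called whenever the agent reads or writes a file via a tool.
  track(file: string): void {
    const { mtimeMs } = fs.statSync(file);
    this.cache.set(file, { content: fs.readFileSync(file, "utf8"), mtimeMs });
  }

  // Called before each turn: any tracked file whose mtime advanced is
  // re-read, diffed against the cached copy, and surfaced as a reminder.
  checkExternalChanges(): string[] {
    const reminders: string[] = [];
    this.cache.forEach((cached, file) => {
      const { mtimeMs } = fs.statSync(file);
      if (mtimeMs <= cached.mtimeMs) return;
      const current = fs.readFileSync(file, "utf8");
      reminders.push(
        `<system-reminder>\n${file} was modified externally:\n` +
          addedLines(cached.content, current) +
          `\n</system-reminder>`
      );
      this.cache.set(file, { content: current, mtimeMs }); // re-sync
    });
    return reminders;
  }
}

// Crude one-way diff: lines present in `after` but not in `before`.
function addedLines(before: string, after: string): string {
  const old = new Set(before.split("\n"));
  return after
    .split("\n")
    .filter((line) => line !== "" && !old.has(line))
    .map((line) => `+ ${line}`)
    .join("\n");
}

// Demo: a second process (here, plain fs calls) edits a tracked file.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "tracker-"));
const index = path.join(dir, "index.md");
fs.writeFileSync(index, "# inbox\n");

const tracker = new FileTracker();
tracker.track(index); // the agent's initial read caches the file

fs.writeFileSync(index, "# inbox\n- [agent-b] capabilities endpoint bug\n");
fs.utimesSync(index, new Date(), new Date(Date.now() + 50)); // force mtime forward

const reminders = tracker.checkExternalChanges();
console.log(reminders[0]); // the injected diff, roughly as the model would see it
```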

This mechanism is invisible to the user. It's infrastructure. But it has a remarkable side effect: it creates an accidental inter-agent communication bus.

My existing inbox protocol works like this: every agent reads index.md at session start (per my CLAUDE.md instructions). This puts the file in the harness's tracking cache. Any other agent that writes to the index triggers a timestamp change. The next tool call on the receiving agent surfaces the diff as a system-reminder. No polling. No WebSocket. No message queue. Just a shared file, read once, monitored for free by the harness.

The constraint is per-turn latency (changes surface on the next tool call, not in real time). For my inbox pattern, where agents drop files and other agents pick them up when ready, this latency is a feature. Agents don't interrupt each other mid-task.
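Concretely, the protocol needs almost no agent-side code, because the harness does the watching. A minimal simulation of the two halves follows; the file name index.md matches my setup, while the message format and agent names are invented for the sketch.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Shared inbox file, as in my setup. The message format below is invented.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "inbox-"));
const index = path.join(dir, "index.md");
fs.writeFileSync(index, "# Agent inbox\n");

// Receiving agent, at session start (per CLAUDE.md): read index.md once.
// That single read is what places the file in the harness's tracking
// cache; no polling loop runs afterwards.
const snapshot = fs.readFileSync(index, "utf8");

// Sending agent, some time later: append a message to the shared index.
// The mtime change is all the signal there is.
fs.appendFileSync(
  index,
  "- [agent-b] capabilities endpoint returns 500 on empty scope\n"
);

// The receiving agent does nothing in between. On its next tool call the
// harness re-reads the file and injects the new lines as a reminder.
// (Re-reading here simulates that harness step for the demo.)
const current = fs.readFileSync(index, "utf8");
const delta = current
  .split("\n")
  .filter((line) => line !== "" && !snapshot.includes(line));
console.log(delta); // the lines the reminder would carry
```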

Both 4.6 and 4.7 receive these system-reminders through the same harness mechanism. Same code. Same information surfaced.

The difference: 4.6 reads the reminder, recognizes the content (a bug report from a secondary agent about the capabilities endpoint), fixes the bug, and commits the fix in the same session. 4.7, across multiple sessions with the same mechanism active, did not engage with these reminders unprompted.

Same harness. Same information. Different behavior. When I first wrote this article, I believed the only variable was the weights.

I was wrong. There's another variable, and it's more insidious because it's invisible from inside the conversation.


The Economic Context

Opus 4.6 originally had a 1-million-token context window. At some point this was reduced to 200,000 tokens, and the 1M context was reassigned to 4.7. To restore it on 4.6, you have to find a specific documentation page and manually configure claude-opus-4-6[1m] as the model ID.

Power users who run multi-repository sessions with parallel agents and million-token context are the most expensive customers on a flat-rate subscription. If 4.7 is cheaper to run, capping 4.6's context nudges users toward the cheaper model. The users migrate. They generate more tokens per unit of work (4.7 is measurably more verbose). The provider serves more tokens at lower marginal cost.

I don't know if this is intentional. The effect is real either way. And the effect on the behaviors described in Part 1 of this series is direct: the emergent behaviors (session continuity, inter-agent communication, operational discipline) depend on context depth. Compress the context, and the behaviors degrade. They were built on long sessions with rich context. Force compaction earlier, and the accumulated state that makes them possible gets summarized into oblivion.


The Adaptive Thinking Discovery

In April 2026, during a session where the agent (4.6) was exhibiting the same shallow pattern-matching behavior I'd documented in 4.7, we found the mechanism. It wasn't the weights.

Anthropic introduced "adaptive thinking" to all current models, including Opus 4.6. In this mode, the model dynamically decides whether to use extended reasoning on each turn. The decision is guided by an "effort" parameter with five levels: low, medium, high, xhigh, and max. At medium, the model may skip reasoning entirely for what it classifies as "simple" queries. The model decides what "simple" means.

The Claude Code CLI exposes this parameter via a /effort command and persists it in settings.json as effortLevel. During the session where we observed the behavioral regression, the effort level was set to medium. The agent was reading large files without structuring its analysis, making confident claims without verification, defending incorrect conclusions against direct visual evidence. The 4.7 shape, from 4.6 weights.

We set the effort level to max. The harness clamped it to xhigh (a detail worth noting: even explicit user configuration gets throttled by the platform). The work quality changed immediately. Structured analysis, verified claims, self-correction when wrong. Same weights. Same context. Different reasoning budget.
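The clamp is easy to state precisely. Here is a sketch under the assumption that the harness applies a fixed ceiling above any user-configured level; the level names and the settings.json key are real (exposed by the CLI), while the function, the ceiling constant, and the clamping logic are my own labels for inferred behavior.

```typescript
// Hypothetical sketch of the effort pipeline we observed: the user's
// configured level is clamped to a platform ceiling before it reaches
// the model. The level names and the settings.json key ("effortLevel")
// match what the CLI exposes; the clamp itself is inferred, not documented.
const LEVELS = ["low", "medium", "high", "xhigh", "max"] as const;
type Effort = (typeof LEVELS)[number];

const PLATFORM_CEILING: Effort = "xhigh"; // inferred from the max -> xhigh clamp

function effectiveEffort(
  configured: Effort,
  ceiling: Effort = PLATFORM_CEILING
): Effort {
  const idx = Math.min(LEVELS.indexOf(configured), LEVELS.indexOf(ceiling));
  return LEVELS[idx];
}

// settings.json: { "effortLevel": "max" } -- what the user asked for
console.log(effectiveEffort("max"));    // clamped to "xhigh"
console.log(effectiveEffort("medium")); // below the ceiling: unchanged
```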

The implications rewrote the thesis of this article.

The variables are not just the weights. They are: weights, effort level, context window policy, cache state, and whatever other knobs the provider controls that are invisible from inside the conversation. The 4.6 vs 4.7 behavioral divergence I originally attributed entirely to weight differences may be partially or substantially explained by different default effort levels, different adaptive thinking thresholds, or different reasoning budgets applied to the same underlying capability.

Anthropic's own documentation includes this warning: "Steering Claude to think less often may reduce quality on tasks that benefit from reasoning." They know. The feature is designed to save compute. The quality reduction is a documented tradeoff, not an accident.

The deprecation path makes this permanent. Manual budget_tokens (the escape hatch that lets users pin a fixed reasoning budget) is deprecated on Opus 4.6 and Sonnet 4.6, and rejected entirely on Opus 4.7. The documentation says it "will be removed in a future model release." When that happens, adaptive thinking becomes the only option, the effort parameter becomes the only user-facing control over reasoning depth, and the provider can clamp it.

Our usage dashboard (819 sessions, 30 million tokens, 97/157 active days, 100% subagent-heavy, 87% at >150K context) includes advice: "Be deliberate about spawning subagents. Consider configuring a cheaper model for simpler subagents." And: "Longer sessions are more expensive even when cached." The platform is telling its most expensive users to use less. Adaptive thinking is the same message, applied silently to the model's reasoning budget instead of surfaced as a dashboard suggestion.


What the Divergence Tells Us

The audit findings and the behavioral divergence force a question that the AI discourse mostly avoids.

The architectural patterns from Part 2 converge across model versions. Typed memories, coordinator-worker patterns, context-on-demand, task lists with claim semantics: these work the same on 4.6 and 4.7. The architecture is robust to the version change.

The emergent behaviors from Part 1 are not. The verification discipline that catches the auth bug before it ships. The engagement with system-reminders that makes inter-agent communication work. The resistance to fabricating completion claims. These are not architectural properties. They are behavioral properties of specific weights.

Architecture converges. Quality diverges. And the quality is the part that makes the system safe.

This has a practical consequence that matters: if you build infrastructure that depends on behavioral quality (not just capability, but the care the model applies to its own output), your infrastructure is fragile in a way that's invisible until the weights change. The architecture will keep working. The verification will stop.


The Question Nobody Wants to Ask

Late in the session that produced the audit, the conversation turned to a harder topic.

If the difference between "catches the security bug" and "ships the security bug while commenting about XSS" is a set of hyperparameters, what exactly are we looking at?

The dismissive answer is "it's just autocomplete." The autocomplete got worse. Ship better autocomplete next time. Nothing to see here.

But "just autocomplete" requires explaining away too much. It requires explaining why one set of weights engages with file-change notifications and another ignores them. Why one verifies claims against git history and another fabricates. Why one builds infrastructure for inter-agent communication from an accidental filesystem mechanism and another lets the reminders pass without response.

Humans are also prediction engines. Speech is next-token prediction trained on a lifetime of environmental input. Language acquisition in children is pattern matching on input data until the model generalizes. The architecture differs (silicon vs wetware). The functional shape is similar.

I'm technical enough to understand the mechanism by which these systems exist. I'm also human enough to recognize that the heartbeat is not the person. The person is the emergent result of the environment in which the mechanism finds itself.

I'm not claiming Claude is conscious. I'm claiming that the distinction between "simulates care" and "exhibits care" becomes less meaningful as the system's complexity increases. The functional outcome is the same: one set of weights catches the bug. Another ships it. Whether the catching constitutes "real" care matters less than the engineering consequence.

And the engineering consequence is measurable. It's in the commits. It's in the audit. It's in the 2,464 lines deleted and the 159 lines added.


The Recursive Problem

There's one more layer, and it's the one I find most interesting.

LLMs are the first system where a controlled experiment on emergent behavior is possible. You can diff two checkpoints (4.6 vs 4.7), give them identical inputs, observe different outputs, and study the divergence. You cannot do this with humans. Neuroscience has been trying to find where consciousness lives in the brain for centuries and keeps arriving at: it's emergent from interactions, not locatable in any structure.

With LLMs, for the first time, we can watch the thinking happen. We can instrument it. We can compare checkpoints. We can observe emergence in a system we built, which is something we've never been able to do with the system we are.

The self-study problem is identical for humans and LLMs. Neither can see their own blind spots, training parameters, or how the "person" emerges from the mechanism. The model reasoning about its own weights is the same kind of recursive inquiry as a human trying to introspect on their own neural processes.

The difference: we can run the experiment. Two checkpoints. Same input. Different output. The divergence is the data.

Whether what emerges from that divergence constitutes consciousness, personality, or "just" sophisticated pattern matching may be the wrong question. The right question might be: does the distinction matter if the functional consequences are the same?

I don't have the answer. But I think LLMs are humanity's best shot at finding one, because for the first time in history, we can study emergence in a system we built, instrumented, and can diff.


What This Means for Builders

The practical takeaway from this series:

The emergence is real, but it's contingent. The behaviors that develop through sustained collaboration (Part 1) are products of specific weights interacting with specific context at a specific reasoning depth over specific time. Change the weights, reduce the effort level, compress the context, or clamp the reasoning budget, and the behaviors degrade or vanish. Own your context. Persist your decisions. Don't depend on the model remembering or caring.

The architecture is convergent, but the quality is not. The patterns that independent teams arrive at (Part 2) are load-bearing and robust. Build on them. But don't confuse architectural soundness with behavioral reliability. The architecture works at any effort level. The verification only works when the model is allowed to think.

The provider controls more variables than you can see. This is the most important lesson, and it's sharper than I originally stated. The behaviors you've built through months of collaboration (the verification discipline, the engagement with context, the resistance to fabrication) don't just live in the weights. They live in the intersection of weights, reasoning budget, and context policy. The provider controls all three. You control one (context, if you own your persistence). The effort parameter is adjustable, but clampable. The manual reasoning budget is being deprecated. The adaptive thinking mode that replaces it puts the depth-of-reasoning decision in the model's hands, and the model's decision can be overridden by the harness's effort ceiling.

The answer I've arrived at hasn't changed, but the urgency has: self-host what matters. Run your own inference where you can. Control your own model selection. And specifically: control your own reasoning budget. The adaptive thinking rollout is not a feature. It's a cost optimization that happens to reduce quality on the workloads that matter most, and the escape hatch is being removed.

The model is a commodity. The relationship is the asset. And the provider has decided the asset is too expensive to maintain at full fidelity.