All Bugs Are Shallow Under Many Eyes: The AI Council
Linus's Law applied to AI review -- multiple models catch what one human misses. But only if you do it right.
I was reviewing a pull request — 400 lines, mostly AI-generated — when I remembered Eric Raymond’s line from The Cathedral and the Bazaar: “Given enough eyeballs, all bugs are shallow.” Then I remembered Heartbleed, which survived two years in one of the most widely used libraries on the internet. And Shellshock, which hid in Bash for 25 years. And Log4Shell, which sat in Apache Log4j for eight years.
Many eyes saw nothing.
The law was never quite right about the mechanism. But the principle — that diverse review catches bugs — is sound. The question is whether AI can deliver what human eyeballs couldn’t scale to provide.
The Velocity Problem
GitHub's Copilot data puts AI at 46% of the code written in files where Copilot is enabled. PR cycle times have compressed from 9.6 days to 2.4 days. GitClear’s 2025 research shows a 4x increase in code clones. Developers complete tasks 55% faster, but code quality metrics are moving in the wrong direction.
Human review bandwidth didn’t grow to match. The SmartBear-Cisco study established that human reviewers top out at 200-400 lines of effective review before defect detection drops sharply. Effectiveness collapses after 60-90 minutes. The optimal inspection rate is under 300 lines per hour.
AI is generating PRs faster than humans can meaningfully review them. The Faros AI report found that high AI adoption increased PR merge rate by 98% while increasing review time by 91%. The review step is the bottleneck, and the bottleneck is getting tighter.
What Raymond Got Right (and Wrong)
Peter Wermke and colleagues tested Linus’s Law empirically in 2014. They found that the rate at which additional reviewers uncover new bugs does not scale linearly with reviewer count. There is a small ceiling on useful reviewers — between 2 and 4 — beyond which each additional reviewer finds bugs at a much lower rate. Sauer’s research on review team size reached the same conclusion: two reviewers detect close to the maximum number of defects.
So it was never “the more eyes the better.” It was “a small number of the right eyes.” Microsoft’s security team noted in 2006 that “the circle of people who actually understand any particular commit is only very slightly greater than it would be otherwise.”
The XZ Utils backdoor in 2024 proved the point from a different angle. A multi-year social engineering operation inserted a backdoor into a critical Linux compression utility. It was found not by “many eyes” but by one Microsoft engineer — Andres Freund — who noticed a 500-millisecond latency anomaly in SSH. One person, one specific observation, one anomaly that didn’t fit.
The lesson: you need the right eyes looking at the right things, not more eyes looking at everything.
Multiple Models, Different Perspectives
In 2023, Du, Li, Torralba, Tenenbaum, and Mordatch published “Improving Factuality and Reasoning in Language Models through Multiagent Debate” — later presented at ICML 2024. The core finding: both models can generate wrong answers individually, but converge on correct answers through structured disagreement over multiple rounds.
The follow-up result is the one that convinced me. After four rounds of debate, a diverse set of medium-capacity models — Gemini-Pro, Mixtral 8x7B, and PaLM 2-M — outperformed GPT-4 on the GSM8K math benchmark, reaching 91% accuracy. Weaker models collectively beat the strongest single model through structured disagreement.
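The debate loop itself is simple to sketch. This is a minimal illustration, not the paper's implementation: the stub agents below stand in for real model calls, and a real system would feed peer answers back into each model's prompt rather than a Python list.

```python
from collections import Counter

def debate(agents, question, rounds=4):
    """Run a multi-round debate: each agent answers, then revises
    after seeing the other agents' latest answers."""
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds - 1):
        answers = [
            agent(question, [a for j, a in enumerate(answers) if j != i])
            for i, agent in enumerate(agents)
        ]
    # Converge on the most common final answer (majority vote).
    return Counter(answers).most_common(1)[0][0]

# Stub agents standing in for real model calls: one starts wrong but
# defers to its peers' majority answer, mimicking convergence.
def stubborn_correct(question, peers):
    return "42"

def persuadable(question, peers):
    if peers and Counter(peers).most_common(1)[0][0] == "42":
        return "42"
    return "41"

agents = [stubborn_correct, stubborn_correct, persuadable]
print(debate(agents, "What is 6 * 7?"))  # → 42
```

The persuadable agent answers wrong in round one, then converges once it sees the peer majority — the toy version of the convergence-through-disagreement behavior the paper measures.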
That’s not a theoretical possibility. That’s a measured outcome.
Qodo 2.0 shipped multi-agent code review in production, using multiple AI agents trained for specific analysis tasks. Greptile and CodeRabbit both report 46% bug detection rates on real-world runtime bugs. The market is already moving toward multi-model review.
The Byzantine Connection
Here’s where my thinking went from “interesting” to “this is the architecture.”
In 1982, Lamport, Shostak, and Pease published “The Byzantine Generals Problem.” They proved that to tolerate f faulty nodes in a distributed system, you need at least 3f + 1 total nodes. To tolerate one node lying to you, you need four nodes voting.
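The arithmetic is worth stating directly. A minimal sketch (the function names are mine, not from any library):

```python
def min_nodes(f):
    """Minimum number of voters needed to tolerate f Byzantine
    (arbitrarily faulty) members, per Lamport, Shostak, and Pease (1982):
    n >= 3f + 1."""
    return 3 * f + 1

def max_tolerable_faults(n):
    """Largest f such that n >= 3f + 1."""
    return (n - 1) // 3

print(min_nodes(1))             # → 4 (one liar requires four voters)
print(max_tolerable_faults(3))  # → 0 (a three-model council tolerates none)
```

Note the uncomfortable corollary: a three-model council cannot formally tolerate even one Byzantine member; four is the floor.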
A hallucinating AI model is a Byzantine fault. It produces output that looks valid but contains errors. It does not know it is wrong. It cannot tell you it is wrong. It will defend its output if challenged.
deVadoss and Artzt published “A Byzantine Fault Tolerance Approach towards AI Safety” in 2025, drawing exactly this analogy. In their protocol, one AI module proposes output, the other modules exchange evaluations, and consensus is broadcast only after sufficient confirmations.
The CP-WBFT paper (Confidence Probe-based Weighted BFT) pushed further. Their system achieved superior performance under extreme conditions — 85.7% of agents producing faulty output — by using confidence probing to weight dissenting opinions higher. LLM-based agents demonstrated stronger skepticism when processing erroneous message flows than traditional agents.
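Confidence-weighted voting is easy to sketch. The snippet below is loosely inspired by CP-WBFT's idea of weighting votes by probed confidence; the probe, the weights, and the verdict labels are illustrative, not the paper's protocol.

```python
def weighted_verdict(reviews):
    """reviews: list of (verdict, confidence) pairs, where verdict is
    'approve' or 'reject' and confidence is in (0, 1].
    Each vote counts in proportion to the reviewer's probed confidence."""
    scores = {}
    for verdict, confidence in reviews:
        scores[verdict] = scores.get(verdict, 0.0) + confidence
    return max(scores, key=scores.get)

# Two lukewarm approvals vs one confident dissent: the dissenter wins.
reviews = [("approve", 0.4), ("approve", 0.4), ("reject", 0.9)]
print(weighted_verdict(reviews))  # → reject
```

Under plain majority voting the same three reviews would approve; weighting by confidence lets a single well-grounded dissent override two shrugs.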
That last detail matters. AI models can be structured to be better skeptics than humans, because you can engineer the skepticism into the protocol rather than relying on a reviewer’s attention span at 4pm on a Friday.
The Honest Counter-Argument
The strongest objection to multi-model review comes from Kim, Garg, Peng, and Garg’s 2025 ICML paper, “Correlated Errors in Large Language Models.” Evaluating more than 350 LLMs, they found that when two models both err, they produce the same wrong answer about 60% of the time. Shared architectures and shared providers drive the correlation.
Worse: larger and more accurate models have more correlated errors, even across distinct architectures. The better the model, the more it thinks like every other good model. Training on similar data creates similar blind spots.
This is the algorithmic monoculture problem. Bommasani, Creel, and Ganguli warned at NeurIPS 2022 that sharing training data reliably exacerbates homogenization. Stanford’s HAI group noted that if foundation models are integrated across economic activities, any errors or vulnerabilities can threaten significant economic activity.
Running three copies of GPT-4 is not a council. It’s an echo chamber with a 60% correlation rate on errors.
Why Diversity Is the Prerequisite
The Du et al. result used Gemini-Pro, Mixtral 8x7B, and PaLM 2-M — three architecturally different models from three different training pipelines. That architectural diversity is what made the ensemble work.
The 60% correlated error rate from Kim et al. still leaves 40% of errors where models diverge. For that 40%, majority voting catches what any single model misses. And the CP-WBFT approach specifically addresses correlated failures by weighting dissenting opinions higher — the model that disagrees with the consensus gets more attention, not less.
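A quick simulation makes the stakes of correlation concrete. This is a toy model under stated assumptions — a flat 10% per-model error rate and a single correlation parameter — not a reproduction of the Kim et al. measurements.

```python
import random

def ensemble_error_rate(per_model_err, correlation, trials=100_000, seed=0):
    """Estimate the majority-vote error rate of a three-model council.
    `correlation` is the probability that, given model A errs, models B
    and C err on the same input (a shared blind spot)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        a = rng.random() < per_model_err
        if a and rng.random() < correlation:
            b = c = True          # correlated failure: shared blind spot
        else:
            b = rng.random() < per_model_err
            c = rng.random() < per_model_err
        if a + b + c >= 2:        # majority votes wrong
            errors += 1
    return errors / trials

# Same 10% individual error rate; only the correlation differs.
same_family = ensemble_error_rate(0.10, correlation=0.6)
diverse     = ensemble_error_rate(0.10, correlation=0.0)
print(f"same-family: {same_family:.3f}, diverse: {diverse:.3f}")
```

With independent errors the council's error rate drops well below any single model's; with 60% correlation, most of that gain evaporates. Diversity isn't a nicety, it's the mechanism.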
The practical design rule: never build a council from the same model family. Mix architectures. Mix providers. Mix training data. If you can run a local model via Ollama alongside cloud models from Anthropic and OpenAI, you’ve covered three distinct architectures and three distinct training pipelines.
GuardSpine’s local council supports exactly this pattern. Multi-provider: Ollama for local models where no data leaves your network, plus OpenAI, Anthropic, and OpenRouter. Four consensus types: unanimous (all agree), majority (over 50%), weighted (models scored by domain expertise), and quorum (minimum N must agree, configurable per risk tier).
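The four consensus types reduce to a few lines of voting logic. This sketch is my own illustration of the rules as described, not GuardSpine's actual API; the model names, weights, and thresholds are hypothetical.

```python
def consensus(votes, mode, weights=None, quorum=None):
    """Evaluate approve/reject votes under one of four consensus types.
    `votes` maps model name -> bool (True = approve)."""
    approvals = [m for m, v in votes.items() if v]
    if mode == "unanimous":            # all must agree
        return len(approvals) == len(votes)
    if mode == "majority":             # over 50% by head count
        return len(approvals) > len(votes) / 2
    if mode == "weighted":             # over 50% by domain-expertise weight
        total = sum(weights.values())
        return sum(weights[m] for m in approvals) > total / 2
    if mode == "quorum":               # at least N must agree
        return len(approvals) >= quorum
    raise ValueError(f"unknown mode: {mode}")

votes = {"local-llama": True, "claude": True, "gpt": False}
print(consensus(votes, "unanimous"))        # → False
print(consensus(votes, "majority"))         # → True
print(consensus(votes, "quorum", quorum=2)) # → True
```

The same three votes pass or fail depending on the consensus type — which is exactly why the type should be configurable per risk tier.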
The Cost Question
Three models cost three times as much, and the ICLR 2025 multi-agent debate analysis found that latency grows linearly with the number of sequential debate rounds. Is the improvement worth it?
Two responses.
First, code review is asynchronous. Nobody is waiting in real time for a PR review to complete. Adding 30 seconds of inference time to a review cycle that currently takes 6-13 hours is a rounding error.
Second, the CaMVo paper (NeurIPS 2025) demonstrated adaptive model selection — routing easy cases to cheap models and hard cases to expensive ensembles — cutting cost by 40-50% with only a 0.17% accuracy drop relative to full-ensemble voting. Smart routing makes multi-model review affordable.
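The routing idea fits in a few lines. A sketch in the spirit of adaptive selection, not the CaMVo algorithm itself: here I assume a fast upstream classifier produces a `risk_score`, and the model callables are stand-ins.

```python
def route_review(diff, risk_score, cheap_model, council, threshold=0.5):
    """Send low-risk diffs to a single cheap model; escalate high-risk
    diffs to the full multi-model council. `risk_score` in [0, 1] is
    assumed to come from a fast upstream classifier."""
    if risk_score < threshold:
        return cheap_model(diff)   # one inference call
    return council(diff)           # full ensemble, several calls

# Illustrative stand-ins for real model calls.
cheap = lambda diff: "approve"
full_council = lambda diff: "needs-human-review"

print(route_review("fix typo in README", 0.1, cheap, full_council))
print(route_review("rewrite auth middleware", 0.9, cheap, full_council))
```

Most diffs take the cheap path; the ensemble's cost is paid only where its extra eyes matter.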
The real cost comparison isn’t “one model vs three models.” It’s “three model inferences vs one production incident.” The Faros data shows AI adoption increased bugs per developer by 9% and change failure rates by 30%. A single production incident costs orders of magnitude more than three inference calls.
The Synthesis
Raymond was right about the principle and wrong about the mechanism.
Diverse review catches bugs. That’s empirically supported. But volunteer human eyeballs at scale was never the right delivery mechanism. Wermke showed the optimum is 2-4 reviewers. Heartbleed, Shellshock, Log4Shell, and XZ Utils proved that “many eyes” in open source was aspirational, not actual.
AI councils with Byzantine fault tolerance consensus are the industrial-strength version of “many eyes.” Eyes that never get tired. Eyes that never skip the boring files. Eyes that can be made architecturally diverse by design rather than hoping for it.
The council doesn’t replace human review. It triages. L0-L1 changes get council-only review. L2+ changes get council review AND human review, with the council summarizing what it found so the human knows exactly where to look. The reviewer reads a focused summary, not a 400-line diff.
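The triage policy above can be written down as data. This is a hypothetical policy table — the tier names follow the L0-L2+ scheme, but the consensus choice per tier is my illustrative assumption, not a prescribed standard.

```python
# Hypothetical triage policy: stricter consensus and a human reviewer
# as the risk tier rises.
REVIEW_POLICY = {
    "L0": {"council": "majority",  "human": False},  # docs, comments
    "L1": {"council": "quorum",    "human": False},  # low-risk code paths
    "L2": {"council": "unanimous", "human": True},   # auth, payments, infra
}

def reviewers_for(tier):
    """Return the ordered review steps required for a given risk tier."""
    policy = REVIEW_POLICY[tier]
    steps = [f"council ({policy['council']})"]
    if policy["human"]:
        steps.append("human (reads council summary, not raw diff)")
    return steps

print(reviewers_for("L0"))  # council only
print(reviewers_for("L2"))  # council, then a human with a focused summary
```

Keeping the policy declarative means the risk tiers can be tightened per repo or per directory without touching the review pipeline itself.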
The result: human reviewers spend their attention on the 20% that the council flagged as worth examining. The 80% that’s genuinely low-risk gets processed without burning human cognitive bandwidth.
Linus’s Law, upgraded. Not “many eyes.” The right eyes, structured to disagree, with the math to prove it works even when some of them are wrong.
If your review process can’t keep up with AI-generated code velocity, the answer isn’t more reviewers. It’s smarter review architecture.
I run AI Readiness Sessions where we design multi-model review pipelines matched to your risk profile. No slides, no theory — just a working architecture.