Vibe Coding Broke Code Review and Nobody Noticed
When AI writes 80% of the PR, what exactly is the reviewer approving? The structural mismatch between AI velocity and human review capacity is the quiet crisis nobody's talking about.
I was reviewing a pull request — 400 lines, mostly AI-generated — and I caught myself skimming. Not because I was lazy. Because the diff was too big to hold in my head, the tests passed, and the CI was green.
I clicked approve.
Then I stopped and thought: what did I just approve?
The Pattern I Keep Encountering
When I audit AI implementations at companies using Copilot or Cursor, I see the same sequence play out:
- Team adopts AI coding tools
- PR throughput triples
- Review quality stays flat (or drops)
- Someone asks “how did this payment logic change get approved?”
- Engineering points to a green checkmark
That checkmark proves exactly one thing: someone clicked a button.
Here’s what happened in the twelve months before that PR I rubber-stamped. My team adopted Cursor. Throughput tripled. PRs got bigger. Reviews got faster — but not because reviewers got better. They got faster because reviewers started skimming.
Nobody said it out loud. Nobody wrote a memo. The incentive structure just shifted. When three PRs land in your queue before lunch and each one is 300+ lines of code you didn’t write, your brain does triage. You check the test output. You scan for obvious patterns. You click the green button.
This isn’t a discipline problem. It’s a capacity problem.
Microsoft Research published a study in 2022 showing that code review effectiveness drops significantly after 200-400 lines of code — reviewers miss more defects, spend less time per line, and approval becomes increasingly mechanical. That study was conducted before AI coding assistants went mainstream. The average PR size in Copilot-assisted codebases is now 2-3x what it was in 2022.
The human review process was designed for an era when developers wrote 50-100 lines of meaningful code per day. AI assistants generate that in seconds. We’re running a 2015 quality inspection process on a 2026 production line.
What Actually Slips Through
I started keeping a list. Over six weeks, I tracked the patterns that made it past review in codebases I was auditing. The categories were consistent:
Subtle logic errors. AI generates code that looks correct — variable names make sense, structure follows convention — but the branching logic has an off-by-one or a missing null check that only fires under specific conditions. These pass a quick visual scan. They even pass most unit tests, because the tests were also AI-generated from the same misunderstanding.
I saw this exact pattern cause a $47,000 billing error at a fintech client. The AI wrote a date range calculation that looked reasonable. It passed 14 unit tests. It was off by one day in leap years. Nobody caught it for three billing cycles.
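The shape of that bug is easy to reconstruct. The sketch below is hypothetical, not the client's actual code: the AI-written version treats "one billing year" as a fixed 365 days, which drifts by a day whenever the period spans Feb 29.

```typescript
const MS_PER_DAY = 86_400_000;

// Buggy: "one year later" computed as start + 365 days.
function periodEndBuggy(start: Date): Date {
  return new Date(start.getTime() + 365 * MS_PER_DAY);
}

// Fixed: advance the calendar year and let Date handle the leap day.
function periodEndFixed(start: Date): Date {
  const d = new Date(start.getTime());
  d.setUTCFullYear(d.getUTCFullYear() + 1);
  return d;
}
```

For a start date in 2023 the two functions agree exactly, which is why a test suite generated from the same misunderstanding goes green: every generated test happened to use a range that never crossed a leap day.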
Hallucinated API calls. The model invents a method signature that doesn’t exist in the version of the library you’re running. It compiles because TypeScript’s structural typing is permissive. It passes tests because the mocks match the hallucination. It fails in production at 2 AM.
I keep a running list of these. response.data.items.filter() when the actual API returns response.items. moment.parseZone() with parameters that work in v2.x but not the v2.29.x you’re actually running. axios.create({ timeout: '30s' }) when the timeout parameter expects milliseconds.
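Here is why the test suite can't save you. The types and field names below are illustrative, not a real client library: the model coded against `response.data.items`, the real service returns `response.items`, and the mock encodes the same misunderstanding as the handler.

```typescript
interface AssumedResponse { data: { items: string[] } } // the hallucination
interface ActualResponse { items: string[] }            // what the API returns

// AI-generated handler, typed against the hallucinated shape. It compiles.
function activeItems(response: AssumedResponse): string[] {
  return response.data.items.filter((item) => item.length > 0);
}

// AI-generated test: the mock matches the hallucination perfectly, so the
// suite goes green.
const mock: AssumedResponse = { data: { items: ["a", ""] } };
console.assert(activeItems(mock).length === 1);

// Production payload: `response.data` is undefined. TypeError at 2 AM.
const real: ActualResponse = { items: ["a"] };
// activeItems(real as unknown as AssumedResponse); // throws at runtime
```

The mock and the handler were generated from the same wrong mental model, so they validate each other. Only the real payload disagrees, and the real payload never shows up in CI.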
Security regressions. An AI assistant “helpfully” refactors an auth middleware and removes a rate limit check because it wasn’t part of the prompt context. The diff looks clean. The reviewer doesn’t know the rate limit existed because they didn’t write the original code.
The biotech VPs I talk to are particularly worried about this one. Their codebases have compliance-critical guards that look like ordinary if-statements. AI assistants don’t know the difference between a performance optimization and a regulatory requirement. Neither does a reviewer who’s skimming.
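The pattern fits in a toy example. No real framework here, and the names and the limit of 100 are invented for illustration:

```typescript
type Req = { ip: string; token?: string };

const requestCounts = new Map<string, number>();

function authMiddleware(req: Req): boolean {
  // Reads like a throwaway counter; is actually the abuse/compliance guard.
  const count = (requestCounts.get(req.ip) ?? 0) + 1;
  requestCounts.set(req.ip, count);
  if (count > 100) return false; // the line a prompt-blind refactor drops

  return req.token !== undefined; // the part the AI was asked to "clean up"
}
```

Nothing in the code marks the rate-limit branch as load-bearing. A model asked to refactor "the auth check" has no signal that deleting it changes anything but style, and a reviewer who didn't write the original has no reason to notice it's gone.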
Dependency confusion. The model suggests a package name that’s close to a real one but isn’t. Or it pins a version that has a known CVE. The reviewer doesn’t check because the package.json change was buried in a 600-line diff.
In one audit, I found a package.json that referenced lodash-es (legitimate), loadash (typosquat, now flagged as malicious), and lodash.debounce (legitimate but pinned to a version with a prototype pollution vulnerability). All three were added in AI-assisted PRs. All three were approved.
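The cheapest version of this is catchable mechanically. A sketch, assuming you maintain an allowlist of packages your team actually depends on: flag any new dependency that sits one edit away from an allowlisted name without being on the list itself.

```typescript
// Levenshtein distance — standard DP, nothing project-specific.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// "loadash" is one edit from "lodash" and not on the list: flag it.
function looksTyposquatted(dep: string, allowlist: string[]): boolean {
  if (allowlist.includes(dep)) return false;
  return allowlist.some((known) => editDistance(dep, known) === 1);
}
```

This catches typosquats, not CVE-pinned versions of legitimate packages; for those you still need a vulnerability audit step in CI.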
Coverage theater. AI generates tests that achieve 95% line coverage but test nothing meaningful. The assertions check that functions return something, not that they return the right thing. The coverage report is green. The code is untested in any meaningful sense.
The pattern looks like this: expect(result).toBeDefined() instead of expect(result.total).toBe(150.00). The AI doesn’t know what the correct value should be, so it asserts existence. Existence isn’t correctness.
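Here is the same function under both kinds of test. The invoice math is made up; the assertion shapes are the ones I keep finding:

```typescript
function invoiceTotal(items: { price: number; qty: number }[]): number {
  return items.reduce((sum, item) => sum + item.price * item.qty, 0);
}

const result = invoiceTotal([{ price: 50, qty: 3 }]);

// Coverage theater: 100% line coverage, near-zero defect detection.
// This passes even if the function returns 0, -150, or the wrong sum.
if (result === undefined) throw new Error("toBeDefined-style check");

// A real assertion encodes the expected value and fails if the math is wrong.
if (result !== 150) throw new Error(`expected 150, got ${result}`);
```

The coverage report can't tell these two apart. Both execute every line of `invoiceTotal`; only the second one knows what the answer should be.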
Every one of these passed human review. Not because the reviewers were bad. Because the volume and velocity made real review impossible.
The Math Nobody Wants to Do
Here’s the uncomfortable calculation.
If AI writes 80% of a PR's code, and the reviewer meaningfully reads 20% of the diff (which is generous — the Microsoft study found reviewers spend about 60 seconds per 200 lines on average), then only 0.2 × 0.8 = 16% of the new code is getting a genuine human read. You're governing 16% of your codebase through human review.
That’s not a rounding error. That’s a structural gap.
A developer at a major streaming company shipped a vibe-coded change that cost $300,000 per day for 10 days. The approval? Someone clicked a green button. The post-mortem found that the reviewer spent under 90 seconds on a 1,200-line diff. Nobody blamed the reviewer. The review process was never designed for diffs that size at that frequency.
I’ve started pulling this data for clients. The pattern is consistent:
- Average time-to-approve for PRs over 200 lines: 4-8 minutes
- Approval rate for AI-assisted PRs: 94%
- Approval rate for human-written PRs: 87%
- Lines changed per review comment: 180-250 (should be 50-80)
AI-assisted PRs have a higher approval rate than human-written PRs. Not because the code is better. Because the diffs are larger and reviewers have less context per line. The green button gets clicked faster when there’s more code to review, not slower.
This Isn’t a Tooling Problem
The instinct is to fix this with better tools. Better diff viewers. AI-assisted review summaries. Smarter linters.
Those help at the margins. They don’t fix the structural issue.
The structural issue is this: the production rate of code now exceeds the human capacity to verify it.
No amount of tooling changes that equation. The factory got faster. The quality inspection line didn’t.
When I realized this, I stopped thinking about better review tools and started thinking about a different question: what if the review process itself needs to change?
Not “make humans review faster.” Not “add more reviewers.” Instead: triage before review. Classify changes by risk. Route low-risk changes through automated validation. Reserve human attention for the changes that actually need human judgment.
This is the insight that eventually led me to build GuardSpine. But before the product, there was the pattern. And the pattern came from an unexpected place: biology.
Why I Thought About This Like a Biologist
I’m a biologist by education. Molecular biology, specifically. And in a biology lab, you don’t handle every sample the same way.
A BSL-1 pathogen — E. coli K-12, the workhorse of molecular biology — sits on an open bench. Standard PPE. No special containment. The risk is minimal.
A BSL-4 pathogen — Ebola, Marburg — requires a pressure suit, an airlock, chemical showers on exit, and dedicated air handling. The containment matches the hazard.
This isn’t bureaucracy. It’s resource allocation. You don’t put every sample through BSL-4 protocols because you’d never get anything done. You don’t put Ebola on an open bench because someone dies.
When I looked at code review through that lens, I saw every PR getting BSL-1 treatment. Whitespace changes. Config tweaks. Cosmetic formatting. All getting the same review process as payment logic rewrites and auth middleware changes. The same green button. The same 4-minute review window. The same skimming.
We were treating typo fixes and security-critical changes identically. And then wondering why security-critical changes were getting missed.
That’s when I started thinking about risk tiers for code changes. L0 through L4. Trivial through critical. Match the review intensity to the hazard level.
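The shape of the idea fits in a dozen lines. The path patterns and thresholds below are placeholders I'm using for illustration, not actual tiering rules:

```typescript
type Tier = 0 | 1 | 2 | 3 | 4;

// Assign a risk tier from what a change touches and how big it is.
function riskTier(paths: string[], linesChanged: number): Tier {
  const critical = /auth|payment|billing|crypto|migration/;
  const config = /\.(ya?ml|json|toml)$/;
  if (paths.some((p) => critical.test(p))) return 4; // full human + adversarial review
  if (paths.some((p) => config.test(p))) return 2;   // automated checks + spot review
  if (paths.every((p) => /\.(md|txt)$/.test(p))) return 0; // docs: auto-merge
  return linesChanged > 400 ? 3 : 1;                 // size alone escalates the tier
}
```

The point isn't the specific rules; it's that a payment-logic rewrite and a README typo stop sharing a review queue.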
But that’s the next post in this series.
What You Can Check Right Now
Before I built anything, I ran a simple analysis on my own repos. You can do the same.
Pull your team’s PR metrics for the last 90 days. If you’re on GitHub Enterprise, this query gets you started:
SELECT
  pr.number,
  pr.additions + pr.deletions AS lines_changed,
  TIMESTAMPDIFF(MINUTE, pr.created_at, pr.merged_at) AS minutes_to_merge,
  (SELECT COUNT(*) FROM review_comments WHERE pull_request_id = pr.id) AS comment_count
FROM pull_requests pr
WHERE pr.merged_at > DATE_SUB(NOW(), INTERVAL 90 DAY)
ORDER BY lines_changed DESC;
Look at three numbers:
- Average time-to-approve for PRs over 200 lines. If it's under 10 minutes, your reviewers are skimming. If it's under 5 minutes, they're not reading at all.
- Approval rate for AI-assisted PRs vs human-written PRs. If AI-assisted PRs have a higher approval rate, the review process isn't catching the difference. (You can identify AI-assisted PRs by looking for Copilot/Cursor commit signatures or the telltale patterns in commit messages.)
- Lines changed per review comment. Healthy teams comment every 50-80 lines. If your ratio is 200+, reviewers are doing a cursory scan, not a review.
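If you'd rather compute these from an export than eyeball query output, the arithmetic is small. Field names here are assumptions about whatever your export produces:

```typescript
interface PRRecord {
  linesChanged: number;
  minutesToMerge: number;
  commentCount: number;
  approved: boolean;
  aiAssisted: boolean;
}

function reviewSignals(prs: PRRecord[]) {
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / Math.max(xs.length, 1);
  const approvalRate = (xs: PRRecord[]) =>
    xs.filter((p) => p.approved).length / Math.max(xs.length, 1);
  const big = prs.filter((p) => p.linesChanged > 200);
  return {
    avgMinutesBigPRs: avg(big.map((p) => p.minutesToMerge)),
    aiApprovalRate: approvalRate(prs.filter((p) => p.aiAssisted)),
    humanApprovalRate: approvalRate(prs.filter((p) => !p.aiAssisted)),
    linesPerComment: avg(prs.map((p) => p.linesChanged / Math.max(p.commentCount, 1))),
  };
}
```

Run it over 90 days of merged PRs and compare the three outputs to the thresholds above.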
Most teams I work with score badly on all three. Not because they’re careless. Because the system they’re operating in was built for a different velocity.
Where This Goes
PR approval proves someone clicked a button. It doesn’t prove they understood what they were approving.
That gap between “approved” and “understood” is where the failures live. It’s where the $300k/day incidents come from. It’s where the audit findings land. And it’s widening every quarter as AI writes more of the code.
I spent the next year building a system to close that gap. It starts with risk tiers that classify changes by hazard level. It adds a council of AI reviewers that argue with each other — because if one model hallucinates, the others catch it. And it produces tamper-evident evidence bundles that prove what was reviewed and why, verifiable by anyone without trusting my infrastructure.
But the first step was admitting the problem: vibe coding broke code review, and nobody noticed because the green buttons kept turning green.
If your team reviews AI-generated code and you’re not sure what’s slipping through, I’ve been running this analysis for teams for six months. The patterns are consistent. The fixes are tractable. Let’s compare notes.
This is Part 1 of the Origin Story series — how I went from catching myself rubber-stamping a PR to building work provenance infrastructure.
Next: Everything Is a Diff — why the code review crisis is just the beginning. PDFs, spreadsheets, contracts, slide decks — AI is rewriting all of them, and nobody’s diffing any of it.