A Biologist's Risk Model for AI Governance
Why risk tiers, not tools -- and how biological safety levels led me to L0-L4.
I’m a biologist by education. In a biology lab, you don’t review every sample the same way. A BSL-1 pathogen sits on an open bench. A BSL-4 pathogen requires a pressure suit and an airlock. The containment matches the risk. When I looked at AI code review, I saw every PR getting BSL-1 treatment — including the ones that should have been BSL-4.
The System That Actually Works
The Biosafety Level system has been reducing lab-acquired infections for over 50 years. The CDC codified it in the “Biosafety in Microbiological and Biomedical Laboratories” manual, now in its 6th edition. The WHO adopted it globally.
The results are measurable. A Lancet Microbe scoping review found 309 laboratory-acquired infections and 16 pathogen escapes worldwide between 2000 and 2021 — a dramatic decline from the roughly 4,000 occupational infections reported by 1976, before BSL levels were standardized. The number of labs increased massively over the same period, while incident rates dropped by an order of magnitude.
The design insight that matters: BSL-1 requires basic hand-washing and open bench work. BSL-4 requires dedicated buildings, negative-pressure rooms, and decontamination of all materials on exit. The system doesn’t treat everything as maximum risk. It reserves expensive controls for the cases that actually need them.
About 70% of the incidents that still occur are attributable to human error — people not following procedures for the tier they’re working in. The tiering model itself holds up. The failures come from people ignoring the tier.
The Flat-Priority Problem
Software review has no tiers. Every pull request looks the same in the queue. A whitespace fix and a security-critical authentication change get the same approval workflow: someone clicks a green button.
The data on what this causes is brutal.
SmartBear’s canonical study of 2,500 code reviews across 3.2 million lines at Cisco found that 61% of reviews uncovered zero defects. Defect detection drops sharply after 200-400 lines. Effectiveness collapses after 60-90 minutes of continuous review.
In security operations, the pattern is identical. A 2024 survey found that 62% of security alerts are entirely ignored. 40% are never investigated. 61% of teams admitted to ignoring alerts that later proved critical. The average organization receives 960 alerts daily from 28 different tools. Only 3% need immediate action.
That’s the flat-priority failure mode. When everything has the same urgency, nothing does. Reviewers develop what the research calls “approval fatigue” — the cognitive equivalent of alarm fatigue in hospitals, where nurses hear so many beeping monitors that they stop hearing them.
Now pour AI-generated code into this already-broken system. The Faros AI Productivity Paradox report, drawn from telemetry of 10,000+ developers across 1,255 teams, found that high AI adoption increased PRs merged by 98% — but PR review time increased by 91%. Bugs per developer increased by 9%. Average PR size increased by 154%.
More code. Same reviewers. Longer reviews. More bugs. That’s the bottleneck Amdahl’s Law predicted.
L0 Through L4: The BSL Model for Code
The pattern I keep encountering across industries that handle risk well is the same: triage before review. Route by severity. Reserve expensive controls for what actually needs them.
Aviation does this. A-checks (every 400-600 flight hours) are visual inspections requiring 50-70 man-hours. D-checks (every 6-10 years) are complete aircraft disassembly requiring up to 50,000 man-hours. The commercial aviation fatality rate is 0.07 per million flights.
Nuclear safety does this. The NRC’s Significance Determination Process uses a four-color system. About 97% of inspection findings are Green — important to correct but low significance. Less than 1% are Yellow or Red. The expensive regulatory response is reserved for the 1% that warrants it.
Financial trading does this. Pre-trade blocks auto-reject orders exceeding thresholds. Circuit breakers halt individual stocks at moderate moves. Market-wide halts trigger only at dramatic index drops. The containment scales with the severity.
So I built the same model for AI-governed artifacts:
L0 (Trivial): Cosmetic changes — typos, formatting, whitespace. No review required. Auto-approve. This is BSL-1. Open bench.
L1 (Low risk): AI review only. The council evaluates it; no human action is needed. Most changes fall here.
L2 (Medium risk): AI review plus human notification. Human sees a summary, can intervene if something looks wrong.
L3 (High risk): AI review plus required human approval. Change blocks until a human signs off. This is where authentication changes, security-critical logic, and financial calculations land.
L4 (Critical): Multi-reviewer approval, full evidence bundle, complete audit trail. This is BSL-4. Pressure suit and airlock.
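The tier ladder above is, at its core, a small lookup table: each tier maps to a set of controls, and a change merges only when its tier's requirements are met. Here is a minimal sketch of that mapping in Python. The names (`RiskTier`, `CONTROLS`, `can_auto_merge`) and the exact notification and approval counts are illustrative assumptions, not a real GuardSpine API.

```python
from enum import IntEnum

class RiskTier(IntEnum):
    L0 = 0  # trivial: auto-approve
    L1 = 1  # low: AI review only
    L2 = 2  # medium: AI review + human notification
    L3 = 3  # high: AI review + required human approval
    L4 = 4  # critical: multi-reviewer approval + full audit trail

# Illustrative controls per tier:
# (human approvals required, notify a human, full evidence bundle)
CONTROLS = {
    RiskTier.L0: (0, False, False),
    RiskTier.L1: (0, False, False),
    RiskTier.L2: (0, True,  False),
    RiskTier.L3: (1, True,  False),
    RiskTier.L4: (2, True,  True),
}

def can_auto_merge(tier: RiskTier, human_approvals: int) -> bool:
    """A change merges only once its tier's approval requirement is met."""
    required, _, _ = CONTROLS[tier]
    return human_approvals >= required
```

The point of the table layout is that adding a tier or tightening a control is a data change, not a workflow rewrite — the same property that lets the CDC revise BSL guidance without redesigning labs.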
Why Triage Changes Everything
The math is straightforward. If 80% of changes are L0-L1 and can be handled without human intervention, you’ve just freed 80% of your review capacity for the 20% that actually matters.
The Cisco data says 61% of reviews find nothing. The NRC data says 97% of findings are Green. The security data says only 3% of alerts need immediate action. Every domain that measures this arrives at the same distribution: most items are low-risk, and expensive review of low-risk items burns capacity you need for the high-risk ones.
Without triage, reviewers treat a formatting fix with the same attention as a privilege escalation. They can’t — their brains won’t sustain that. So they calibrate down. Everything gets the fast glance. Including the things that shouldn’t.
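The capacity argument can be made concrete with back-of-envelope numbers. The sketch below compares flat review (every change gets the same human attention) against tiered review, using a hypothetical queue of 1,000 changes per month. The tier shares loosely echo the distributions cited above (most items low-risk); every number is illustrative, not measured.

```python
# Hypothetical monthly queue: 1,000 changes, 15 minutes of human
# attention per change under flat review. All figures are illustrative.
changes = 1000
flat_minutes_per_change = 15

# Assumed tier distribution (most changes low-risk) and the human
# minutes each tier receives under triage. Note L3/L4 get MORE
# attention than the flat 15 minutes, not less.
tier_share   = {"L0": 0.30, "L1": 0.50, "L2": 0.12, "L3": 0.06, "L4": 0.02}
tier_minutes = {"L0": 0,    "L1": 0,    "L2": 3,    "L3": 45,   "L4": 120}

flat_cost = changes * flat_minutes_per_change
tiered_cost = sum(changes * share * tier_minutes[tier]
                  for tier, share in tier_share.items())

print(f"flat:   {flat_cost / 60:.0f} reviewer-hours")    # 250 hours
print(f"tiered: {tiered_cost / 60:.0f} reviewer-hours")  # 91 hours
```

Under these assumptions, tiering cuts total reviewer time by roughly two-thirds while tripling to octupling the attention each high-risk change receives. That is the trade triage buys: less total effort, more effort where it counts.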
The Counter-Arguments (And They’re Real)
The strongest objection to risk tiering: what if the classifier gets it wrong? What if a critical change gets mislabeled L0 and auto-approves?
This is a legitimate concern, and the EU AI Act is a cautionary tale. The European Center for Not-for-Profit Law documented that the Act includes a “self-exemption” loophole — providers can unilaterally decide their system doesn’t pose significant risk. Strong lobbying resulted in overreliance on self-certification with weak oversight.
The Oxford Martin AI Governance Initiative published a research memo in June 2025 warning that risk tiers must be dynamically reassessed — a system that was low-risk at deployment may become high-risk as it scales or gets repurposed.
In third-party risk management, nearly 40% of businesses neglect regular risk reassessment. Tier classifications go stale. Only 34% of enterprises maintain a comprehensive vendor ledger.
These failures share a common root: the entity being governed gets to choose its own risk tier. Self-certification degrades to theater.
The BSL system works because the classification is external. A pathogen’s risk level is determined by published evidence about transmission, lethality, and available treatments — not by the researcher who wants to work with it. The containment decision is made by the institution’s biosafety committee, not by the person submitting the sample.
GuardSpine follows this model. The risk classifier is independent. It analyzes the diff itself — what files changed, what functions were modified, what the semantic impact is. The developer who submitted the change doesn’t get to override the tier. L3 and L4 changes block until appropriate review happens, regardless of who submitted them.
Why BSL, Not CVSS?
The other objection: existing security frameworks already do risk rating. CVSS scores vulnerabilities from 0-10. STRIDE categorizes threats. OWASP has its own risk rating methodology.
These frameworks are designed for security professionals evaluating known vulnerabilities. They’re precise, quantitative, and require specialized knowledge to apply.
The BSL model communicates to everyone. When I tell an engineering manager “this change is BSL-4,” they immediately understand it means maximum containment, maximum review, maximum caution. When I tell them “this is CVSS 9.1,” they reach for the documentation to figure out what that means.
The Oxford Martin AIGI memo concludes that AI governance should build on existing standards from aviation, energy, and finance rather than creating entirely new frameworks. The BSL metaphor isn’t replacing CVSS. It’s making the same principle accessible to the people who approve budgets and set policy — people who may never read a CVSS calculator but who instantly grasp “some changes need pressure suits.”
What This Looks Like in Practice
A developer opens a PR. GuardSpine’s classifier analyzes the diff before any human sees it.
The classifier checks: What files changed? Are they in security-critical paths? Do the changes touch authentication, authorization, or financial logic? What’s the blast radius if this change is wrong? Has the test coverage changed? Are there new dependencies?
Based on this analysis, the change gets a tier. L0 changes auto-merge with no review. L1 changes get council review and then auto-merge. L2 changes get council review plus a human notification with a summary of what the council found worth flagging. L3-L4 changes block until the required number of human reviewers sign off.
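To make the routing concrete, here is a minimal rule-based sketch of a tier classifier. This is not GuardSpine's actual classifier — the path patterns, thresholds, and function names are assumptions for illustration, and a real system would also weigh semantic impact, blast radius, and test coverage, as described above.

```python
import re

# Hypothetical patterns marking security-critical paths; a real
# classifier would use far richer signals than filenames.
CRITICAL_PATHS = re.compile(r"(auth|authz|payment|billing|crypto)", re.I)
COSMETIC_EXTS = (".md", ".txt")

def classify(changed_files, lines_changed, adds_dependency):
    """Assign a risk tier from facts of the diff alone."""
    # Touching authentication, authorization, or financial logic
    # lands in L3; doing so while adding a dependency lands in L4.
    if any(CRITICAL_PATHS.search(f) for f in changed_files):
        return "L4" if adds_dependency else "L3"
    if adds_dependency:
        return "L2"
    # Small documentation-only diffs are cosmetic: L0, auto-approve.
    if all(f.endswith(COSMETIC_EXTS) for f in changed_files) and lines_changed < 20:
        return "L0"
    # Large diffs get a human notification; the rest are AI-review-only.
    return "L2" if lines_changed > 400 else "L1"

# Note what is absent from the signature: the submitter. The tier is a
# function of the diff, so it cannot be self-selected -- mirroring the
# external classification that makes the BSL system work.
```

The design choice worth copying is the last comment: like a biosafety committee, the classifier takes no input from the person requesting the work, which is exactly the property the EU AI Act's self-exemption loophole gives away.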
The result: reviewers spend their attention budget where it matters. The 61% of reviews that found nothing at Cisco? Most of those were probably L0-L1 changes that didn’t need a human in the loop at all.
If your team is drowning in review queues and rubber-stamping changes they can’t meaningfully evaluate, the problem isn’t the reviewers. It’s the lack of triage.
I run AI Readiness Sessions where we assess your current review bottleneck and design a risk-tiered governance model that matches your domain. No slides, no theory — just a working plan.