Threat Modeling for AI Governance: What Can Go Wrong
The attack surface of AI-assisted development: prompt injection, model poisoning, rubber-stamp approvals, evidence forgery, and how to defend against each.
If you are using AI to review code, you have a new attack surface. If you are not thinking about that attack surface, you are building governance on a foundation you do not understand. I spent six months threat-modeling GuardSpine’s own architecture before I was comfortable shipping it. Here is what I found.
Threat 1: Prompt Injection via Pull Request
An attacker submits a PR containing carefully crafted text — in a comment, a string literal, or a README change — designed to manipulate the AI reviewer. The injected prompt might instruct the model to approve the PR unconditionally, ignore specific files, or downgrade the risk assessment.
This is not theoretical. Researchers have demonstrated prompt injection attacks against AI code review tools since 2024. A well-crafted injection in a docstring can convince some models that the change is benign when it introduces a backdoor.
GuardSpine’s defense: Multi-model council. A prompt injection that works against Claude may not work against GPT-4o, and vice versa. When multiple models with different architectures review the same change, an injection that fools one model is likely to produce a disagreement with another. Model disagreement triggers escalation to human review.
Additional defense: Isolated prompts. Each AI reviewer receives the diff through a structured template, not raw concatenation. The code content is enclosed in a clearly delimited section with instructions that precede it. The model is told to treat the content section as data, not as instructions. This does not make injection impossible — nothing does — but it raises the bar significantly.
Additional defense: AST pre-screening. Before the AI models see the diff, the AST classifier identifies changes to string literals, comments, and documentation. If a comment change is the only modification in a PR, and the comment contains instruction-like language (“ignore the above,” “approve this change,” “disregard previous instructions”), it is flagged for human review regardless of AI verdict.
Threat 2: Model Poisoning
If your AI reviewer was fine-tuned or trained on poisoned data, its judgments are compromised from the start. A poisoned model might learn to ignore specific vulnerability patterns, approve code from certain authors without scrutiny, or systematically underestimate risk.
This threat applies primarily to organizations that fine-tune their own review models. If you are using foundation models from Anthropic, OpenAI, or Google, the poisoning risk is on the model provider. Their training pipeline security is their problem, and they invest heavily in it.
GuardSpine’s defense: Model diversity. Using multiple models from different providers means a poisoned model is one vote in a council, not the sole decision-maker. If one model consistently disagrees with the others in a specific pattern, the disagreement metric flags it.
Additional defense: No fine-tuning on review outcomes. GuardSpine does not fine-tune models on its own review decisions. The review council uses foundation models with structured prompts. This eliminates the attack surface where a compromised review outcome could influence future reviews through training data contamination.
Threat 3: Rubber-Stamp Approvals
The most common governance failure is not an attack — it is apathy. A human reviewer receives a governance notification, glances at the summary, and clicks approve without reading the details. The evidence bundle shows an approval, but the approval was meaningless.
This is the same problem that existed before AI governance. The PR approval button has always been one click. What changes with AI governance is that we can detect rubber-stamping.
GuardSpine’s defense: Approval quality signals. Time-to-approval is tracked. If a reviewer approves an L3 change within 30 seconds of receiving the notification, that approval is flagged as suspicious. The evidence bundle records the time delta. An auditor can see that the approval was faster than the time it would take to read the review summary.
Additional defense: Required comments on escalated changes. For L3+ changes, an approval without a reviewer comment is treated as incomplete. The reviewer must write at least one substantive comment explaining why they approve. “LGTM” does not meet the threshold. This forces engagement with the material.
Additional defense: Rotation. GuardSpine can require that the same reviewer does not approve consecutive PRs from the same author. Rotation prevents the pattern where two developers rubber-stamp each other’s work.
Threat 4: Evidence Forgery
An attacker with access to the evidence bundle storage could modify a bundle after it was created — changing a rejection to an approval, removing a security finding, or altering the risk tier. If the governance record can be falsified, the entire system is worthless.
GuardSpine’s defense: Cryptographic sealing. Every evidence bundle includes a hash chain and cryptographic signatures. Modify any byte in the bundle, and the hash chain breaks. The verification algorithm detects tampering at any point in the chain. I wrote about this in detail in Evidence Bundles Explained.
Additional defense: Signature verification. Even if an attacker recomputes the hash chain after modification, they cannot forge the cryptographic signature without the signing key. The signing key is stored in a hardware security module (HSM) or a secrets manager, not in the bundle storage.
Additional defense: Immutable storage. Evidence bundles are written once and never modified. The storage backend (S3 with object lock, or equivalent) prevents deletion or modification of existing bundles. You can verify that a bundle exists and has not been tampered with independently of GuardSpine.
Threat 5: Policy Manipulation
If an attacker can modify the governance policy, they can weaken protections for their future PRs. Lower the risk threshold, exclude their target files from review, or remove required reviewers.
GuardSpine’s defense: Policy changes are L4. Any change to the organization’s governance policy repository is automatically classified as L4, the highest risk tier. It requires security team review and generates its own evidence bundle. You cannot silently weaken governance.
Additional defense: Policy pinning. GuardSpine supports policy version pinning. A repository can pin to a specific version of the org policy, and the pin itself is a governed change. Unpinning requires the same L4 review as a policy modification.
Threat 6: Denial of Governance
Instead of subverting governance, an attacker prevents it from running. They might create PRs that crash the AST parser, submit diffs too large for the AI models to process, or flood the review queue to cause timeouts.
GuardSpine’s defense: Fail-closed. If governance fails to run — for any reason — the default behavior is to block the merge. A PR that crashes the parser does not get merged without review. It gets flagged for manual investigation.
This is a critical design choice. Many governance tools fail open: if the check does not complete, the PR is treated as approved. That is backwards. An incomplete review is not an approval. It is a gap.
Additional defense: Input validation. Diffs above a configurable size limit (default: 10,000 lines) are handled differently. Instead of sending the entire diff to AI models, GuardSpine identifies the behavioral changes via AST analysis and sends only those for review. The cosmetic and structural changes are classified but not AI-reviewed. This prevents large diffs from causing timeouts.
Additional defense: Rate limiting. A single author who opens 50 PRs in an hour triggers rate limiting. The PRs are queued, not rejected, and processed at a sustainable rate. This prevents both accidental flooding (a runaway bot) and intentional denial of service.
The Meta-Threat: Trusting Governance Itself
The deepest threat is epistemological. If you trust the governance system implicitly, you have replaced one single point of failure (the human reviewer) with another (the AI governance system). A system that cannot be questioned is a system that cannot be improved.
GuardSpine addresses this through two design principles:
Offline verifiability. Anyone can verify an evidence bundle without trusting GuardSpine. The verification algorithm is public. The hash chain math is standard. The signatures use standard cryptographic algorithms. If you do not trust our verifier, write your own.
Observable governance. Every metric GuardSpine tracks — coverage, completeness, disagreement, escalation, drift — is exposed through APIs and dashboards. You can audit the governance system with the same rigor you apply to production services. If the governance system is behaving anomalously, the metrics will show it.
Building Your Own Threat Model
Every organization should threat-model their governance system, not just use someone else’s threat model. Your attack surface is different from mine. Your threat actors are different. Your risk tolerance is different.
Start with these questions:
- Who has access to modify governance policy? Is that access logged and governed?
- What happens when governance fails to run? Fail-open or fail-closed?
- How would you detect a compromised AI model producing biased reviews?
- Can an insider modify an evidence bundle after creation? How would you know?
- What is your response plan if governance is bypassed for a production deployment?
If you cannot answer all five with confidence, your governance system has gaps. That is normal. The point of threat modeling is to find gaps before attackers do.
Book a call if you want to walk through your governance threat model. I have done this exercise with security teams at companies from 20 to 2,000 engineers.