Risk Scoring for Pull Requests: Beyond Pass/Fail

Binary pass/fail code review tells you nothing about how risky a change actually is. GuardSpine's multi-dimensional risk scoring gives you a number you can act on, trend over time, and defend to auditors.

Pass or fail. Approve or request changes. Green check or red X. Every code review system in widespread use reduces a complex risk assessment to a binary outcome.

That binary tells you almost nothing useful. A PR that “passed” might have scraped by with one minor concern overlooked. A PR that “failed” might have a single formatting issue in an otherwise excellent change. The binary hides the information you actually need to make decisions.

GuardSpine replaces pass/fail with multi-dimensional risk scoring. Here is why that matters and how it works.

The Problem With Binary

Binary review outcomes create three failure modes:

False confidence. A PR passes review. The team assumes it is safe. But the “pass” meant the reviewer found no blocking issues — not that the change carried zero risk. A payment processing change with a risk score of 0.38 and a documentation update with a score of 0.02 both “pass,” but they are not remotely equivalent.

Meaningless rejection. A PR fails review. The developer has to fix something. But what? A single finding might be a style nit or a critical vulnerability. Binary outcomes do not distinguish between “you forgot a semicolon” and “this creates a path traversal vulnerability.” The developer has to parse the comments to figure out which it is.

No trend visibility. Over time, are your PRs getting riskier or safer? Binary outcomes cannot answer this. You have a count of passes and fails, but no signal about the distribution of risk across your changes. You cannot see if your team is gradually drifting toward riskier patterns until something breaks.

Multi-Dimensional Scoring

GuardSpine computes risk scores across four dimensions for every PR:

Security risk (0.0 - 1.0). Measures exposure to security vulnerabilities: injection attacks, authentication bypasses, authorization gaps, cryptographic misuse, secrets exposure. A change that introduces a new SQL query scores higher than one that modifies a UI label.

Data risk (0.0 - 1.0). Measures exposure to data integrity and privacy concerns: PII handling, database schema changes, data transformation logic, backup and recovery implications. A migration that alters a user table scores higher than one that adds an analytics event.

Operational risk (0.0 - 1.0). Measures exposure to availability and performance concerns: resource consumption changes, error handling modifications, deployment configuration, dependency updates with breaking changes. A change to your connection pool configuration scores higher than a change to your test suite.

Compliance risk (0.0 - 1.0). Measures relevance to regulatory and policy requirements: changes to audit-logged operations, consent flows, data retention logic, access control rules. A change to your GDPR data deletion handler scores higher than a change to your internal admin dashboard.

Each dimension is scored independently. A change can carry high security risk but low data risk, or high compliance risk but low operational risk. The dimensions give you a profile of where the risk actually lives.
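As an illustration, a per-PR risk profile can be represented as a simple value object. The Python shape below is a sketch; the field names and example scores are hypothetical, not GuardSpine's actual API:

```python
from dataclasses import dataclass

@dataclass
class RiskProfile:
    """One PR's four-dimension risk profile; each score is 0.0 - 1.0."""
    security: float
    data: float
    operational: float
    compliance: float

    def dimensions(self) -> dict:
        # Expose the four scores by name for aggregation and display.
        return {
            "security": self.security,
            "data": self.data,
            "operational": self.operational,
            "compliance": self.compliance,
        }

# A schema migration touching a user table: high data risk, some
# compliance relevance, little security or operational exposure.
migration = RiskProfile(security=0.15, data=0.62, operational=0.20, compliance=0.40)
```

Keeping the dimensions independent like this is what makes the profile useful: the same composite score can come from very different shapes, and the shape tells you where to look.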

How Scores Are Computed

Each AI reviewer on the council produces per-dimension scores based on its assessment. The scores are then aggregated:

dimension_score = weighted_mean(reviewer_scores) * severity_multiplier

The weighted_mean weights each reviewer's score by its confidence and coverage. A model that evaluated three security-specific rules contributes more to the security dimension than one that evaluated only general coding standards.

The severity_multiplier escalates the score based on finding severity. A single critical-severity finding in any dimension floors the score at 0.7, regardless of the mean. This prevents averaging away dangerous signals.
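The per-dimension aggregation above can be sketched in Python. The weighting and the 0.7 critical floor are as described; the specific multiplier values are illustrative assumptions:

```python
# Illustrative severity multipliers; the actual mapping is an assumption.
SEVERITY_MULTIPLIER = {"low": 1.0, "medium": 1.1, "high": 1.25, "critical": 1.5}

def dimension_score(reviewer_scores, weights, max_severity):
    """weighted_mean(reviewer_scores) * severity_multiplier, with a
    floor of 0.7 when any finding in the dimension is critical."""
    mean = sum(s * w for s, w in zip(reviewer_scores, weights)) / sum(weights)
    score = min(mean * SEVERITY_MULTIPLIER[max_severity], 1.0)
    if max_severity == "critical":
        # A single critical finding floors the score so the mean
        # cannot average away a dangerous signal.
        score = max(score, 0.7)
    return score
```

With equal weights and no severe findings this reduces to a plain mean; a single critical finding dominates regardless of how mildly the other reviewers scored.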

The composite risk score is not a simple average of the four dimensions either. It uses the maximum of the four as the base and adds a penalty for breadth:

composite = max(dimensions) + 0.1 * count(dimensions > 0.3)

A change that scores 0.8 on security but 0.1 on everything else gets a composite of 0.9: 0.8 plus a single 0.1 breadth penalty. A change that scores 0.5 on all four dimensions also gets a composite of 0.9: 0.5 plus four penalties. Broad, moderate risk is treated as seriously as narrow, high risk.
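The composite rule can be sketched the same way. The cap at 1.0 is an assumption, since each dimension is defined on a 0.0 - 1.0 scale:

```python
def composite_score(dimensions):
    """max(dimensions) plus a 0.1 breadth penalty per dimension above 0.3."""
    base = max(dimensions.values())
    breadth = sum(1 for v in dimensions.values() if v > 0.3)
    # Cap at 1.0 (an assumption) to stay on the same scale as the dimensions.
    return min(base + 0.1 * breadth, 1.0)

# Two dimensions above 0.3 -> base 0.45 plus a 0.2 breadth penalty.
score = composite_score(
    {"security": 0.2, "data": 0.45, "operational": 0.1, "compliance": 0.35}
)
```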

Thresholds and Actions

Risk scores drive concrete actions through configurable thresholds:

Below 0.2: Auto-approve eligible. The change carries minimal risk across all dimensions. For L0 and L1 changes, this can trigger automatic merge (if your policy allows it). Evidence bundle is still sealed.

0.2 to 0.4: Standard review. Moderate risk detected. Findings are surfaced to the developer. Acknowledgment may be required before merge depending on lane configuration.

0.4 to 0.6: Elevated attention. Significant risk in one or more dimensions. Multi-model consensus required. Specific findings must be resolved, not just acknowledged.

0.6 to 0.8: High risk. Mandatory human review regardless of AI consensus. All findings require explicit resolution with documented reasoning.

Above 0.8: Critical. Merge blocked until senior reviewer approval. The change is flagged for security team attention. Full evidence bundle with maximum detail.

These thresholds are defaults. You set them based on your risk tolerance. A financial services team might lower every threshold by 0.1. A hackathon project might raise them. The point is that the thresholds are explicit, configurable, and auditable.
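The default bands can be expressed as a small policy table. The boundary handling (half-open ranges) and the shift parameter for tightening or loosening thresholds are assumptions for illustration, not GuardSpine's configuration format:

```python
# Default bands from the section above; boundary handling is an assumption.
DEFAULT_BANDS = [
    (0.2, "auto-approve eligible"),
    (0.4, "standard review"),
    (0.6, "elevated attention"),
    (0.8, "mandatory human review"),
]

def action_for(score, bands=DEFAULT_BANDS, shift=0.0):
    """Map a composite risk score to a review action.

    `shift` moves every boundary at once, e.g. shift=-0.1 for the
    more conservative financial-services posture described above.
    """
    for upper, action in bands:
        if score < upper + shift:
            return action
    return "merge blocked pending senior approval"
```

Because the table is data rather than scattered conditionals, the policy stays explicit, diffable, and auditable.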

Aggregation Across Reviewers

When multiple AI models review the same PR, each produces its own risk scores. GuardSpine aggregates these using a method designed to amplify consensus and preserve dissent.

If all three models score security risk between 0.3 and 0.4, the aggregated security score lands in that range. The models agree, and the consensus is reflected.

If two models score security risk at 0.1 and one scores it at 0.7, the aggregated score is not 0.3. The outlier is preserved as a dissent flag, and the aggregated score is pulled toward the outlier: typically landing around 0.4 to 0.5, with the dissent documented in the evidence bundle.

This is deliberate. A single model flagging a concern that others missed is exactly the scenario multi-model review is designed to catch. Averaging away that signal defeats the purpose.
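A dissent-preserving aggregation can be sketched as follows. The spread threshold and pull factor are illustrative assumptions; only the behavior, pulling toward a lone high outlier instead of averaging it away, comes from the description above:

```python
def aggregate_with_dissent(scores, spread_threshold=0.3, pull=0.5):
    """Aggregate one dimension's scores across reviewers.

    Returns (aggregated_score, dissent_flag). When the highest score
    diverges sharply from the mean, the result is pulled toward the
    outlier and the dissent is flagged for the evidence bundle.
    """
    mean = sum(scores) / len(scores)
    high = max(scores)
    if high - mean > spread_threshold:
        # Pull the aggregate partway toward the outlier and flag dissent.
        return mean + pull * (high - mean), True
    return mean, False

# Two reviewers at 0.1, one at 0.7: the aggregate lands near 0.5
# with a dissent flag, not at the plain mean of 0.3.
score, dissent = aggregate_with_dissent([0.1, 0.1, 0.7])
```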

Dashboard Visibility

Risk scores create a dataset that compounds in value over time. The GuardSpine dashboard surfaces:

Per-PR risk profiles. The four-dimension breakdown for every governed PR. Click any PR and see exactly where the risk lives and which reviewers flagged what.

Team risk trends. Weekly and monthly aggregates showing how your team’s risk profile is changing. Is security risk trending up? Are compliance-relevant changes increasing? The trend lines answer these questions before incidents do.

Hotspot identification. Files and modules that consistently score high across PRs. If src/auth/middleware.ts appears in high-risk PRs three times in a month, it is a hotspot that deserves architectural attention, not just review attention.

Threshold effectiveness. How often do changes in each score range actually cause incidents? This feedback loop lets you calibrate thresholds based on outcomes, not guesses.

Why Scores Beat Opinions

A risk score is not an opinion. It is a computed value derived from specific findings, weighted by severity, aggregated across independent reviewers, and recorded in a tamper-evident evidence bundle.

An opinion says “this looks fine.” A score says “this change carries 0.35 security risk due to a new unparameterized database query on line 47, flagged by two of three reviewers, with the third reviewer not evaluating SQL-specific rules.”

The score is actionable. The score is auditable. The score is comparable across changes, across teams, across time. The opinion is none of those things.

If you want to see risk scores computed against your actual PRs, book a demo at cal.com/davidyoussef/guardspine. I will show you the four-dimension breakdown for your last week of merged changes.