Governance Observability: The Metrics Nobody Tracks (Yet)

DevOps has observability. Governance does not. Here are the metrics you should be tracking: review coverage, evidence completeness, model disagreement, escalation frequency, and policy drift.

You would never run a production service without metrics on latency, error rate, and throughput. But most organizations run their entire code review process with zero instrumentation. They have no idea what percentage of changes are actually reviewed, how long reviews take, or whether the quality of reviews is improving or degrading.

I built GuardSpine’s observability layer because I got tired of answering “how is governance going?” with “I think it’s fine.”

The Five Metrics That Matter

After running governance across hundreds of repositories and talking to dozens of teams, I have narrowed the essential metrics to five. Everything else is either derived from these or vanity.

1. Review Coverage Ratio

Definition: The percentage of merged PRs that received a complete governance review before merge.

This sounds obvious. It is not. In most organizations, the assumed coverage is 100% — every PR requires an approval before merge. The actual coverage is lower because of:

  • Emergency merge bypasses (admin merge without review)
  • Bot-generated PRs that skip human review
  • Dependency update PRs that get rubber-stamped
  • PRs where the reviewer clicks approve without reading the diff

GuardSpine tracks this by comparing the set of merged PRs against the set of PRs with completed evidence bundles. If a PR was merged but has no bundle, it was not governed. If it has a bundle but the bundle is incomplete (missing policy evaluations or AI review verdicts), it was partially governed.

Target: 95%+ for production branches. Below 90% means your bypass process is too permissive. At 100%, you probably have not accounted for legitimate emergencies.
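The comparison itself is simple set logic. Here is a minimal sketch, assuming hypothetical data shapes: merged PRs as a set of IDs, and bundles as a mapping from PR ID to the set of evidence items present. The required-item names are illustrative, not GuardSpine's actual schema.

```python
# Required evidence items — illustrative names, not the real schema.
REQUIRED_ITEMS = {"diff", "risk_tier", "ai_verdicts", "policy_evaluations"}

def review_coverage(merged_prs: set[int], bundles: dict[int, set[str]]) -> dict[str, float]:
    """Classify each merged PR as fully, partially, or un-governed."""
    full = partial = ungoverned = 0
    for pr in merged_prs:
        items = bundles.get(pr)
        if items is None:
            ungoverned += 1          # merged with no bundle at all
        elif REQUIRED_ITEMS <= items:
            full += 1                # bundle contains every required item
        else:
            partial += 1             # bundle exists but is missing items
    total = len(merged_prs) or 1
    return {
        "coverage_ratio": full / total,
        "partial_ratio": partial / total,
        "ungoverned_ratio": ungoverned / total,
    }
```

The coverage ratio only counts fully governed PRs; partially governed PRs are tracked separately so they do not quietly inflate the headline number.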

2. Evidence Bundle Completeness

Definition: The percentage of evidence bundles that contain all required evidence items as defined by the active policy.

A bundle exists, but is it complete? An org policy might require: diff, risk tier assessment, at least two AI reviewer verdicts, policy evaluation results, and a rationale for any override. If the bundle has the diff and the risk tier but is missing the AI verdicts, it is incomplete.

Incompleteness usually means a timeout. An AI model did not respond in time, and the pipeline moved on without it. Or a policy evaluation errored and was skipped. These are not catastrophic, but they are compliance gaps. An auditor who sees incomplete bundles will ask questions you do not want to answer.

Target: 98%+ completeness. Track the reasons for incompleteness. If one AI model is responsible for 80% of missing verdicts, you have a reliability problem with that model, not a governance problem.
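Tracking the reasons is just a tally of which items go missing. A sketch, again with hypothetical data shapes (each bundle as a set of evidence item names):

```python
from collections import Counter

def completeness_report(bundles: list[set[str]], required: set[str]) -> tuple[float, Counter]:
    """Return the completeness rate plus a tally of which items go missing."""
    missing: Counter = Counter()
    complete = 0
    for items in bundles:
        gaps = required - items
        if not gaps:
            complete += 1
        missing.update(gaps)     # count each missing item by name
    rate = complete / len(bundles) if bundles else 1.0
    return rate, missing
```

If the tally shows one item (say, a particular model's verdict) dominating the missing counts, that points at a reliability fix, not a governance fix.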

3. Model Disagreement Rate

Definition: The percentage of PRs where AI reviewers reach different verdicts (one approves, another flags concerns).

This is the most interesting metric in the set. Low disagreement means your models are redundant — they all see the same things. High disagreement means either the change is genuinely ambiguous or one of your models is miscalibrated.

GuardSpine uses multi-model review by default. When Claude approves a PR and GPT-4o flags a potential issue, that disagreement is informative. It means the change has characteristics that different models evaluate differently. Those are exactly the changes that need human attention.

Track disagreement rate over time. A sudden spike means something changed — new code patterns, a model update, or a policy change that one model handles differently than another. A steady rate of 8-15% is healthy. Below 5% suggests your models are too similar. Above 20% suggests your policies are ambiguous.

Derived metric: Disagreement resolution. When models disagree, who was right? Track the outcome of human review on disagreement cases. If the human consistently sides with Model A, Model B might need recalibration or replacement.
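Both the headline rate and the resolution tally fall out of the same structured verdict data. A sketch, assuming hypothetical shapes (per-PR verdicts as a model-name-to-verdict mapping):

```python
from collections import Counter

def disagreement_rate(verdicts: dict[int, dict[str, str]]) -> float:
    """Fraction of PRs whose AI reviewers did not all reach the same verdict."""
    disagreed = sum(1 for v in verdicts.values() if len(set(v.values())) > 1)
    return disagreed / len(verdicts) if verdicts else 0.0

def resolution_tally(disagreements: dict[int, dict[str, str]],
                     human_outcomes: dict[int, str]) -> Counter:
    """On disagreed PRs, count how often each model matched the human verdict."""
    wins: Counter = Counter()
    for pr, model_verdicts in disagreements.items():
        human = human_outcomes.get(pr)
        for model, verdict in model_verdicts.items():
            if verdict == human:
                wins[model] += 1
    return wins
```

A lopsided tally over enough cases is the recalibration signal: if one model almost never matches the human outcome on contested PRs, it is adding noise, not coverage.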

4. Escalation Frequency

Definition: The percentage of PRs that escalate beyond the default review tier.

GuardSpine auto-resolves L0-L1 changes and escalates L2+ for human review. But within the human review tier, further escalation happens when a reviewer flags something that exceeds their authority — a junior reviewer encountering a security-critical change, or a team lead discovering a cross-service impact that needs architecture review.

Escalation frequency tells you whether your risk tiering is calibrated correctly.

Too many escalations (>15%): Your base tier is too low. Changes that should be L2 are being classified as L1, then escalated when the reviewer realizes the complexity. Raise the tier thresholds.

Too few escalations (<2%): Either your risk tiering is perfect (unlikely) or your reviewers are not escalating when they should be. Check whether L2 changes are being approved without the scrutiny they deserve.

Derived metric: escalation turnaround time. How long does an escalated review take? If escalations sit for days, you have a capacity problem at the senior reviewer level. This is an early warning signal for review bottlenecks.

5. Policy Drift

Definition: The delta between the organization’s stated governance policy and the effective policy being enforced across repositories.

Policy drift happens in two directions:

Strictness drift: Teams add local overrides that make their policy stricter than the org baseline. This is usually fine but can cause unnecessary friction. If 40% of teams have raised their threshold above the org baseline, maybe the org baseline should be higher.

Leniency drift: Teams request exceptions, add exclusion paths, or modify review requirements. Each individual exception is justified. In aggregate, they erode the baseline. After six months, the “centralized” policy has been overridden in so many places that it is effectively advisory.

GuardSpine tracks drift by diffing each repository’s effective policy against the org baseline. The drift score is the number of policy fields that differ from baseline, weighted by the field’s security impact. A change to excluded paths is low-weight. A change to the minimum risk threshold is high-weight.

Target: Review drift scores monthly. Any repository with a drift score above the threshold gets flagged for policy review. The security team decides whether to accept the drift, update the baseline, or revoke the override.
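The weighted diff is the whole mechanism. A minimal sketch, with hypothetical field names and weights (the real weighting is policy-defined, not these constants):

```python
# Hypothetical weights: low for cosmetic fields, high for security-impacting ones.
FIELD_WEIGHTS = {
    "excluded_paths": 1,
    "review_requirements": 3,
    "min_risk_threshold": 5,
}

def drift_score(baseline: dict, effective: dict) -> int:
    """Sum the weights of policy fields where a repo differs from the org baseline."""
    return sum(
        FIELD_WEIGHTS.get(field, 1)      # unknown fields default to low weight
        for field in baseline
        if effective.get(field) != baseline[field]
    )
```

With weights like these, a repo that only tweaks excluded paths scores 1, while one that lowers the minimum risk threshold scores 5 and gets flagged first.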

Instrumenting Governance

These five metrics require instrumentation. You cannot compute review coverage ratio from git logs alone. You cannot measure evidence completeness without parsing bundles. You cannot track model disagreement without structured review output.

GuardSpine exposes these metrics through three channels:

Webhook events. Every governance action emits a structured event: PR triaged, review started, verdict reached, bundle sealed, escalation triggered. Forward these to your observability stack — Datadog, Grafana, Splunk, whatever you already use.

API endpoints. Query governance metrics programmatically. “Give me review coverage for the payments team over the last 30 days.” “Show me every PR where two or more model verdicts disagree.” “List repositories with policy drift scores above 5.”

Dashboard. A built-in view for teams that do not want to build custom dashboards. Coverage, completeness, disagreement, escalation, and drift on a single screen with 7-day and 30-day trends.
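On the webhook side, a consumer only needs to map event types onto counters. A sketch, assuming a hypothetical payload shape (a dict with a `type` field) and a stand-in metric sink in place of a real client like statsd or the Datadog agent:

```python
from collections import Counter

class MetricSink:
    """Stand-in for a real metrics client (statsd, Datadog, Prometheus, ...)."""
    def __init__(self) -> None:
        self.counters: Counter = Counter()

    def increment(self, name: str) -> None:
        self.counters[name] += 1

def handle_event(event: dict, sink: MetricSink) -> None:
    """Map governance events onto the counters the five metrics are built from."""
    routes = {
        "pr_triaged": "governance.prs_triaged",
        "verdict_reached": "governance.verdicts",
        "bundle_sealed": "governance.bundles_sealed",
        "escalation_triggered": "governance.escalations",
    }
    metric = routes.get(event.get("type"))
    if metric:
        sink.increment(metric)   # unknown event types are ignored
```

Coverage and escalation frequency are then ratios over these counters, computed in whatever observability stack receives them.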

The Feedback Loop

Metrics without action are just numbers. The value of governance observability is the feedback loop it enables.

Low coverage? Investigate bypass patterns and tighten branch protection rules.

Incomplete bundles? Check AI model reliability and increase timeout budgets.

High disagreement? Review the disputed PRs and recalibrate model selection.

Escalation bottleneck? Hire more senior reviewers or adjust tier thresholds to reduce escalation volume.

Policy drift? Conduct a quarterly policy review and update the baseline.

Each metric has a clear action. Each action improves the next measurement cycle. This is how governance goes from a checkbox exercise to a continuously improving system.

You instrument uptime because downtime costs money. You should instrument governance because unreviewed changes cost more.

Book a call if you want to set up governance observability for your organization. I will show you the dashboard.