
Making AI Claims Trustworthy with Epistemic Notation

VERIX is a grammar for AI honesty: explicit claim properties that make AI outputs auditable and trustworthy.

Your AI says “this code change is safe.” How confident is it? Based on what evidence? Is that witnessed or inferred? Most AI systems can’t answer these questions. That’s a governance problem.

Every time an AI makes a claim, it’s making an implicit contract with whoever reads it. But the terms of that contract — the confidence, the evidence basis, the speech act type — are invisible. We’re trusting outputs with no metadata. That’s not trust. That’s hope.

I built VERIX to fix this.


The Problem: Naked Claims

Consider what happens when an AI reviews a pull request and says: “No security vulnerabilities detected.”

That statement could mean wildly different things. It could mean the model scanned the diff and found nothing suspicious (shallow pattern match, low confidence). It could mean it traced every input path and verified sanitization (deep analysis, high confidence). It could mean it’s parroting the output of a static analysis tool that already ran (reported evidence, ceiling set by the tool’s accuracy).

The consumer of that claim has no way to distinguish these scenarios. The words are identical. The epistemic status is completely different.

This isn’t academic. In regulated industries — healthcare, finance, biotech — the difference between “I observed this directly” and “I inferred this from partial data” is the difference between an auditable finding and a liability.


VERIX: A Grammar for Honest Claims

VERIX (Verified Epistemic Reasoning and Information eXchange) is a notation system that forces AI claims to carry their epistemic metadata explicitly. Every claim gets tagged with its properties.

The full grammar:

[illocution|affect] content [ground:source] [conf:X.XX] [state:status]

That looks dense. Let me break it apart.

Illocution is the speech act type — what the statement is doing. Five categories, borrowed from speech act theory:

  • assert: stating a fact or belief (“this function has no side effects”)
  • query: asking for information (“does this endpoint validate input?”)
  • direct: issuing an instruction (“run the test suite before merging”)
  • commit: making a promise (“I will flag any mutation of shared state”)
  • express: conveying an evaluation (“this architecture concerns me”)

Most AI outputs are unlabeled assertions. That’s a problem because assertions and expressions get conflated constantly. “This code is risky” — is the model stating a measured fact or expressing a subjective evaluation? The illocution tag forces the distinction.

Affect is optional emotional or evaluative coloring. It matters less for technical systems but becomes critical when AI communicates with humans about ambiguous situations. A finding tagged [assert|concern] reads differently than [assert|neutral], and it should.
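To make the grammar concrete, here is a minimal parsing sketch for a single VERIX-tagged line. The regex mirrors the field layout shown above; the `Claim` dataclass and `parse_claim` helper are illustrative names I'm introducing here, not part of any official VERIX implementation.

```python
# Sketch: parse one VERIX claim line into structured fields.
# Field names follow the grammar; this is illustrative, not a reference parser.
import re
from dataclasses import dataclass
from typing import Optional

VERIX_LINE = re.compile(
    r"\[(?P<illocution>assert|query|direct|commit|express)"
    r"(?:\|(?P<affect>\w+))?\]\s+"          # affect is optional
    r"(?P<content>.+?)\s*"
    r"\[ground:(?P<ground>witnessed|reported|inferred|assumed)\]\s*"
    r"\[conf:(?P<conf>\d\.\d{2})\]\s*"
    r"\[state:(?P<state>\w+)\]"
)

@dataclass
class Claim:
    illocution: str
    affect: Optional[str]
    content: str
    ground: str
    conf: float
    state: str

def parse_claim(line: str) -> Claim:
    m = VERIX_LINE.search(line)
    if m is None:
        raise ValueError(f"not a VERIX claim: {line!r}")
    return Claim(m["illocution"], m["affect"], m["content"],
                 m["ground"], float(m["conf"]), m["state"])

claim = parse_claim(
    "[express|concern] Error handling in the retry logic is sparse "
    "[ground:witnessed] [conf:0.88] [state:flagged]"
)
print(claim.illocution, claim.ground, claim.conf)
```

Once claims are structured data rather than prose, everything downstream — filtering, auditing, rendering — becomes a mechanical operation.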


Grounding: Where Did This Come From?

The ground field is where VERIX earns its keep. Four types:

witnessed — the model directly processed the evidence. It read the code. It executed the test. It parsed the log file. This is first-hand observation within the model’s context window.

reported — the model is relaying information from another source. A static analysis tool said this. A teammate commented this on the PR. The documentation states this. The claim’s reliability is bounded by the source’s reliability.

inferred — the model derived this from other evidence through reasoning. “Function A calls Function B, which writes to disk, therefore Function A has side effects.” The logic chain is traceable but each step introduces uncertainty.

assumed — the model is working from defaults, conventions, or priors. “Standard library functions are presumed safe.” This is the weakest ground, and the most common one that goes unlabeled in current AI systems.
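Because the four ground types form a strict ordering of epistemic strength, a consumer can gate on evidence quality mechanically. A sketch of that idea, assuming the ordering described above (the threshold policy itself is my illustration, not part of VERIX):

```python
# Sketch: rank ground types by epistemic strength and gate claims on them.
# Ordering follows the article; the gating function is illustrative.
GROUND_STRENGTH = {"witnessed": 3, "reported": 2, "inferred": 1, "assumed": 0}

def meets_evidence_bar(ground: str, minimum: str = "inferred") -> bool:
    """True if a claim's ground is at least as strong as `minimum`."""
    return GROUND_STRENGTH[ground] >= GROUND_STRENGTH[minimum]

print(meets_evidence_bar("witnessed"))              # strongest ground passes
print(meets_evidence_bar("assumed"))                # weakest ground fails
print(meets_evidence_bar("reported", "witnessed"))  # fails a stricter bar
```

A compliance pipeline might, for example, refuse to auto-merge on anything weaker than witnessed evidence while allowing inferred claims in advisory comments.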

Here’s what a VERIX-tagged code review looks like in practice:

[assert] No SQL injection vectors found in the auth module
  [ground:witnessed] [conf:0.92] [state:verified]

[assert] The payment module appears safe
  [ground:inferred] [conf:0.71] [state:provisional]

[express|concern] Error handling in the retry logic is sparse
  [ground:witnessed] [conf:0.88] [state:flagged]

Three findings. Three completely different trust profiles. Without VERIX, they’d all read as flat statements from the same authority. With VERIX, you know which one the model actually examined, which one it reasoned about indirectly, and which one is a subjective evaluation.


Confidence Ceilings: Honesty by Design

Raw confidence scores are meaningless without calibration. A model that says [conf:0.99] on everything is less useful than one that says [conf:0.72] and means it.

VERIX enforces confidence ceilings by evidence type:

| Source Type | Ceiling | Rationale |
| --- | --- | --- |
| model (self-generated) | 0.95 | Models have known blind spots and hallucination risk |
| doc (documentation) | 0.98 | Docs can be outdated but are generally reliable |
| user (human-provided) | 1.00 | User-stated facts are taken at face value |
| tool (external tool output) | 0.97 | Tools are deterministic but can have bugs |

A model cannot claim 0.99 confidence on its own analysis. The ceiling won’t allow it. This is a structural constraint, not a suggestion. It encodes humility into the grammar itself.

The effect is practical: when you see [conf:0.95] with [ground:witnessed], you know that’s the maximum the system will ever claim for self-generated analysis. It passed every internal check. When you see [conf:0.60] with [ground:inferred], you know the model is being honest about a long inference chain with compounding uncertainty.
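The ceiling rule is simple enough to state in a few lines. Here's a sketch, using the ceilings from the table above; the clamping function name is mine, and a real implementation would likely reject rather than silently clamp:

```python
# Sketch: enforce confidence ceilings by evidence source type.
# Ceiling values come from the article's table; the policy is illustrative.
CONF_CEILING = {"model": 0.95, "doc": 0.98, "user": 1.00, "tool": 0.97}

def apply_ceiling(conf: float, source: str) -> float:
    """Clamp a claimed confidence to the structural ceiling for its source."""
    return min(conf, CONF_CEILING[source])

print(apply_ceiling(0.99, "model"))  # clamped: self-generated analysis is capped
print(apply_ceiling(0.80, "tool"))   # below the ceiling, passes through unchanged
```

The point is that the cap lives in the grammar layer, not in the model's goodwill: no prompt tweak can talk the system into claiming 0.99 on its own analysis.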


Compression Levels: Context-Appropriate Detail

Not every consumer needs full VERIX markup. A developer reading a code review wants different detail than an audit log. VERIX handles this with three compression levels:

L0 (full notation) — every tag present. Used for internal reasoning, audit trails, and machine-to-machine communication. This is the system of record.

[assert|neutral] Function auth_user has no side effects
  [ground:witnessed] [conf:0.91] [state:verified]

L1 (inline summary) — compressed to a single line with key metadata. Used for developer-facing output where context matters but verbosity doesn’t.

auth_user has no side effects (witnessed, 0.91, verified)

L2 (natural language) — VERIX metadata is woven into prose. Used for executive summaries, non-technical stakeholders, and anywhere formalism would create friction.

I directly examined auth_user and found no side effects.
I'm highly confident in this finding.

The underlying data is identical. The compression level is a presentation choice, not an information choice. Any L2 statement can be decompressed back to L0 for audit.
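Rendering the three levels from one structured claim might look like the following sketch. The L0 form follows the grammar exactly; the L1 and L2 templates are my own illustrative choices (VERIX specifies the levels, not these particular phrasings):

```python
# Sketch: render one claim at each compression level from the same data.
# L0 follows the grammar; L1/L2 templates are illustrative.
def render(claim: dict, level: int) -> str:
    if level == 0:  # full notation: the system of record
        head = claim["illocution"]
        if claim.get("affect"):
            head += "|" + claim["affect"]
        return (f"[{head}] {claim['content']} "
                f"[ground:{claim['ground']}] [conf:{claim['conf']:.2f}] "
                f"[state:{claim['state']}]")
    if level == 1:  # inline summary for developer-facing output
        return (f"{claim['content']} "
                f"({claim['ground']}, {claim['conf']:.2f}, {claim['state']})")
    # level 2: metadata woven into prose
    verb = {"witnessed": "directly examined the evidence for",
            "reported": "am relaying",
            "inferred": "inferred",
            "assumed": "am assuming"}[claim["ground"]]
    strength = "highly" if claim["conf"] >= 0.85 else "moderately"
    return f"I {verb} this finding: {claim['content']}. I'm {strength} confident in it."

c = {"illocution": "assert", "affect": "neutral",
     "content": "Function auth_user has no side effects",
     "ground": "witnessed", "conf": 0.91, "state": "verified"}
print(render(c, 1))
# Function auth_user has no side effects (witnessed, 0.91, verified)
```

Because every level is generated from the same structured claim, decompressing an L2 statement back to L0 for audit is just re-rendering, not reconstruction.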


The Connection to Evidence Bundles

If you’ve read about evidence bundles in code governance — the idea that every AI decision should ship with its supporting artifacts — VERIX is formalizing the same trust problem at the claim level.

An evidence bundle says: “Here is the diff, the test results, the risk score, and the reviewer’s analysis, packaged together so you can verify the decision.” That’s artifact-level trust.

VERIX says: “Here is what this specific claim is doing, what evidence supports it, how confident the system is, and what the current status is.” That’s statement-level trust.

They’re complementary. An evidence bundle without VERIX-tagged claims is a box of artifacts with no guide to what they mean. VERIX claims without evidence bundles are metadata with nothing to point at. Together, they create auditable AI from the statement level up to the artifact level.

This matters for compliance. SOC 2, ISO 27001, FDA 21 CFR Part 11 — these frameworks don’t care about your model’s architecture. They care about traceability. Can you show what the system claimed, on what basis, with what confidence, and link it to supporting evidence? VERIX plus evidence bundles answers yes.


Why This Matters Now

The AI trust gap is widening. Models are getting more capable and more confidently wrong at the same time. The solution isn’t better models — it’s better metadata.

VERIX doesn’t make AI smarter. It makes AI honest about what it knows and doesn’t know. It forces every claim through a filter that asks: what kind of statement is this, where did the evidence come from, and how sure are you?

That’s not a nice-to-have for teams shipping AI into production. It’s the difference between an AI system you can defend to auditors and one you’re praying doesn’t embarrass you.

The grammar is simple. The discipline it enforces is not. But discipline is what separates tools from toys in regulated environments.


Implementing trustworthy AI? Let’s discuss your epistemic requirements: https://cal.com/davidyoussef