
WHO/WHEN/PROJECT/WHY: A Tagging Protocol for AI Memory

Metadata that makes AI memory searchable and accountable. Four required fields that turn write-only vector stores into auditable knowledge systems.

Most teams store AI outputs. Almost none store why the AI made that decision, who asked it, or what project it was for. Six months later, the audit trail is worthless.

I watched this play out across a dozen enterprise deployments. Teams would dump thousands of AI-generated artifacts into a vector store, celebrate their “knowledge base,” then discover they couldn’t answer basic questions. Which agent produced this recommendation? Was it responding to a customer request or an internal review? What project was it scoped to?

The artifacts were there. The context was gone.

Unstructured Memory Is a Write-Only Database

Here’s the failure pattern. An AI agent generates a risk assessment. It gets embedded into ChromaDB or Pinecone with maybe a timestamp and a content hash. Three months later, someone queries “risk assessments for Project Alpha” and gets back fifteen results with no way to distinguish which ones were final recommendations versus exploratory drafts, which agent produced them, or what prompted the analysis in the first place.

You built a search engine for text similarity. You did not build a memory system.

The difference matters when compliance asks who authorized a decision. It matters when you’re debugging why an agent went off the rails last Tuesday. It matters when two projects share terminology and your vector search can’t tell them apart.

Unstructured memory scales beautifully for ingestion and terribly for accountability.

The Four Fields

After building memory systems that actually survived audits, I landed on four required metadata fields for every memory entry. No exceptions, no optional fields, no “we’ll add it later.”

WHO: Agent attribution. Not just “Claude” or “GPT-4” but the specific agent identity, role, and permission tier. In a multi-agent system, you might have six agents all running the same model. WHO distinguishes the code review agent from the security audit agent from the planning agent. This field also captures the human requestor when applicable — the person who initiated the chain.

WHEN: Temporal context with three layers. Creation timestamp is obvious. But you also need decision-window — the time range of data the agent considered when forming its output. And you need expiry — when this memory should decay or be flagged for review. A market analysis from January shouldn’t carry the same weight in July.

PROJECT: Scope binding. Every memory entry belongs to exactly one project scope. This sounds simple until you realize most organizations let AI agents work across project boundaries without tracking which context informed which output. PROJECT creates hard partitions. An agent working on Project Alpha cannot accidentally retrieve and act on Project Beta memories unless explicitly granted cross-project access.

WHY: Intent classification. This is the field most teams skip and the one that matters most. WHY captures the intent type behind the memory creation. I use eight categories:

  • DECISION: A choice was made between alternatives
  • OBSERVATION: A fact was recorded without action
  • RECOMMENDATION: A suggested course of action
  • CONSTRAINT: A limitation or requirement was identified
  • QUESTION: An unresolved inquiry was logged
  • CORRECTION: A previous memory was amended
  • ESCALATION: An issue was flagged for human review
  • SYNTHESIS: Multiple inputs were combined into a conclusion

These eight types cover every memory entry I’ve encountered in production. The category determines how the memory gets weighted in future retrievals. A DECISION carries more weight than an OBSERVATION. A CORRECTION supersedes the entry it references. An ESCALATION triggers human-in-the-loop workflows.

Enforcement Through Hooks

Tagging protocols only work if they’re mandatory. The moment you make metadata optional, agents skip it, developers forget it, and your memory system degrades back to an unstructured blob.

I enforce the four fields through insertion hooks. Every write to the memory system passes through a validation layer that rejects entries missing any of the four fields. No WHO? Rejected. No WHY intent type? Rejected. No PROJECT scope? Rejected.

The hook runs in under 2 milliseconds. That’s negligible latency for permanent accountability.
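A minimal sketch of that validation layer, assuming entries arrive as plain metadata dicts. The `MemoryStore` wrapper and field names are illustrative, not a specific vector-DB API; the point is that the check sits in front of every write and fails closed.

```python
REQUIRED_FIELDS = ("who", "when", "project", "why")
VALID_INTENTS = {"DECISION", "OBSERVATION", "RECOMMENDATION", "CONSTRAINT",
                 "QUESTION", "CORRECTION", "ESCALATION", "SYNTHESIS"}

class RejectedWrite(Exception):
    """Raised when an entry is missing any of the four required fields."""

def validate_entry(metadata: dict) -> None:
    # Hook: reject before the entry ever reaches the store.
    for f in REQUIRED_FIELDS:
        if not metadata.get(f):
            raise RejectedWrite(f"missing required field: {f}")
    if metadata["why"] not in VALID_INTENTS:
        raise RejectedWrite(f"unknown intent type: {metadata['why']}")

class MemoryStore:
    """Toy store; swap the list for your vector DB's insert call."""
    def __init__(self):
        self._entries = []

    def write(self, content: str, metadata: dict) -> None:
        validate_entry(metadata)   # mandatory — no bypass path
        self._entries.append({"content": content, **metadata})
```

The validation itself is a dozen lines; the discipline comes from making `write` the only path into the store.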

On the retrieval side, every query automatically scopes by PROJECT unless the caller has explicit cross-project permissions. This prevents the most common contamination pattern — an agent pulling context from the wrong project because the text happened to be semantically similar.
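The retrieval-side guard can be sketched as a single function, assuming entries are metadata dicts with a `project` field. The grant mechanism shown here is hypothetical; the invariant is that the caller's project is always injected into the query rather than trusted from query text.

```python
def scoped_query(entries, caller_project, filters=None, cross_project_grants=()):
    """Return entries in the caller's project (plus explicit grants) matching filters."""
    allowed = {caller_project, *cross_project_grants}
    filters = filters or {}
    return [e for e in entries
            if e["project"] in allowed
            and all(e.get(k) == v for k, v in filters.items())]
```

Semantic similarity never enters the scope check, so a Project Beta entry can't leak into an Alpha query no matter how similar the text is.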

Searchability Changes Everything

With four-field tagging, queries that were impossible become trivial.

“Show me all DECISION memories from the security-audit agent on Project Alpha in the last 30 days.” That’s a structured query against four indexed fields. Sub-millisecond response. No vector search needed.

“Find all CORRECTION entries that reference a specific DECISION.” That’s a graph traversal on the WHY field with reference linking. You can trace the full revision history of any decision.

“Which agent generated the most ESCALATION entries last week?” That’s an aggregation query that surfaces agents that might be hitting capability limits or encountering novel edge cases.

None of this is possible with embeddings alone. Vector search answers “what sounds like this?” Structured metadata answers “what happened, who did it, and why.”
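The three queries above reduce to plain metadata filters. A toy sketch over tagged entries held as dicts — the dataset, IDs, and `query` helper are hypothetical, and a real system would run these against indexed fields in a database rather than a Python list.

```python
from collections import Counter
from datetime import datetime, timedelta

def query(entries, **conditions):
    """Exact-match filter on metadata fields."""
    return [e for e in entries if all(e.get(k) == v for k, v in conditions.items())]

now = datetime(2025, 6, 1)
entries = [
    {"id": "m1", "who": "security-audit-agent", "project": "alpha",
     "why": "DECISION", "created_at": now - timedelta(days=10)},
    {"id": "m2", "who": "code-review-agent", "project": "alpha",
     "why": "CORRECTION", "created_at": now - timedelta(days=2),
     "references": ["m1"]},
    {"id": "m3", "who": "planning-agent", "project": "beta",
     "why": "ESCALATION", "created_at": now - timedelta(days=3)},
]

# 1. DECISION memories from the security-audit agent on Project Alpha, last 30 days
recent = [e for e in query(entries, who="security-audit-agent",
                           project="alpha", why="DECISION")
          if e["created_at"] >= now - timedelta(days=30)]

# 2. CORRECTION entries that reference a specific DECISION
corrections = [e for e in query(entries, why="CORRECTION")
               if "m1" in e.get("references", [])]

# 3. Which agent generated the most ESCALATION entries last week
last_week = [e for e in query(entries, why="ESCALATION")
             if e["created_at"] >= now - timedelta(days=7)]
top_escalators = Counter(e["who"] for e in last_week).most_common()
```

No embedding model is involved in any of the three; that is the whole argument for carrying structured metadata alongside the vectors.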

The Evidence Bundle Connection

If you’ve read my previous writing on evidence bundles for code governance, the WHO/WHEN/PROJECT/WHY protocol will look familiar. It’s the same principle applied to memory instead of code review.

An evidence bundle captures: what was reviewed, who reviewed it, what they found, and what evidence supports the finding. WHO/WHEN/PROJECT/WHY captures: what was remembered, who generated it, what scope it belongs to, and what intent produced it.

They’re the same accountability pattern at different layers of the stack. Evidence bundles govern code changes. Memory tagging governs AI reasoning. Together, they create an audit trail from “the agent thought X” all the way through “and here’s the code change that resulted, with cryptographic proof of review.”

This is what real AI governance looks like. Not a policy document. Not a dashboard. A structural guarantee that every decision is traceable from thought to action.

The Compound Effect

The real payoff shows up over months, not days.

Week one, you have a few hundred tagged memories. Useful for basic queries but nothing you couldn’t manage manually.

Month three, you have thousands. Patterns emerge. You can see which projects generate the most security findings. Which agents produce the most referenced decisions. Which intent categories correlate with high-value knowledge.

Month six, the tagged memory store becomes your institutional knowledge base. New team members can query it: “Show me all architecture decisions for the payments system.” Instead of reading stale documentation or asking senior engineers to repeat themselves, they get timestamped, attributed, context-rich knowledge with full provenance.

Year one, you have a longitudinal record of every AI-assisted decision your team has made. You can answer questions like “when did we decide to switch from REST to gRPC and why?” with a single query instead of an archaeological dig through Slack threads.

None of this works without consistent tagging from day one. Retroactively tagging thousands of unstructured entries is a project nobody wants to do. The discipline has to be enforced at the system level, not the human level.

What This Costs

Storage overhead per memory entry: roughly 200 bytes for the four metadata fields. On a system processing 10,000 memory writes per day, that’s 2MB of additional storage daily. Essentially free.

Query performance improvement: 10-50x faster for scoped retrievals compared to vector-only search, because you’re hitting indexed fields instead of computing cosine similarity across your entire corpus.

Implementation time: a senior engineer can add four-field tagging to an existing memory system in two days. The hook validation layer is maybe 40 lines of code. The query interface changes are another 100.

The cost of not doing it is an AI system that can’t explain itself six months from now. That’s the cost that actually matters.


Want traceable AI decisions? Let’s implement accountable memory.