The Dogfooding System: AI That Tests Itself
My connascence analyzer runs on code that my AI agents generate. The agents use the analyzer’s feedback to improve their next generation. The tool tests the maker.
That’s dogfooding — using your own output as your own input. The term comes from Microsoft in the 1980s, when the company started using pre-release Windows internally. If Windows crashed for Microsoft employees, it got fixed before customers saw it. The people building the product were also the people suffering from its bugs.
I’ve taken this concept and applied it to AI code generation. My AI agents write code. My quality tools analyze that code. The analysis feeds back to the agents. The agents adjust. The quality tools analyze the adjusted output. The cycle repeats.
The result is AI that gets measurably better at generating code, not because the model improved, but because the feedback loop shaped its behavior.
How the Cycle Works
The cycle has four stages.
Stage 1: Code Generation
An AI agent writes code. This might be a new module, a refactoring of existing code, a test suite, or a bug fix. The agent operates within a project context — it sees the existing codebase, understands the project’s conventions, and has access to the project’s quality standards.
The output is a code artifact: a file or set of files that accomplish a specific task.
Stage 2: Quality Analysis
The connascence analyzer runs on the generated code. Connascence is a measure of coupling between code elements. There are 9 types, ranging from trivial (connascence of name — two modules use the same identifier) to severe (connascence of identity — two modules must reference the exact same object instance).
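As a reference, the nine types can be written down as an ordered enum. This is a sketch for orientation, not the analyzer's actual taxonomy code; the severity weights are my assumptions, chosen only to match the identity-is-5x-name ratio described below:

```python
from enum import Enum

class Connascence(Enum):
    """The nine connascence types, roughly ordered by severity.
    Static forms (detectable from source alone) come first,
    dynamic forms (visible only at runtime) after."""
    NAME = 1        # components must agree on an identifier
    TYPE = 2        # components must agree on a type
    MEANING = 3     # components must agree on the meaning of a value
    POSITION = 4    # components must agree on argument order
    ALGORITHM = 5   # components must implement matching logic
    EXECUTION = 6   # order of execution matters
    TIMING = 7      # timing of execution matters
    VALUE = 8       # values must change together
    IDENTITY = 9    # components must reference the same object

# Hypothetical severity weights, scaled so identity scores 5x name.
SEVERITY = dict(zip(Connascence, (1, 1, 2, 2, 3, 3, 4, 4, 5)))
```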
The analyzer produces a detailed report:
- Coupling count: Total number of connascence instances detected
- Coupling severity: Weighted score based on coupling type (identity coupling is scored 5x higher than name coupling)
- Coupling locality: Whether the coupling is within a module (acceptable) or across module boundaries (problematic)
- Trend: How this code compares to the project’s historical coupling averages
The report isn’t a pass/fail gate. It’s a profile. High coupling isn’t always wrong — sometimes tight coupling is the correct design choice. But high coupling that deviates from the project’s historical patterns is a signal worth investigating.
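A minimal sketch of how such a profile might be scored. The field names, the doubling for cross-module coupling, and the per-type weights are my assumptions, not the analyzer's real implementation:

```python
from dataclasses import dataclass

# Hypothetical weights per connascence type; identity is scored
# 5x higher than name, as described above.
WEIGHTS = {"name": 1, "type": 1, "meaning": 2, "position": 2,
           "algorithm": 3, "execution": 3, "timing": 4,
           "value": 4, "identity": 5}

@dataclass
class Finding:
    kind: str            # one of the nine connascence types
    cross_module: bool   # does the coupling cross a module boundary?

def coupling_score(findings: list[Finding]) -> int:
    """Weighted severity score. Cross-module coupling counts double,
    reflecting that locality matters as much as raw count."""
    return sum(WEIGHTS[f.kind] * (2 if f.cross_module else 1)
               for f in findings)

findings = [Finding("algorithm", True), Finding("name", False)]
print(coupling_score(findings))  # 3*2 + 1 = 7
```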
Stage 3: Feedback Injection
The analysis report feeds back to the AI agent. Not as raw data — as structured feedback that the agent can act on.
The feedback format is:
```
COUPLING REPORT for module: auth_handler.py

- Connascence of Algorithm detected between auth_handler.hash_password()
  and user_model.verify_password(). Both implement bcrypt with matching
  round counts. If the round count changes in one, the other must change.
  Recommendation: Extract round count to shared config.

- Connascence of Timing detected between session_manager.create_session()
  and auth_handler.login(). create_session must be called before login
  returns. Recommendation: Make the temporal dependency explicit via
  return type (return session from create_session, pass it to login response).
```
This is actionable. The agent knows exactly what to fix and how to fix it. Vague feedback like “reduce coupling” doesn’t work — the agent doesn’t know which coupling or how to reduce it. Specific feedback with concrete recommendations works.
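Generating that format from analyzer findings is straightforward templating. A sketch, with dictionary keys that are my own naming, not the analyzer's schema:

```python
def render_feedback(module: str, findings: list[dict]) -> str:
    """Turn analyzer findings into the actionable format above:
    type, the two coupled locations, an explanation, and a concrete fix."""
    lines = [f"COUPLING REPORT for module: {module}"]
    for f in findings:
        lines.append(
            f"- Connascence of {f['kind']} detected between {f['a']} "
            f"and {f['b']}. {f['why']} Recommendation: {f['fix']}"
        )
    return "\n".join(lines)

report = render_feedback("auth_handler.py", [{
    "kind": "Algorithm",
    "a": "auth_handler.hash_password()",
    "b": "user_model.verify_password()",
    "why": "Both implement bcrypt with matching round counts.",
    "fix": "Extract round count to shared config.",
}])
```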
Stage 4: Revision
The agent revises the code based on the feedback. The revised code goes back through Stage 2. If the coupling profile improved, the revision is accepted. If it didn’t, the agent gets another round of feedback with adjusted recommendations.
Most revisions converge in 1-2 cycles. The first revision addresses the structural issues (extracting shared config, making dependencies explicit). If a second revision is needed, it’s usually because the first revision introduced new coupling while fixing the old coupling — a common pattern when agents move code between modules without updating all references.
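The four stages compose into a simple loop. This sketch treats the agent and analyzer as injected callables (`generate`, `analyze`, `render_feedback` are placeholders), and accepts a revision only when the coupling score actually improved:

```python
def refine(task, generate, analyze, render_feedback, max_cycles=3):
    """Generate -> analyze -> feed back -> revise, until the coupling
    score stops improving or the cycle budget runs out.
    `analyze` returns a numeric coupling score (lower is better)."""
    code = generate(task)
    best = analyze(code)
    for _ in range(max_cycles):
        if best == 0:
            break
        feedback = render_feedback(code, best)
        revised = generate(task, feedback=feedback)
        score = analyze(revised)
        if score >= best:      # no improvement: keep the prior version
            break
        code, best = revised, score
    return code, best
```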
Violation Pattern Storage
Every coupling violation gets stored in Memory MCP — my cross-session persistence layer. The storage isn’t just a log. It’s a searchable index of patterns.
When the connascence analyzer detects a coupling violation, it stores:
- The violation type (which of the 9 connascence types)
- The code pattern that caused it (e.g., “duplicated algorithm across two modules”)
- The project and module context
- The recommended fix
- Whether the agent successfully fixed it on the first try
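A sketch of what one stored record might look like. The field names are my assumptions, mirroring the list above, not Memory MCP's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Violation:
    kind: str            # which of the nine connascence types
    pattern: str         # e.g. "duplicated algorithm across two modules"
    project: str
    module: str
    recommended_fix: str
    fixed_first_try: bool
```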
This storage serves two purposes.
Pattern recognition: When the same violation type appears across multiple projects, it indicates a systematic tendency in the AI’s code generation. If the agent keeps creating connascence of algorithm by duplicating hash functions, the root cause isn’t any individual code generation — it’s a gap in the agent’s understanding of shared configuration patterns.
Fix rate tracking: For each violation type, I can measure: how often the agent produces it (frequency), how often it fixes it on the first try (first-fix rate), and how many revision cycles it takes on average (convergence speed). These metrics tell me whether the feedback loop is actually working.
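The three metrics fall out of the stored records directly. A sketch over plain dicts, where the record shape (including a per-record `cycles` count) is my assumption:

```python
def fix_metrics(violations: list[dict], kind: str) -> dict:
    """Frequency, first-fix rate, and convergence speed for one
    violation type, computed from stored violation records."""
    hits = [v for v in violations if v["kind"] == kind]
    if not hits:
        return {"frequency": 0, "first_fix_rate": None, "avg_cycles": None}
    return {
        "frequency": len(hits),
        "first_fix_rate": sum(v["fixed_first_try"] for v in hits) / len(hits),
        "avg_cycles": sum(v["cycles"] for v in hits) / len(hits),
    }

log = [
    {"kind": "algorithm", "fixed_first_try": True, "cycles": 1},
    {"kind": "algorithm", "fixed_first_try": False, "cycles": 3},
]
print(fix_metrics(log, "algorithm"))
# {'frequency': 2, 'first_fix_rate': 0.5, 'avg_cycles': 2.0}
```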
Cross-Project Learning Transfer
Here’s where dogfooding gets interesting. The violation patterns stored from Project A inform code generation for Project B.
When an agent starts generating code for a new project, the system queries Memory MCP for relevant violation patterns. “This agent, when writing authentication modules, tends to duplicate hashing algorithms across files. Pre-load the shared config pattern.”
The pre-loaded pattern goes into the agent’s context as a preventive instruction: “When implementing hashing, define the algorithm and parameters in a single config location and import from there. Do not duplicate hashing logic across modules.”
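Turning stored patterns into preventive instructions can be sketched as a frequency filter: only patterns that recur across projects earn a standing rule. The record shape and the threshold of three are my assumptions:

```python
from collections import Counter

def preventive_instructions(records: list[dict], min_occurrences: int = 3) -> list[str]:
    """Convert recurring violation patterns from past projects into
    preventive instructions for a new agent context. `records` stands
    in for the result of a Memory MCP query."""
    counts = Counter(r["pattern"] for r in records)
    fixes = {r["pattern"]: r["recommended_fix"] for r in records}
    return [
        f"Known tendency: {p}. Preventive rule: {fixes[p]}"
        for p, n in counts.most_common()
        if n >= min_occurrences
    ]
```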
This is transfer learning at the prompt level. The agent doesn’t retrain. Its weights don’t change. But its context includes lessons learned from previous projects, which shapes its output before the quality analyzer even runs.
The transfer works because connascence patterns are largely project-independent. Connascence of algorithm is the same problem whether you’re building an auth system or a data pipeline. The specific modules change, but the structural pattern — duplicated logic that must stay synchronized — is universal.
Agent Learning Curves
I’ve tracked agent performance over time, and the data shows a clear learning curve.
Week 1: Average coupling score per generated module: 14.2. First-fix rate: 45%. Average convergence cycles: 2.8.
Week 4: Average coupling score: 11.7. First-fix rate: 62%. Average convergence cycles: 2.1.
Week 8: Average coupling score: 8.3. First-fix rate: 78%. Average convergence cycles: 1.4.
Week 12: Average coupling score: 6.9. First-fix rate: 84%. Average convergence cycles: 1.2.
The coupling score dropped 51% over 12 weeks. The first-fix rate nearly doubled. The average number of revision cycles dropped by more than half.
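The headline figures reduce directly from the weekly numbers above:

```python
# Weekly metrics from the text: (coupling score, first-fix rate, cycles)
weeks = {1: (14.2, 0.45, 2.8), 4: (11.7, 0.62, 2.1),
         8: (8.3, 0.78, 1.4), 12: (6.9, 0.84, 1.2)}

score_drop = (weeks[1][0] - weeks[12][0]) / weeks[1][0]
print(f"{score_drop:.0%}")  # 51%

fix_gain = weeks[12][1] / weeks[1][1]
print(f"{fix_gain:.2f}x")   # 1.87x, i.e. nearly doubled
```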
These improvements happened without changing the model. Same model, same weights, same capabilities. The difference is accumulated context — the agent has seen 12 weeks of its own mistakes and their corrections. Each mistake, once corrected and stored, becomes a pattern that prevents the same mistake in future generations.
The learning curve isn’t infinite. It plateaus around weeks 10-12 for most violation types. By then the agent has learned the common patterns, and the remaining violations are genuinely novel situations where coupling is hard to avoid. That plateau is useful information too — it tells me the baseline quality level I can expect from this agent on this type of work.
Fix Success Rate Tracking
Not all fixes are successful. Sometimes the agent’s revision makes things worse — it reduces one type of coupling while introducing another. Sometimes the revision is correct but incomplete — it fixes the flagged coupling but misses a related instance.
I track fix success rate across three dimensions:
Correctness: Did the revision actually reduce the flagged coupling? Success rate: 89%.
Completeness: Did the revision address all instances of the flagged pattern, or just the one mentioned in the feedback? Success rate: 72%. This is the weakest dimension — agents tend to fix the specific instance mentioned in feedback without checking for similar instances elsewhere in the file.
Regression: Did the revision introduce new coupling while fixing old coupling? Regression rate: 11%. Most regressions are connascence of name — the agent renames a function in one place without updating all callers.
The completeness gap is the most actionable finding. It tells me that feedback needs to be explicit about scope: “This pattern appears in 3 locations in this file. Fix all 3, not just the one shown above.” Adding that instruction improved completeness from 72% to 85%.
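Making feedback scope-explicit can be sketched as a count-and-enumerate step over the file before the feedback is sent. The regex-based matching is my assumption about how repeated instances might be located:

```python
import re

def scoped_feedback(source: str, pattern: str, message: str) -> str:
    """Append an explicit scope statement when the flagged pattern
    occurs more than once, so the agent fixes every instance rather
    than only the one quoted in the feedback."""
    lines = [i + 1 for i, line in enumerate(source.splitlines())
             if re.search(pattern, line)]
    if len(lines) <= 1:
        return message
    locs = ", ".join(map(str, lines))
    return (f"{message} This pattern appears in {len(lines)} locations "
            f"(lines {locs}). Fix all of them, not just the one shown above.")

src = "h = bcrypt(pw, 12)\nx = 1\nh2 = bcrypt(pw2, 12)\n"
print(scoped_feedback(src, r"bcrypt\(", "Duplicated bcrypt parameters."))
```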
Why This Matters Beyond My System
Dogfooding is underrated in AI systems because most people think of AI as a tool, not a product. You use a tool. You improve a product. The distinction matters.
When you treat your AI as a tool, you accept its output quality as fixed. The model generates what it generates. If it’s not good enough, you switch to a better model.
When you treat your AI as a product, you instrument it, measure it, and improve it. The model’s base capability is the starting point, not the ceiling. Feedback loops raise the ceiling over time.
The connascence analyzer is one quality tool. Code review tools, test coverage analyzers, security scanners, and performance profilers all work the same way. Run the tool on the AI’s output, feed the results back to the AI, and watch the output improve.
The key constraint is that the feedback must be specific and actionable. “Your code has problems” doesn’t help. “Your code has connascence of algorithm between these two functions, and here’s how to fix it” does. The quality of the feedback loop determines the rate of improvement.
The Recursive Angle
There’s a recursive element that I find genuinely interesting. The connascence analyzer is itself code. Some of that code was generated by AI agents. Those agents were improved by the connascence analyzer.
The tool tests the maker that tests the tool.
This isn’t circular reasoning — it’s convergent quality. Each cycle tightens the feedback. Better analyzer code catches more violations. Catching more violations produces better feedback. Better feedback produces better code generation. Better code generation produces better analyzer code.
The convergence has a limit. The system can’t improve itself beyond the quality ceiling set by the underlying model and the quality metrics being measured. But within those bounds, the improvement is real and measurable.
I’ve watched the connascence analyzer’s own codebase improve through this process. Early versions had connascence of position in their test fixtures (tests depended on argument order). The analyzer flagged its own tests. The AI agent fixed them. Now the analyzer’s tests use named parameters exclusively. The tool ate its own dog food and came out healthier.
Building AI systems that improve themselves? I help teams design feedback loops between AI output and quality analysis. Book a call to discuss self-improving AI architectures.