Content Pipeline: YouTube to Blog in 11 Phases
A pipeline that turns 20+ hours of weekly AI content consumption into synthesized blog posts using a 3-model Byzantine debate and quality gates.
Every week, I watch 20+ hours of AI content. The pipeline turns that into blog posts — not by summarizing, but by synthesizing through a 3-model Byzantine debate.
Summarization is the cheap version of this problem. You take a video transcript, hand it to an LLM, and get back a shorter version of what was said. The result reads like a Wikipedia stub. No original thinking, no connections between sources, no voice.
Synthesis is what I actually want. Take 20 hours of content from 15 different creators, find the threads that connect them, identify where they agree, where they disagree, and where nobody is looking. Then write something new from that raw material.
That’s an 11-phase pipeline.
Phase 1: Curation
Not everything is worth processing. The pipeline starts with a curated list of YouTube channels and RSS feeds. I maintain the list manually — about 40 channels covering AI engineering, ML research, developer tooling, and adjacent topics.
Each week, new videos from these channels land in a queue. The queue has roughly 60-80 entries. The first filter is duration — anything under 5 minutes is usually promotional, anything over 3 hours is usually a livestream recording. Both get deprioritized. The sweet spot is 15-45 minute focused talks.
After duration filtering, the queue is typically 30-40 videos. That’s still too many to process fully, so a second filter uses the video title and description to estimate relevance. Anything directly related to my active topics (code governance, multi-agent systems, AI infrastructure) gets priority.
The result is 15-25 videos queued for full processing each week.
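The two filters above can be sketched in a few lines. This is a simplified model, not the real pipeline code: `Video`, `ACTIVE_TOPICS`, and the relevance heuristic are illustrative names, and the sketch drops out-of-range videos outright rather than merely deprioritizing them.

```python
from dataclasses import dataclass

@dataclass
class Video:
    title: str
    description: str
    duration_min: float

# Hypothetical topic list; the real one tracks my active writing topics.
ACTIVE_TOPICS = ("code governance", "multi-agent", "ai infrastructure")

def prioritize(queue):
    """Filter out likely promos (<5 min) and livestreams (>3 h), then rank
    the rest by how many active topics the title/description mention."""
    keep = [v for v in queue if 5 <= v.duration_min <= 180]

    def relevance(video):
        text = f"{video.title} {video.description}".lower()
        return sum(topic in text for topic in ACTIVE_TOPICS)

    return sorted(keep, key=relevance, reverse=True)
```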
Phase 2: Download
yt-dlp handles the download. It’s the only tool I trust for this — robust, well-maintained, handles every edge case YouTube throws at it.
The pipeline downloads audio-only in the highest available quality. Video frames aren’t needed for transcription, and audio-only downloads are 10x smaller and 5x faster. Each download is stored with metadata: channel, title, upload date, duration, and URL for citation.
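The download step amounts to a handful of standard yt-dlp flags. A minimal command builder, assuming a hypothetical output directory layout:

```python
def ytdlp_command(url, out_dir="audio"):
    """Audio-only download plus a metadata sidecar, using standard yt-dlp flags."""
    return [
        "yt-dlp",
        "--extract-audio",      # drop video frames; transcription only needs audio
        "-f", "bestaudio",      # highest available audio quality
        "--write-info-json",    # sidecar with channel, title, upload date, duration, URL
        "-o", f"{out_dir}/%(upload_date)s-%(id)s.%(ext)s",
        url,
    ]
```

The `--write-info-json` sidecar is what carries the citation metadata through the rest of the pipeline.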
Phase 3: Transcription
Two transcription paths depending on content type.
For English-language technical talks with clear audio, Whisper runs locally. The large-v3 model gives word-level timestamps at roughly 10x realtime speed on my hardware. A 30-minute video transcribes in about 3 minutes.
For multi-speaker panels, heavy-accent content, or noisy audio, the pipeline routes to PodBrain. It handles speaker diarization (who said what) better than vanilla Whisper, and its noise cancellation catches things that Whisper hallucinates through.
Every transcript gets a confidence score. Below 0.85 confidence, the transcript is flagged for manual spot-checking. In practice, about 5% of transcripts need review.
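The routing and review logic reduces to two small decisions. A sketch with illustrative field names (the real metadata schema may differ):

```python
def pick_transcriber(meta):
    """Clean single-speaker English goes to local Whisper;
    multi-speaker, noisy, or non-English content goes to the diarization service."""
    if (meta.get("speakers", 1) > 1
            or meta.get("noisy", False)
            or meta.get("language", "en") != "en"):
        return "podbrain"
    return "whisper-large-v3"

def flag_for_review(confidence, threshold=0.85):
    """Transcripts under the confidence threshold get manual spot-checking."""
    return confidence < threshold
```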
Phase 4: Chunking and Tagging
Raw transcripts are chunked into semantic segments. Not fixed-size chunks — topic-boundary chunks. When the speaker shifts from discussing attention mechanisms to discussing deployment patterns, that’s a chunk boundary.
Each chunk gets tagged with topic labels, named entities (people, companies, papers, tools), and sentiment. The tags are what make cross-video synthesis possible. When three different creators talk about the same tool in the same week, those chunks cluster together.
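Topic-boundary chunking can be sketched as a merge over consecutive segments, assuming each segment already carries a topic label from an upstream classifier:

```python
def chunk_by_topic(segments):
    """Merge consecutive segments that share a topic label;
    a topic shift starts a new chunk.
    `segments` is a list of {"topic": ..., "text": ...} in transcript order."""
    chunks = []
    for seg in segments:
        if chunks and chunks[-1]["topic"] == seg["topic"]:
            chunks[-1]["text"] += " " + seg["text"]
        else:
            chunks.append({"topic": seg["topic"], "text": seg["text"]})
    return chunks
```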
Phase 5: Zeitgeist Analysis
This is where it gets interesting. The tagged chunks from all 15-25 videos feed into a weekly zeitgeist analysis. The system identifies:
Convergences: Topics that multiple creators are discussing independently. When 4 out of 15 creators all mention the same paper or technique in the same week, that’s a signal.
Divergences: Points where creators directly contradict each other. Creator A says fine-tuning is dead. Creator B says fine-tuning just got 10x cheaper. Both can’t be right — or maybe both are right in different contexts. Divergences are the best raw material for original writing.
Gaps: Topics that should be discussed but aren’t. If a major paper dropped and zero creators covered it, that’s either irrelevant or an opportunity.
Trends: How this week’s topics compare to the last 4 weeks. Is attention shifting from one area to another? Are certain topics losing interest or gaining momentum?
The zeitgeist analysis produces a structured report. That report is the input to the synthesis phase.
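Convergence detection is the easiest of the four to show: group tagged chunks by topic and keep topics that cross a creator-count threshold. A sketch with illustrative chunk fields:

```python
from collections import defaultdict

def find_convergences(chunks, min_creators=4):
    """Topics that several creators raised independently in the same week.
    Each chunk is assumed to carry a `creator` name and a list of topic tags."""
    creators = defaultdict(set)
    for chunk in chunks:
        for tag in chunk["tags"]:
            creators[tag].add(chunk["creator"])
    return {tag: sorted(who) for tag, who in creators.items()
            if len(who) >= min_creators}
```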
Phase 6: Byzantine Debate
This is the core of the pipeline. Three models — Gemini, Codex, and Claude — each independently analyze the zeitgeist report and produce draft insights.
Why three models? The same intuition behind Byzantine fault-tolerant systems: with three independent voters, a single faulty participant gets outvoted. If one model produces garbage (hallucinates, misinterprets, or generates slop), the other two catch it. Agreement among all three is a high-confidence signal. Disagreement triggers investigation.
Each model receives the same zeitgeist report but different system prompts:
- Gemini is prompted to be the researcher. It focuses on factual accuracy, source attribution, and connecting claims to evidence. Its draft reads like a research brief.
- Codex is prompted to be the contrarian. It looks for unstated assumptions, weak arguments, and points where the conventional wisdom is probably wrong. Its draft reads like a critique.
- Claude is prompted to be the synthesizer. It takes the strongest elements from both the research angle and the contrarian angle and produces a coherent narrative. Its draft reads like an editorial.
The three drafts go through a consensus round. Points that all three models agree on get marked as high confidence. Points where two agree and one dissents get marked as medium confidence with the dissent noted. Points where all three disagree get flagged for my manual review.
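The consensus round's labeling rule is a straight 3/2/1 vote count. A minimal sketch, assuming each point has already been reduced to a per-model endorse/dissent decision:

```python
def consensus_label(endorsements):
    """endorsements: model name -> whether that model's draft supports the point.
    3/3 -> high confidence; 2/3 -> medium, with the dissenter noted;
    anything less -> flagged for manual review."""
    supporters = [m for m, ok in endorsements.items() if ok]
    if len(supporters) == 3:
        return ("high", None)
    if len(supporters) == 2:
        dissenter = next(m for m, ok in endorsements.items() if not ok)
        return ("medium", dissenter)
    return ("manual-review", None)
```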
Phase 7: Draft Composition
The consensus output feeds into draft composition. This is where the raw insights become a blog post with structure, narrative arc, and voice.
The composition model (Claude, in this phase) receives:
- The consensus insights with confidence levels
- My style profile (a YAML file defining voice, banned phrases, paragraph length, sentence structure)
- The target blog post format (hook, problem, insight, framework, implications, CTA)
- Related existing posts (to avoid repetition and enable cross-linking)
The output is a complete first draft, typically 2000-3000 words.
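The composition input is essentially a bundle of those four pieces. A sketch with illustrative field names (the real prompt assembly is more involved):

```python
def composition_input(consensus, style_profile, related_posts):
    """The bundle the composition model receives for one draft."""
    return {
        "insights": consensus,       # points with high/medium confidence labels
        "style": style_profile,      # parsed from the YAML voice profile
        "format": ["hook", "problem", "insight", "framework", "implications", "cta"],
        "related": related_posts,    # existing posts, for cross-linking
    }
```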
Phase 8: Style Enforcement
The draft passes through a style enforcement gate. This is pattern matching against my YAML style profile.
The profile specifies:
- First person, expert-informed, direct assertive voice
- Maximum paragraph length: 3 sentences
- Banned phrases: 26+ terms (the usual AI slop vocabulary)
- Required structure: hook within first 2 paragraphs, CTA in final section
- Citation format: inline links, not footnotes
Violations get corrected automatically. “This article delves into…” becomes a concrete opening statement. Five-sentence paragraphs get split. Missing CTAs get appended.
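Two of those checks (paragraph length and banned phrases) fit in a few lines. The banned list here is a three-item sample, not the real 26+, and the real gate also checks hook placement, CTA presence, and citation format:

```python
import re

# A sample of the banned vocabulary, not the full profile.
BANNED = ("delves into", "in today's fast-paced world", "game-changer")

def style_violations(draft, max_sentences=3):
    """Flag over-long paragraphs and banned phrases in a markdown draft."""
    issues = []
    paragraphs = [p for p in draft.split("\n\n") if p.strip()]
    for i, para in enumerate(paragraphs):
        sentences = re.split(r"(?<=[.!?])\s+", para.strip())
        if len(sentences) > max_sentences:
            issues.append((i, "paragraph exceeds 3 sentences"))
        for phrase in BANNED:
            if phrase in para.lower():
                issues.append((i, f"banned phrase: {phrase}"))
    return issues
```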
Phase 9: Slop Detection
The slop detection gate runs a 4-dimension scoring algorithm on the draft.
Lexical Score (0-25): Counts banned phrases, cliche patterns, and weak qualifiers. “Very unique” scores points. “Actually unique” doesn’t.
Statistical Score (0-40): Measures sentence length variance, vocabulary diversity, and phrase repetition. AI-generated text tends toward uniform sentence length and repeated structural patterns. Human-like text has higher variance.
Structural Score (0-20): Checks paragraph rhythm, section length balance, and transition quality. AI text often produces identically structured paragraphs: an intro sentence, three supporting points, a summary sentence. Real writing is less predictable.
Tonal Score (0-15): Measures hedging language (“it could be argued that”), corporate voice (“we are excited to announce”), and passive construction density.
Total score range: 0-100. Posts must score 30 or lower to pass. Above that threshold, the draft gets recycled back to the composition phase with specific feedback about what triggered the high score.
In practice, first drafts average a 35-40 slop score. After one revision cycle, they’re typically below 25.
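The gate itself is just range validation plus a threshold. A minimal sketch (the four per-dimension scorers are where the real work lives):

```python
def slop_score(lexical, statistical, structural, tonal):
    """Combine the four dimension scores into a 0-100 total,
    rejecting values outside each dimension's range."""
    for value, cap in ((lexical, 25), (statistical, 40), (structural, 20), (tonal, 15)):
        if not 0 <= value <= cap:
            raise ValueError(f"score {value} outside 0-{cap}")
    return lexical + statistical + structural + tonal

def passes_slop_gate(total, threshold=30):
    """At or below the threshold the post ships; above it,
    the draft recycles to composition."""
    return total <= threshold
```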
Phase 10: Visual Art Composition
Each post needs a header image. The pipeline generates one using a 13-dimension composition system:
- Color palette (mood-appropriate)
- Focal point placement (rule of thirds)
- Depth layering
- Texture contrast
- Negative space ratio
- Typography integration
- Brand consistency
- Aspect ratio targeting
- Compression-aware detail
- Dark/light mode compatibility
- Thumbnail readability
- Emotional resonance
- Visual metaphor alignment
The image generation prompt is built from the post’s topic, tone, and key concepts. A post about trading systems gets different visual treatment than a post about content pipelines.
Phase 11: Deployment
The finished post — markdown content plus header image — gets committed to the portfolio repository via git. The commit triggers a Railway auto-deployment. Within 3 minutes of the commit, the post is live on the production site.
The deployment pipeline runs one final validation: it builds the Astro site locally and checks that the new post renders correctly, all internal links resolve, and the content collection schema validates. If the build fails, the commit is rejected and I get an alert.
Why This Pipeline Exists
The pipeline exists because I was drowning in content consumption. I was watching 20+ hours per week and retaining maybe 10% of the useful insights. The rest evaporated.
Now, the pipeline retains everything. Every insight, every citation, every connection between topics gets captured in structured form. My blog posts are better because they synthesize across 15 sources instead of drawing from the 2-3 I could remember.
The Byzantine debate is the piece I’m most proud of. It catches errors I would have missed. When Gemini flags a factual claim that Claude stated confidently, and Codex agrees with Gemini, I know to double-check. That’s happened 11 times in the last month. I was wrong 8 of those 11 times.
The total pipeline runtime is about 45 minutes per post, from raw video to published content. That’s 45 minutes of compute time — I’m not sitting there watching it run. My actual time investment is about 20 minutes: curating the video list, reviewing flagged transcripts, and doing a final read of the published post.
Twenty minutes of my time for a synthesized, fact-checked, style-consistent, slop-free blog post drawn from 20 hours of source material. That’s the trade I’m making.
Building content pipelines or multi-model synthesis systems? I help teams design AI workflows that maintain quality at scale. Book a call to discuss your content architecture.