AI Podcast Production with Versely: 90-Min Episode in 30 Minutes (2026)
The complete 30-minute 2026 workflow to ship a research-backed 90-minute podcast episode with AI: outline, voice clone, multi-host dialogue, cover art, social clips, and RSS publish.
A staggering 619.2 million people listened to a podcast in the last month, up from 506.9 million in 2023, and the global podcast market is projected to grow from $47.75 billion in 2025 to $171.45 billion by 2030 at a 28.9% CAGR (XtendedView, 2026). That is the largest creator-economy land grab still open in audio, and in 2026 the people winning it are not the ones with the best mics — they are the ones who can ship a polished, research-backed 90-minute episode before lunch.
This is the workflow. Thirty minutes, end to end, using Versely as the orchestration layer and a small constellation of AI models behind it. No studio. No editor. No co-host scheduling. One operator, one laptop, one episode shipped to RSS, Spotify, and YouTube before the kettle finishes boiling.
Why AI podcast production hit critical mass in 2026
Three trends collided in the last twelve months and they explain why this is the year solo podcasters can finally outproduce networks.
The economics flipped. The AI-generated podcast host market alone grew from $1.57B in 2025 to $2.04B in 2026 at a 30.1% CAGR, on track for $5.81B by 2030 (Research and Markets, 2026). Multilingual AI dubbing, the single most commercially significant trend, has driven a 28% YoY increase in cross-border listenership for the top 200 US shows, and dubbing an episode now costs $40–$80 versus $400 in 2023.
The tooling matured. 40% of all podcasters now use AI for editing, transcription, or post-production, and that number jumps to 67% among professional creators. 34% of all new podcasts launched in 2026 use AI tools somewhere in production, reducing total production time by up to 70% (Podcast Studio Hire, 2026).
The voice quality crossed the uncanny line. ElevenLabs v3 handles pauses, breathing, and emotional intonation well enough that 41% of Fortune 500 companies are now ElevenLabs customers and "casual listeners often cannot distinguish AI narration from human recording" (The AI Entrepreneurs, 2026). The same applies to multi-host dialogue: NotebookLM's Audio Overviews, which now support 80+ languages, an Interactive Join Mode, and as of March 2026 Cinematic Video Overviews, generate hour-long conversational episodes from documents in one click.
The result is that for the first time, a single 30-minute creator session can credibly produce what used to be a five-person production week. The rest of this post is the working timeline.
The 30-minute Versely podcast workflow at a glance
| Step | Time | Tool | Output |
|---|---|---|---|
| 1. Outline + script | 6 min | Versely agentic chat (Claude/Gemini) | 90-min script, 4 acts × 5 segments |
| 2. Voice clone | 4 min | ElevenLabs v3 via Versely voice library | Cloned host voice + co-host |
| 3. Multi-host dialogue render | 9 min | Versely podcast renderer | Two-host MP3, 90 min |
| 4. Cover art | 3 min | Nano Banana Pro / Imagen 4 | 3000x3000 cover image |
| 5. Episode clips for social | 5 min | Lipsync + auto-captions | 4 short-form videos |
| 6. Publish to RSS + Spotify | 3 min | Versely RSS push | Live episode |
Thirty minutes. Read it as a rough budget — most experienced operators run closer to 22 minutes once they have their voice clone, intro music, and template assets cached.
Step 1: Outline and script in Versely chat (6 minutes)
Open Versely's agentic chat and paste your raw inputs — a couple of articles, a transcript, your own notes, a YouTube link. The chat is configured with Claude 4.5 and Gemini 3 Pro under the hood and treats your uploads as the canonical source of truth for the episode.
The prompt I run, almost verbatim:
Build a 90-minute, two-host podcast outline from these sources. Format: cold open hook (90s), 4 thematic acts of 5 segments each, mid-roll ad break after Act 2, listener Q&A segment, outro CTA. Each segment should be 550–650 spoken words with explicit [HOST_A] and [HOST_B] tags. Add one disagreement per act so the hosts have real tension. Cite each factual claim with a numbered source reference at the end of the segment.
That single prompt gets you a ~12,500-word two-voice script. The "one disagreement per act" line is load-bearing — without it the hosts agree on everything and the episode sounds like a press release. Tension is what keeps a listener past minute 20.
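Before rendering, it is worth checking the script length against the target runtime. The sketch below is a rough sanity check, assuming a conversational pace of about 140 words per minute; that rate and both function names are my assumptions, not Versely figures.

```python
def estimate_minutes(word_count: int, words_per_minute: int = 140) -> float:
    """Approximate spoken runtime in minutes for a script of a given length."""
    return word_count / words_per_minute


def words_per_segment(target_minutes: int, segments: int,
                      words_per_minute: int = 140) -> int:
    """Per-segment word target needed to fill the runtime.

    Ignores the cold open, Q&A, and outro, which add a little on top.
    """
    return round(target_minutes * words_per_minute / segments)


runtime = estimate_minutes(12_500)
per_segment = words_per_segment(target_minutes=90, segments=20)
print(f"{runtime:.0f} min, {per_segment} words per segment")
```

If the estimate comes in well under 90 minutes, raise the per-segment word target in the prompt rather than padding the script by hand.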
Edit pass: skim the outline for hallucinated facts, fix the cold open hook, and lock the CTA. Six minutes including the read-through.
Step 2: Voice clone (4 minutes)
If you have not cloned your voice yet, do it once and never again. ElevenLabs offers two tiers: Instant Voice Cloning needs only 1–2 minutes of clean audio (Starter plan), and Professional Voice Cloning takes 30+ minutes of audio for a near-perfect replica (Creator plan, $22/mo) (ElevenLabs docs, 2026).
For your co-host, you have three good options:
- Clone a second real person. A guest, a business partner, or a creator you collaborate with regularly. Get explicit written consent every time — this is non-negotiable in 2026.
- Use a Versely voice library preset. Versely's AI voice cloning tool ships with 800+ pre-cleared voices in 48+ languages, including 11 emotion modes for per-segment expression. Pick a voice with a complementary timbre to your own (warm baritone pairs well with a brighter alto).
- Generate a brand-new synthetic voice from a text persona description. This is the route most solo operators take when they do not want to feature a second real person.
The four-minute budget assumes you already have your own voice cloned and you spend that time only selecting and previewing the co-host.
Step 3: Multi-host dialogue render (9 minutes)
Drop the script from Step 1 into Versely's podcast renderer with both voices assigned. The renderer parses [HOST_A] and [HOST_B] tags, allocates each segment to the correct voice, inserts the breathing and pacing markers that v3 reads natively, and queues the render.
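To make the tag handling concrete, here is a minimal sketch of how a tagged script can be split into speaker turns and mapped to voices. It illustrates the idea only and is not Versely's actual parser; the voice IDs are placeholders.

```python
import re
from typing import List, Tuple

TAG = re.compile(r"\[(HOST_[AB])\]")


def split_turns(script: str) -> List[Tuple[str, str]]:
    """Split a tagged script into (host, text) turns in reading order."""
    # With one capture group, re.split yields [preamble, tag, text, tag, text, ...]
    parts = TAG.split(script)
    turns = []
    for host, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:  # skip empty turns such as two tags in a row
            turns.append((host, text))
    return turns


# Placeholder voice IDs; substitute whatever your voice library exposes.
VOICES = {"HOST_A": "my-cloned-voice", "HOST_B": "library-cohost"}

script = "[HOST_A] AI hosts are ready. [HOST_B] I'm not convinced. [HOST_A] Let's test it."
for host, text in split_turns(script):
    print(VOICES[host], "->", text)
```

Running a check like this on the script before upload is a cheap way to catch a missing or misspelled tag, which would otherwise hand a whole passage to the wrong voice.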
A 90-minute two-host episode renders in roughly 6–8 minutes of wall-clock time on the standard pipeline. While it runs, you can switch tabs and start Step 4.
Two production notes that will save you a re-render:
- Lock the mood per act. Use Versely's per-segment emotion controls to set Act 1 to "curious," Act 2 to "energetic," Act 3 to "reflective," Act 4 to "decisive." This is the single biggest factor in whether your AI episode sounds like a podcast or like a synthesized audiobook.
- Insert real pauses, not commas. v3 honors explicit [PAUSE 0.8s] markers far more naturally than punctuation-only pacing. Add one between every back-and-forth and before every act transition.
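Adding pause markers by hand across a 90-minute script is tedious, so a small preprocessing pass helps. This is a sketch that assumes the script uses the [HOST_A] and [HOST_B] tags from Step 1; the function name is my own, and it only covers the back-and-forth case, not act transitions.

```python
import re


def add_pauses(script: str, seconds: float = 0.8) -> str:
    """Insert a pause marker before every host tag except the first."""
    marker = f"[PAUSE {seconds}s]"
    seen_first = False

    def repl(match: re.Match) -> str:
        nonlocal seen_first
        if not seen_first:
            # Leave the opening tag alone so the episode does not start on a pause.
            seen_first = True
            return match.group(0)
        return f"{marker} {match.group(0)}"

    return re.sub(r"\[HOST_[AB]\]", repl, script)


print(add_pauses("[HOST_A] Hi. [HOST_B] Hello."))
```

Run it once on the final script; running it twice would stack two markers before each turn.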
If you want a deeper voice-side primer, see how to clone your voice with AI and the 2026 voice and music models roundup.
Step 4: Cover art (3 minutes)
Episode-specific cover art lifts CTR on Spotify and Apple Podcasts roughly 15–25% over a static show-level image. Generate one per episode.
Open Versely's text-to-image tool, select either Nano Banana Pro (best for typography integration) or Imagen 4 Ultra (best for photographic realism), and run a prompt like:
Podcast cover art, square 1:1, two abstract microphones forming a face silhouette, deep navy background, electric coral accent, episode number "47" in bold sans-serif top-right, white show wordmark bottom, high contrast, suitable for thumbnail at 64px
Generate four variants, pick the strongest, upscale to 3000x3000, and export. Three minutes flat including the rejected variants.
Step 5: Episode clips for social (5 minutes)
A 90-minute episode is the long-form anchor; the social clips are how anyone discovers it. The goal is four 30–60 second clips per episode — one per platform — and Versely's pipeline produces them in parallel.
The flow:
- Open the rendered MP3 in Versely. Auto-transcribe (Scribe-class transcription is included). Mark four high-energy moments — usually the cold open, the per-act disagreements, and one big payoff line.
- For each clip, send the corresponding voice segment to the AI lipsync tool paired with a talking-head asset (your real face from a 30-second selfie video, or a Versely avatar). The lipsync engine will conform the avatar to the cloned audio.
- Drop the four lipsync videos into the AI caption generator for word-by-word burned-in captions, which are necessary because 85% of social video plays muted.
- Export in 9:16 for TikTok / Reels / Shorts and 1:1 for LinkedIn / Twitter.
You can run this entire step inside a single Versely movie maker project, which keeps assets and captions synchronized. Five minutes for four clips because the lipsync renders run in parallel.
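If you ever need to take the captions into another editor, word-by-word captions are straightforward to express as a standard SRT file. The sketch below assumes you can export per-word timings as (word, start, end) tuples in seconds; that input shape is hypothetical and not a documented Versely export.

```python
from typing import List, Tuple


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: List[Tuple[str, float, float]]) -> str:
    """Build one SRT cue per word from (word, start, end) timings."""
    cues = []
    for i, (word, start, end) in enumerate(words, start=1):
        cues.append(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{word}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


timings = [("Tension", 0.0, 0.42), ("keeps", 0.42, 0.70), ("listeners", 0.70, 1.25)]
print(words_to_srt(timings))
```

One cue per word gives the fast "karaoke" style common on short-form video; grouping two or three words per cue is an easy variation if single words flash by too quickly.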
Step 6: Publish to RSS, Spotify, and beyond (3 minutes)
Versely's podcast publisher handles the mechanical part: ID3 tagging, RSS feed update, chapter markers, and direct push to Spotify for Podcasters, Apple Podcasts Connect, Amazon Music, and YouTube as a video podcast (auto-rendered against the cover art with waveform animation).
Three things you should always set before the publish button:
- Chapter markers at every act break. Listeners use them, and Spotify's algorithm rewards completion-by-chapter.
- Episode-level transcript uploaded with the episode. In 2026 this is an SEO and discoverability requirement on Apple and Spotify.
- One pinned social clip scheduled to drop within 30 minutes of episode publish, via the Versely scheduler.
Done. Three minutes. The RSS push propagates to most directories within 15 minutes.
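For readers curious about what an RSS feed update actually contains, here is a minimal sketch of a single RSS 2.0 episode entry built with Python's standard library. The URL, GUID, and file size are placeholder values, and a real feed adds more fields (description, duration, artwork) that are omitted here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_item(title: str, audio_url: str, size_bytes: int,
               published: datetime, guid: str) -> ET.Element:
    """Build a bare-bones RSS 2.0 <item> for one podcast episode."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "guid", isPermaLink="false").text = guid
    # RSS expects RFC 822 style dates, which format_datetime produces.
    ET.SubElement(item, "pubDate").text = format_datetime(published)
    # The enclosure points directories at the audio file itself.
    ET.SubElement(item, "enclosure", url=audio_url,
                  length=str(size_bytes), type="audio/mpeg")
    return item


item = build_item(
    title="Episode 47",
    audio_url="https://example.com/audio/ep47.mp3",  # placeholder URL
    size_bytes=86_400_000,                           # placeholder file size
    published=datetime(2026, 1, 15, 9, 0, tzinfo=timezone.utc),
    guid="ep47",
)
print(ET.tostring(item, encoding="unicode"))
```

Directories poll the feed and pick up new items on their own schedule, which is why a push takes minutes rather than seconds to show up everywhere.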
Cost math: AI episode vs traditional production
This is where the workflow stops being a curiosity and starts being a strategy.
Traditional 90-minute episode:
- Audio engineer / editor: $250–$500 per episode for a well-produced 30-min show, scaling to roughly $750–$1,500 for 90 minutes (Podcast Studio Glasgow, 2026)
- Studio time: $100–$300
- Cover art designer: $50–$200
- Social clip editor: $150–$400
- Host time: 6–10 hours at any reasonable hourly rate
- Total per episode: $1,000–$2,500 plus your time
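A business publishing daily episodes the traditional way spends $60,000–$216,000 per year on production alone (Chandler Nguyen, 2026). That figure works out to roughly $165–$590 per episode, so it presumably describes shorter or more lightly produced shows; at the $1,000–$2,500 per-episode range above, 365 episodes would cost $365,000–$912,500.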
AI workflow with Versely:
- Voice generation (~12,500 words at v3 rates): ~$3.50
- Cover art (4 variants + upscale): ~$1.20
- Lipsync clips (4 × 60s): ~$2.40
- Transcription, RSS push, scheduling: included in plan
- Total per episode: under $10
- Total operator time: 30 minutes
The same business producing daily episodes via the AI workflow lands at roughly $1,800–$3,700 per year all in ($5–$10 per episode across 365 episodes). That is not an incremental cost saving; that is a structural change to who can run a daily show.
| Production | Per episode | Per year (daily show) | Operator hours |
|---|---|---|---|
| Traditional studio | $1,000–$2,500 | $365K–$912K | 6–10 hrs/ep |
| AI workflow (Versely) | $5–$10 | $1.8K–$3.7K | 0.5 hr/ep |
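The per-episode totals can be checked by adding up the line items listed above. This is a quick worked calculation using only the figures already quoted in this section, with a 365-episode year assumed for the daily show.

```python
# Per-episode AI line items, in dollars, as listed above.
AI_LINE_ITEMS = {
    "voice generation": 3.50,
    "cover art": 1.20,
    "lipsync clips": 2.40,
}

# Traditional line items as (low, high) dollar ranges; host time is excluded.
TRADITIONAL_LINE_ITEMS = {
    "audio engineer / editor": (750, 1500),
    "studio time": (100, 300),
    "cover art designer": (50, 200),
    "social clip editor": (150, 400),
}

ai_per_episode = sum(AI_LINE_ITEMS.values())
ai_per_year = ai_per_episode * 365
trad_low = sum(low for low, _ in TRADITIONAL_LINE_ITEMS.values())
trad_high = sum(high for _, high in TRADITIONAL_LINE_ITEMS.values())

print(f"AI: ${ai_per_episode:.2f}/episode, ${ai_per_year:,.2f}/year at 365 episodes")
print(f"Traditional: ${trad_low:,}-${trad_high:,}/episode before host time")
```

The itemized AI spend comes to $7.10 per episode, which sits inside the $5–$10 band in the table, and the traditional items sum to $1,050–$2,400, which the text rounds to $1,000–$2,500.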
The headline number from one AI podcast startup: a planned $1 cost per episode at 3,000 episodes per week across 5,000 shows (Hollywood Reporter, 2026). That is the floor the market is moving toward.
Why this workflow is faster in Versely than in a stack of point tools
You can absolutely run this workflow by gluing NotebookLM, ElevenLabs Studio, Descript, Midjourney, CapCut, and Buzzsprout together. Plenty of people do. The reason we built Versely is that the glue is where the half-hour becomes a half-day.
The points of friction Versely removes:
- One asset graph. Voice clones, generated images, lipsync clips, and the episode MP3 live in the same library. No re-uploading the same file to four tools.
- Native batch. Every step — image, voice, lipsync, caption — runs in parallel batches with one click. In a point-tool stack you wait for one render to start the next.
- One billing plane. v3 voice, Imagen 4, Nano Banana Pro, Suno music, and Scribe transcription are unified under your Versely usage. No per-tool credit reconciliation.
- Publish-side hooks. The RSS, Spotify, Apple, YouTube, and social-scheduler integrations live in the same UI as generation. You finish rendering and the publish button is in the same window.
If you want to see the rest of the audio-side toolchain, the AI tools for podcasters 2026 post breaks each component down individually.
FAQ
Q: Can listeners tell when a podcast is AI-generated? By 2026, voice quality has reached a point where casual listeners often cannot distinguish AI narration from human recording, and TTS models now handle subtle cues like laughter, hesitation, and emphasis (CapCut, 2026). The tell-tale signs that remain are unnatural pacing and "narrator voice." Both are solved by inserting explicit pause markers and locking emotion-per-act before render.
Q: Is using an AI voice on a podcast legal and ethical? Cloning your own voice is straightforward. Cloning anyone else's voice requires explicit, documented consent — the platforms enforce this, and Versely's voice library only ships pre-cleared synthetic voices. For disclosure, most major directories now recommend (and some require) a brief "this episode uses AI-generated audio" line in the show notes.
Q: How long does a real 90-minute episode actually take? Operator time is roughly 22–35 minutes for someone with a cached voice clone, intro music, and cover-art template. The first episode you produce will take longer (closer to 90 minutes) because you are also setting up the voice library and the publishing connections. After episode 3 you will be at the 30-minute median.
Q: Do I need to record any of my own audio? Only the initial 1–2 minute sample for your voice clone (Instant Voice Cloning) or 30+ minute sample for Professional Voice Cloning. After that, no microphone is required for the workflow described above. Many creators do still record live cold opens and use the AI workflow only for the main body — that is a hybrid most listeners cannot detect.
Q: Will this workflow get an episode demonetized or removed from Spotify? No, provided the content is original, you own the voice rights, and you do not impersonate a real person without consent. Spotify, Apple Podcasts, and YouTube all explicitly allow AI-narrated podcasts in 2026 — what they do not allow is voice cloning of public figures without authorization. The same rules that apply to a deepfake video apply to a deepfake voice.
Ship your first episode this afternoon
The headline of this post is the 30-minute number. The real story is the cost line: under $10 per 90-minute episode, end to end, with a one-person operator. That changes which podcast ideas are viable.
If you have been sitting on a podcast concept for six months because you could not justify the production cost, the math is no longer the obstacle. Open Versely, paste your source material into chat, and the first episode is in your RSS feed before dinner.
Start with the text-to-image tool for your cover, the AI voice cloning tool for your hosts, and the AI lipsync tool for your social clips. The full pipeline is wired through Versely's agentic chat.
Sources:
- Podcast Statistics 2026: Growth, Trends, and Insights — XtendedView
- AI-Generated Podcast Host Global Market Report — The Business Research Company
- How to Grow, Monetize & Scale Your Podcast in 2026 Using AI — Podcast Studio Hire, London
- ElevenLabs in 2026: The Complete Guide to v3 — The AI Entrepreneurs
- Professional Voice Cloning Documentation — ElevenLabs
- AI Podcast vs. Traditional Production Cost Breakdown — Chandler Nguyen
- AI vs Traditional Podcast Editing 2026 — Podcast Studio Glasgow
- Best AI Tools for Podcast Dialogue 2026 — CapCut
- 5,000 Podcasts at $1 Cost Per Episode — Hollywood Reporter