How to Make an AI VSL (Video Sales Letter) in 2026
Build a converting AI video sales letter in 2026: hook-pain-promise structure, voiceover pacing, 3-5s cuts, social proof segments and where to place the CTA.
A VSL — video sales letter — is the highest-converting long-form sales asset on the internet, and AI just dropped its production cost by two orders of magnitude. In 2026, a single operator can ship a 7-minute VSL on a Tuesday, run paid traffic to it Wednesday, and have a conversion-rate read by Friday. The old "two-week studio shoot, $15k creative budget" math is dead.
Target for this workflow: a 6-8 minute VSL with cuts every 3-5 seconds, 800-1100 word script, custom voiceover, AI-generated b-roll on every line, and a CTA placement strategy that doesn't leak conversions. Total production time: one day. Total tooling cost: under $40.
Step 1: Brief the offer before you brief the video
A VSL is downstream of the offer. If the offer is weak, no production craft will save you. Lock the offer first.
VSL brief
- Avatar: <single ICP, named>
- Pain (current state): <visceral, sensory sentence>
- Promise (desired state): <specific outcome, measurable>
- Mechanism (the unique "how"): <one sentence, contrarian if possible>
- Proof: <3 case studies, 3 numbers, 1 credentialed source>
- Offer: <price, payment terms, bonuses, guarantee>
- Risk reversal: <specific, asymmetric>
- CTA: <single, repeated 3x>
- Length target: 6-8 minutes
- Traffic source: <cold paid / warm email / retargeting>
The brief is the entire creative. The script writes itself once these slots are filled. Skip the brief and you'll script for 4 hours and rewrite for 8 more.
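One way to enforce "lock the brief first" is to treat the brief as a small data structure and refuse to start scripting while any slot is empty or still holds its angle-bracket placeholder. The sketch below is illustrative only: the class name `VSLBrief`, the field names and the helper `missing_slots` are hypothetical, chosen to mirror the slots in the brief above.

```python
from dataclasses import dataclass, fields


@dataclass
class VSLBrief:
    # One field per slot in the brief above (names are illustrative).
    avatar: str
    pain: str
    promise: str
    mechanism: str
    proof: str
    offer: str
    risk_reversal: str
    cta: str
    traffic_source: str
    length_target: str = "6-8 minutes"


def missing_slots(brief: VSLBrief) -> list[str]:
    """Return the names of slots that are empty or still a <placeholder>."""
    missing = []
    for f in fields(brief):
        value = getattr(brief, f.name).strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(f.name)
    return missing
```

If `missing_slots(brief)` returns anything, the offer is not locked yet and the script prompt should wait.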
Prompt template for the script:
You are writing an 8-minute VSL for <product> targeting <avatar>.
Structure: HOOK (30s) -> PAIN (90s) -> PROMISE (60s) ->
MECHANISM (90s) -> PROOF (90s) -> OFFER (60s) -> RISK REVERSAL (30s)
-> CTA (30s). Pacing: sentence length under 12 words, one idea
per sentence, second-person, conversational, no jargon. The
mechanism must be contrarian — explicitly name the thing the
prospect is currently doing wrong. Output as timestamped sections
with line breaks between every sentence so the editor can match a
b-roll clip per line.
The "line break between every sentence" rule is what makes step 3 fast. Each sentence becomes a b-roll prompt.
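The "one sentence, one b-roll clip" idea can be sketched as a small script that splits a section into sentences and flags any that break the 12-word pacing rule. This is a rough sketch, not a production parser: the function names are hypothetical, and the sentence splitter is deliberately naive (it splits on `.`, `?` or `!` followed by whitespace, so a sentence ending in a closing quote will not be split).

```python
import re

MAX_WORDS = 12  # pacing rule from the prompt template


def split_sentences(section_text: str) -> list[str]:
    """Naive split on ., ? or ! followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", section_text.strip())
    return [p for p in parts if p]


def shot_list(section: str, text: str) -> list[dict]:
    """One shot entry per sentence, with a flag for over-long lines."""
    shots = []
    for i, sentence in enumerate(split_sentences(text), start=1):
        words = len(sentence.split())
        shots.append({
            "section": section,
            "line": i,
            "sentence": sentence,
            "words": words,
            "too_long": words > MAX_WORDS,
        })
    return shots
```

Each entry in the returned list is a row in the storyboard: the sentence becomes the seed for a b-roll prompt, and any row with `too_long` set is a candidate for a rewrite before generation starts.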
Step 2: Script the structure that converts in 2026
Sample script for a fictional sleep supplement called "Drift":
HOOK (0-30s): If you wake up at 3am and stare at the ceiling, you're not broken. You're undersupplied. Here's the 90-second routine that fixed it for 11,000 people last year — and why every sleep app you've tried made it worse.
PAIN (30-120s): You go to bed exhausted. You fall asleep fast. And then, like clockwork, your eyes snap open at 3:17am. You reach for your phone. You tell yourself "just five minutes." It's 4:40am. Your alarm goes off in 90 minutes. You feel hollow before the day even starts.
PROMISE (120-180s): What if the 3am wake-up wasn't anxiety, wasn't stress, wasn't "getting older" — but a 4-cent mineral deficiency that 67% of adults have and don't know about?
MECHANISM (180-270s): Every sleep app, every meditation course, every "wind down routine" treats sleep as a behavior problem. It's not. It's a chemistry problem. And the specific chemistry — magnesium glycinate paired with apigenin at a 3:1 ratio — is what 40 years of clinical data point to. Nobody's talking about it because nobody can patent a mineral.
PROOF (270-360s): Sarah, 41, two kids: slept through the night for the first time in 6 years on day 4. Marcus, 58, executive: cut his Ambien dependency in 11 days. The clinical data: 3 randomized trials, 1,200 subjects, 73% sleep-through-night rate at week 4.
OFFER (360-420s): One bottle, 30 days, $39. Two bottles, $69 (save $9). The "first night" guarantee: if it doesn't hit on night one, the second bottle is free.
RISK REVERSAL (420-450s): 90-day money back, no questions, no return shipping. Keep the bottle. We've issued 12 refunds in 14 months.
CTA (450-480s): Tap the button. Pick the 30-day or 60-day. Ship today. Sleep through the night by Saturday.
That's the spine. Now storyboard a single visual for every sentence.
Step 3: Generate b-roll for every sentence — model picks per beat
The single biggest mistake in AI VSL production is using one model for the whole video. The result is a 7-minute video that looks like it came from a single seed, and the prospect's eyes glaze over by minute 2.
Rotate models by section purpose:
- HOOK b-roll: text-to-image with Midjourney v7 for the opening still ("a hand reaching for a phone in the dark, 3am clock visible"), animated with Kling 3.0 image-to-video for a slow push-in. Then VEO 3.1 for the "11,000 people" cinematic — a wide shot of a quiet bedroom at dawn.
- PAIN b-roll: AI b-roll generator with Wan 2.7 for stylized realism — the recurring "phone glow on a face" shot, the alarm clock, the empty coffee mug at 5am. Lean into the sensory: every pain sentence gets a tactile visual.
- MECHANISM b-roll: SORA 2 for any "physics-believable" shots — a pill dissolving in water, a cross-section of a brain receptor lighting up, mineral crystals under macro light. SORA 2 earns its credits on these specific shots.
- PROOF b-roll: Ideogram 3 for in-image typography (the "73% sleep-through-night" stat as a clean kinetic title card), and text-to-image with Flux 1.2 Ultra for the "Sarah, 41" avatar stills (clearly stylized to avoid faking real testimonials).
- OFFER b-roll: A clean product render generated with Midjourney v7 + Hailuo for a soft 360 turntable. The product is the visual anchor for the entire offer block.
- CTA b-roll: A single static product hero with a soft pulse animation. No motion competing with the viewer's attention to the button.
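The rotation above can be kept as a lookup table and used to sanity-check a shot plan before spending credits. The sketch below is an assumption about how one might organize this: the model names come from the list above, while `MODELS_BY_SECTION` and `review_plan` are hypothetical names, and the plan is assumed to be a list of `(section, model)` pairs.

```python
# Model picks per section, copied from the list above.
MODELS_BY_SECTION = {
    "HOOK": ["Midjourney v7", "Kling 3.0", "VEO 3.1"],
    "PAIN": ["Wan 2.7"],
    "MECHANISM": ["SORA 2"],
    "PROOF": ["Ideogram 3", "Flux 1.2 Ultra"],
    "OFFER": ["Midjourney v7", "Hailuo"],
    "CTA": [],  # no model is named for the static product hero
}


def review_plan(plan: list[tuple[str, str]]) -> list[str]:
    """Warn about single-model plans and off-list model picks."""
    warnings = []
    if len(plan) > 1 and len({model for _, model in plan}) <= 1:
        warnings.append("one model used for the whole video")
    for section, model in plan:
        allowed = MODELS_BY_SECTION.get(section, [])
        # Sections with no listed picks are not checked.
        if allowed and model not in allowed:
            warnings.append(f"{section}: {model} is not a listed pick")
    return warnings
```

An empty list means the plan rotates models and stays within the picks for each beat.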
Sample prompts:
HOOK — Midjourney v7:
"Cinematic wide shot of a person sitting up in bed at 3am, blue
phone light on their face, dark bedroom, photorealistic, shallow
depth of field, 35mm, somber editorial mood."
PAIN — Wan 2.7:
"Close-up macro shot of an iPhone screen showing 4:42am, finger
about to swipe, slight motion blur, warm bedroom light spill,
shallow depth of field, 24fps."
MECHANISM — SORA 2:
"Macro slow-motion shot of a small white tablet dissolving in a
glass of water, particles dispersing in shafts of soft window
light, photorealistic, 4k cinematic."
PROOF — Ideogram 3:
"Clean kinetic title card on a deep navy background reading
'73% sleep-through-night rate at week 4', single source line in
small caps below, modern editorial typography, 16:9."
Generate 2 variants per sentence. You'll use roughly 75% of what you generate; the rest stocks the next VSL.
Step 4: Voiceover, pacing and the lip-sync question
Voice is 60% of VSL conversion. Get it right or burn the budget.
The right voice for the avatar. Direct-response data in 2026 is consistent: prospects convert higher with a voice that sounds like a slightly more polished version of themselves. A 41-year-old female avatar in a sleep VSL converts better with a 38-45 female voice than with a 28-year-old "professional VO" voice. Match age, gender and energy to the avatar — not to your aesthetic.
Two production paths:
- Clone a real voice once with AI voice cloning on ElevenLabs v3. If you have a presenter (founder, expert, on-camera spokesperson), this is the right call. Re-narrate every variant in 8 seconds.
- Pick a v3 library voice matched to the avatar. For early VSLs without a presenter, this is faster and converts just as well if you pick well.
Pacing rules that hold:
- 155-170 wpm for the PAIN section. Slower than normal — you want the prospect to feel each line.
- 175-185 wpm for the MECHANISM. Slightly faster — the contrarian reveal benefits from momentum.
- 145-155 wpm for the OFFER and CTA. Slowest of all — every word in the price and guarantee must land.
- 0.6s silence before the price reveal. Same psychology as a comedian holding a beat — the silence forces attention.
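The pacing rules translate into a word budget per section once they are combined with the section durations from the script prompt. The sketch below does that arithmetic. Two things are assumptions rather than rules from this article: sections without a stated pace (HOOK, PROMISE, PROOF, RISK REVERSAL) borrow the PAIN range as a default, and the function names are hypothetical.

```python
# Section lengths in seconds, from the script prompt structure.
SECTION_SECONDS = {
    "HOOK": 30, "PAIN": 90, "PROMISE": 60, "MECHANISM": 90,
    "PROOF": 90, "OFFER": 60, "RISK REVERSAL": 30, "CTA": 30,
}

# Words-per-minute ranges from the pacing rules above.
WPM = {
    "PAIN": (155, 170),
    "MECHANISM": (175, 185),
    "OFFER": (145, 155),
    "CTA": (145, 155),
}
DEFAULT_WPM = (155, 170)  # assumption for sections with no stated pace


def word_budget(section: str, pause_seconds: float = 0.0) -> tuple[int, int]:
    """Return (min_words, max_words) for a section, rounded down."""
    low, high = WPM.get(section, DEFAULT_WPM)
    speaking = SECTION_SECONDS[section] - pause_seconds
    return int(speaking * low / 60), int(speaking * high / 60)


def total_budget() -> tuple[int, int]:
    """Sum the per-section budgets, assuming continuous speech."""
    lows, highs = zip(*(word_budget(s) for s in SECTION_SECONDS))
    return sum(lows), sum(highs)
```

For example, `word_budget("OFFER", pause_seconds=0.6)` accounts for the silence before the price reveal. Under these assumptions the continuous-speech total comes out above the 800-1100 word script target, which suggests that target leaves room for pauses and beats where only the music carries.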
Lip sync. Most VSLs in 2026 are voiceover over b-roll: no talking head, no lip sync needed. The exception: if you open with a 5-10 second presenter intro (real founder or AI avatar), use AI lip sync to lock the mouth to the cloned voice. Mismatched lip sync in a sales context kills trust in under 2 seconds, and the bounce never recovers.
Step 5: Music, captions, thumbnail — the conversion finishers
Music. Generate a clean instrumental with an AI music generator using Lyria. Use two beds, switched at the section boundaries:
- PAIN + PROMISE: somber piano + soft pad, -22dB under voice, no drums.
- MECHANISM + PROOF + OFFER: subtle pulse, low synth bass enters at 4:30, builds gently into the OFFER. The energy lift cues the prospect that the resolution is here.
Avoid drums in any VSL. Drums read as commercial; you want editorial trust.
Captions. Burn them in. Bottom third, two-line max, sentence case. Direct-response data in 2026 is unambiguous: VSLs with burned captions convert 18-26% higher than without, even when sound is on. The captions reinforce the line — your prospect reads and hears the same sentence simultaneously.
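The two-line rule is easy to enforce mechanically when the captions are built from the script. The sketch below wraps each sentence and groups the wrapped lines into cards of at most two lines. The 42-character line width is an assumption (it is not a figure from this article), and `caption_cards` is a hypothetical helper name.

```python
import textwrap

MAX_CHARS_PER_LINE = 42  # assumption; tune to your font and frame width
MAX_LINES = 2            # two-line max, per the caption rule above


def caption_cards(sentence: str) -> list[str]:
    """Split one script sentence into caption cards of at most two lines."""
    lines = textwrap.wrap(sentence, width=MAX_CHARS_PER_LINE)
    return [
        "\n".join(lines[i:i + MAX_LINES])
        for i in range(0, len(lines), MAX_LINES)
    ]
```

Because the script already keeps sentences under 12 words, most sentences should fit on a single card; a sentence that spills onto a second card is a hint that it is running long.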
Thumbnail. Critical for paid traffic. Generate three variants with an AI thumbnail generator:
- Avatar's pain state ("3am stare at the ceiling")
- Mechanism reveal ("the 4-cent fix")
- Outcome ("slept through the night by Saturday")
Test all three on the ad surface. The winner is rarely the one you'd pick on aesthetic grounds.
Step 6: Final cut, CTA placement and platform-specific exports
Cuts every 3-5 seconds, no exceptions. Even when the voiceover is steady, the visual changes. The 3-5s cadence is what holds the prospect through 7 minutes of long-form video. A cut every 8 seconds is where bounce starts.
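The cadence rule can be checked against an edit before export. The sketch below assumes you can pull a sorted list of cut timestamps (in seconds) out of your editor; how you export them is tool-specific and not covered here, and `cadence_violations` is a hypothetical helper name.

```python
def cadence_violations(cut_times, min_gap=3.0, max_gap=5.0):
    """Return (start, end, gap) for every shot outside the 3-5 second window.

    cut_times is a sorted list of cut timestamps in seconds.
    """
    violations = []
    for start, end in zip(cut_times, cut_times[1:]):
        gap = end - start
        if gap < min_gap or gap > max_gap:
            violations.append((start, end, gap))
    return violations
```

For example, `cadence_violations([0, 4, 8, 16, 18])` flags the 8-second shot from 8 to 16 and the 2-second shot from 16 to 18.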
CTA placement — three placements, never more, never fewer:
- At the end of the OFFER block (around 7:00). The primary CTA. Button visible below the embedded video on the page.
- At the end of RISK REVERSAL (around 7:30). The secondary nudge — "tap the button now."
- At the final frame (7:45-8:00). Held static for 5 full seconds with the button arrow. The static hold is what gets the click.
Do not place a CTA before the OFFER block. Early CTAs leak conversions to under-warmed prospects who then never come back. Trust the structure: HOOK → PAIN → PROMISE → MECHANISM → PROOF carries the prospect to the OFFER. Don't ask for the click before you've earned it.
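Both CTA rules (exactly three placements, none before the OFFER block) can be expressed as a quick check on the timeline. This is a minimal sketch: the OFFER start time of 360 seconds comes from the sample script timings, and `check_cta_placements` is a hypothetical helper name.

```python
OFFER_START = 360  # seconds; the OFFER block starts at 6:00 in the sample script


def check_cta_placements(cta_times):
    """Return a list of problems with the CTA timestamps (in seconds)."""
    problems = []
    if len(cta_times) != 3:
        problems.append(f"expected 3 CTA placements, found {len(cta_times)}")
    early = [t for t in cta_times if t < OFFER_START]
    if early:
        problems.append(f"CTA before the OFFER block at: {early}")
    return problems
```

With the placements described above, `check_cta_placements([420, 450, 465])` returns an empty list.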
Exports:
- 16:9 1080p for landing page embed (primary).
- 9:16 1080x1920 with re-framed visuals for paid social ad surface.
- 1:1 1080x1080 for Facebook in-feed if you're running Meta ads.
- First 60 seconds as a standalone "ad-version" cut for cold traffic — the full VSL lives on the landing page after the click.
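If the exports are rendered from one 16:9 master with ffmpeg, the list above maps onto a handful of filter strings. The sketch below only builds the argument lists and does not run anything. It makes several assumptions: the master is 16:9, a center crop stands in for the manual re-framing recommended above (a crude substitute), audio is copied unchanged, and the preset names and `ffmpeg_args` helper are hypothetical.

```python
# Video filter per export, assuming a 16:9 master.
PRESETS = {
    "landing_16x9": "scale=1920:1080",
    # Center crop to 9:16, then scale. A stand-in for manual re-framing.
    "social_9x16": "crop=ih*9/16:ih,scale=1080:1920",
    # Center crop to a square, then scale.
    "feed_1x1": "crop=ih:ih,scale=1080:1080",
}


def ffmpeg_args(src, preset, out, max_seconds=None):
    """Build an ffmpeg argument list for one export (does not execute it)."""
    args = ["ffmpeg", "-i", src, "-vf", PRESETS[preset], "-c:a", "copy"]
    if max_seconds is not None:
        # Limit output duration, e.g. 60 for the standalone ad-version cut.
        args += ["-t", str(max_seconds)]
    return args + [out]
```

Passing the result to `subprocess.run` would perform the render; for the ad-version cut, `ffmpeg_args("vsl.mp4", "social_9x16", "ad.mp4", max_seconds=60)` keeps only the first 60 seconds.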
FAQ
How long should a VSL actually be in 2026?
Cold traffic VSLs: 6-9 minutes. Warm traffic (email list, retargeting): 4-6 minutes. The "30-minute long-form VSL" of the 2018 era is dead for most verticals — completion data shows a sharp drop after minute 9 across every offer category. The exception is high-ticket ($2k+) where the prospect needs more time to build conviction; there, 12-18 minutes still works.
Do I need to disclose that the b-roll is AI-generated?
Yes for the platforms that require it (Meta, TikTok, YouTube ads have AI-disclosure flags in 2026). Tag the upload accordingly. The disclosure does not hurt conversion in direct-response data — prospects in 2026 expect AI b-roll and don't penalize it. What hurts conversion: faked human testimonials. Never generate a fake testimonial — use stylized avatar stills, name them clearly as illustrative, or use real (consented) customer footage.
Should I split-test the hook or the offer first?
Hook, every time. The hook decides whether 60-80% of your traffic stays past 30 seconds. A 10% lift on the hook compounds against every downstream metric. Generate 5 hook variants, ship them as standalone ad-cuts, scale the winner. Then split-test the offer.
Can I reuse the same VSL across multiple ad creatives?
The VSL is the destination, not the ad. Build 8-12 short-form ads (30-60s each) that all funnel to the same VSL on the landing page. Each ad tests a different angle (pain frame, mechanism frame, social proof frame) — the VSL is the conversion machine they all feed. Don't burn the VSL by running it as the ad itself.
How often should I refresh the VSL?
Refresh the hook and pain block every 60-90 days as paid traffic fatigues those sections fastest. Keep the MECHANISM, PROOF and OFFER sections stable until conversion drops below your floor. Modular structure (named scenes, separate exports per section) makes a hook refresh a 90-minute job, not a full re-cut.
The compounding asset
A VSL is the single most valuable creative asset you'll build in any direct-response funnel. Every dollar of paid traffic in 2026 routes through one. Build it modularly, instrument it ruthlessly, and refresh the top 30% of the runtime every quarter.
Spin up an AI video generator for the hero shots, an AI b-roll generator for the per-sentence visuals, and AI voice cloning to lock the voice that converts your avatar. For the model selection logic per shot, the guide to the best AI video generation models for 2026 is the companion piece. For the short-form ad creatives that funnel into the VSL, run the viral short-form playbook.
Lock the offer. Ship the VSL. Refresh the hook every 60-90 days.