Comparisons
Wan 2.7 vs LTXV2: Open-Source Video Models Compared 2026
Wan 2.7's Apache 2.0 release with native audio versus LTXV2's speed-quality curve. The honest open-source video comparison for self-hosting teams in 2026.
Wan 2.7 and LTXV2 are the two open-source video models doing real production work on Versely in 2026. Wan 2.7 dropped under Apache 2.0 in April with native audio and built-in voice cloning — a structural shift in what "open-source video" means. LTXV2 holds the speed-quality curve at the bulk tier and is the cheapest finished-second on rented compute. Picking between them isn't a binary — most teams running open infra route both — but the per-shot decision matters for cost and quality. This comparison walks through the capability surface and tells you which model wins per use case on Versely's AI video generator.
Quick verdict
If you need audio in the video — talking content, music-driven clips, anything where sound matters at generation — Wan 2.7. If you need cheapest finished-second on bulk silent b-roll with fast turnaround on mid-range GPUs — LTXV2. Wan 2.7 needs heavier compute (H100 / A100 class for production throughput); LTXV2 runs usefully on a 4090 or A6000. The Apache 2.0 licence on Wan 2.7 is genuinely permissive — fine-tune, redistribute, embed in downstream products with no per-call cost. LTXV2's open weights are similarly licence-clean for commercial use.
Capability comparison at a glance
| Capability | Wan 2.7 | LTXV2 |
|---|---|---|
| Text-to-video | Yes | Yes |
| Image-to-video | Yes | Yes |
| Native audio | Yes (Apr 2026) | No |
| Voice cloning | Yes (built-in, 30s reference) | No |
| Lipsync | Yes (good, behind VEO) | No |
| Reference / multi-image | Yes | Limited |
| First-last-frame | Yes | Limited |
| Extend video | Yes | Yes |
| Max clip length | 10s, extendable | 8s standard |
| Max resolution | 1080p | 1080p |
| Licence | Apache 2.0 | Open weights, commercial-clean |
| Min GPU (production) | H100 / A100 class | A6000 / 4090 viable |
| Inference speed (5s clip on H100) | ~75-110s | ~25-45s |
| Inference speed (5s clip on 4090) | Not viable for production | ~60-90s |
| Per-second cost (rented H100) | ~$0.018 | ~$0.008 |
| Per-second cost (rented A100) | ~$0.012 | ~$0.005 |
LTXV2 is roughly 2-3x cheaper per finished second on equivalent compute and roughly 3x faster on H100. Wan 2.7 brings audio, voice cloning and a broader capability surface that LTXV2 doesn't match. The choice maps cleanly to "do you need audio."
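The per-second figures in the table fall out of a simple calculation: rented GPU hourly rate, times inference time, divided by finished clip length. A minimal sketch, assuming a ~$3.60/hr H100 rental rate (illustrative, not a quoted price) and the mid-range inference times from the table:

```python
# Illustrative sketch of how the per-finished-second figures above are
# derived. The $3.60/hr H100 rate is an assumption in line with typical
# 2026 rental pricing; inference times are taken from the table.

def cost_per_finished_second(gpu_hourly_rate: float,
                             inference_seconds: float,
                             clip_seconds: float) -> float:
    """Cost of one finished second of video on rented compute."""
    cost_per_clip = gpu_hourly_rate * inference_seconds / 3600
    return cost_per_clip / clip_seconds

H100_RATE = 3.60  # assumed $/hr for a rented H100

# Wan 2.7: ~90s of H100 time for a 5s clip
wan = cost_per_finished_second(H100_RATE, 90, 5)    # ~$0.018/sec
# LTXV2: ~35s of H100 time for the same clip
ltxv = cost_per_finished_second(H100_RATE, 35, 5)   # ~$0.007/sec

print(f"Wan 2.7: ${wan:.3f}/finished-second, LTXV2: ${ltxv:.3f}/finished-second")
```

Plug in your actual rental rate and observed inference times; the ratio between the two models holds even as absolute prices move.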
Where Wan 2.7 wins
Native audio with built-in voice cloning. This is the killer feature and the reason Wan 2.7 is genuinely different from prior open-source video models. Drop a 30-second voice reference, write your script, and Wan 2.7 generates the video with the cloned voice and workable lipsync in a single pass. No closed premium model offers voice cloning native to the video model itself — on Versely the closed-side equivalent requires routing through separate voice cloning and lipsync tools. Wan 2.7 collapses that pipeline.
Apache 2.0 licence. The licence is fully permissive — fine-tune, redistribute, embed, build downstream products. For agencies building custom-tuned models on client aesthetics, for SaaS teams embedding video generation in their own product, for teams shipping branded content at scale — Wan 2.7 is the only mainstream video model whose licence lets you do all of this without concern.
Broader capability surface. First-last-frame, extend, multi-reference conditioning. Wan 2.7 has the toolkit a serious production needs for sequence work. LTXV2 is closer to a single-shot generation tool.
Quality on talking content. Lipsync isn't VEO 3.1's phoneme-accurate level, but it's well ahead of any post-hoc pipeline run on silent footage. For UGC-style talking content at zero per-call cost (after compute), Wan 2.7 is the strongest open-source option.
Where LTXV2 wins
Cheapest finished-second. On rented H100 the per-second cost lands around $0.008 — roughly half of Wan 2.7 on the same hardware and a fraction of any closed model's per-second API rate. For high-volume bulk b-roll, LTXV2 is the right economic choice with no close runner-up.
Speed. A 5-second clip in 25-45 seconds on H100 is the fastest open-source video pipeline. For prompt iteration and high-throughput batches this matters enormously. Wan 2.7's 75-110s for the equivalent clip is materially slower.
Lower compute floor. LTXV2 runs usefully on a 4090 or A6000 — hardware that's accessible to small studios and individual creators who can't justify H100 rental. Wan 2.7 production throughput effectively requires H100 / A100 class hardware.
Image-to-video stability on simpler briefs. For converting a static reference image into clean motion on a simple visual brief, LTXV2 is fast and reliable. Being the simpler model, it over-complicates less than Wan 2.7 on briefs that don't need the broader capability surface.
Bulk batch economics. When you're running 10K clips overnight on rented compute, the per-clip speed difference compounds into hours of wall-clock time and meaningful dollars on hourly GPU rates.
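To see how that compounding plays out, here is a back-of-envelope sketch for a 10K-clip overnight batch. Inference times come from the table above; the 8-GPU pool and $3.60/hr H100 rate are illustrative assumptions:

```python
# Rough sketch of how per-clip speed compounds over a 10K-clip batch.
# The 8-GPU pool size and $3.60/hr H100 rate are assumptions.

def batch_hours(clips: int, secs_per_clip: float, gpus: int = 1) -> float:
    """Wall-clock hours to run a batch spread evenly across a GPU pool."""
    return clips * secs_per_clip / 3600 / gpus

CLIPS, GPUS, RATE = 10_000, 8, 3.60

ltxv_hours = batch_hours(CLIPS, 35, GPUS)  # ~12.2 wall-clock hours
wan_hours = batch_hours(CLIPS, 90, GPUS)   # ~31.3 wall-clock hours

# The GPU-hour cost difference is independent of pool size
extra_cost = (90 - 35) * CLIPS / 3600 * RATE  # ~$550 extra on Wan 2.7

print(f"LTXV2: {ltxv_hours:.1f}h, Wan 2.7: {wan_hours:.1f}h, "
      f"extra spend if routed to Wan 2.7: ${extra_cost:.0f}")
```

On these assumed numbers the same batch takes an overnight window on LTXV2 but spills well past it on Wan 2.7, with roughly $550 of extra GPU spend for clips that didn't need audio.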
LTXV2 wins throughput economics; Wan 2.7 wins capability surface — the split is clean.
Use case by use case
The honest verdict, job by job:
- Talking content with cloned voice at scale: Wan 2.7. Native voice cloning is the moat.
- High-volume silent b-roll for YouTube: LTXV2. Cheapest finished-second.
- Visual hooks for TikTok / Reels at volume: LTXV2. Speed and cost win.
- UGC-style ad at zero per-call cost: Wan 2.7. Talking-head with cloned voice in one pass.
- Brand-fine-tuned model on agency client aesthetic: Wan 2.7 Apache 2.0. Fine-tune, deploy.
- Embedding video generation in a downstream SaaS product: Either; both are licence-clean. LTXV2 for cost-sensitive embeds.
- Long single-shot hero piece: Wan 2.7 with extend. LTXV2 caps at 8s standard.
- Multi-shot narrative with character continuity: Wan 2.7 reference conditioning is stronger.
- Music video with synced audio: Wan 2.7. LTXV2 is silent.
- Bulk image-to-video conversion of a product catalogue: LTXV2. Speed and cost both matter.
- Voice-driven explainer at zero per-call cost: Wan 2.7. The full pipeline collapses to one model call.
- Ultra-high-volume social automation (50K+ clips/month): LTXV2 self-hosted. Economics dominate everything.
Combined workflow via Versely
The teams running open infra on Versely don't pick one — they route both:
- LTXV2 for the bulk — silent b-roll, transitions, visual hooks, image-to-video conversions. Roughly 60-70% of typical production volume by clip count.
- Wan 2.7 for the talking and audio moments — UGC ads, narration, music-synced cuts, anything where audio is generated in the same pass. Roughly 20-30% of typical production volume.
- Wan 2.7 fine-tunes for brand-specific aesthetics — when you've trained a custom Wan 2.7 on a client's brand language, that's the hero-shot model for that client's work.
- LTXV2 for prompt iteration and batch testing — when you're throwing 8 prompt variants at a problem, LTXV2's speed makes the iteration cycle livable.
- Closed models for the remaining 5-10% — phoneme-precise lipsync briefs, stylized hero shots where Sora 2 is genuinely the right tool. Open-source isn't trying to win every shot.
Versely's AI movie maker sequences both open and closed models in a single timeline and Versely's AI b-roll generator automatically routes b-roll calls to LTXV2 when audio isn't required. For broader context on the open-versus-closed trade-off see our open-source vs closed AI video models guide, and for the wider 2026 model landscape see our mid-year roundup of what's new in 2026 and the best AI video generation models of 2026 ranking.
Apache 2.0 weights and licence-clean open weights enable production patterns closed models structurally can't.
FAQ
Can I run Wan 2.7 on a 4090?
Inference will work but production throughput is poor. A 5-second clip can take 4-6 minutes on a 4090 versus 75-110s on H100. For occasional generation it's viable; for production volume, rent H100 or A100 capacity.
Is LTXV2 going to add audio?
The Lightricks team has signalled audio is on the roadmap but no public release date as of mid-2026. Until then, LTXV2 + post-hoc audio via Versely's AI lipsync and voice cloning workflow is the path.
How does Wan 2.7's voice cloning compare to dedicated voice clone models?
For voice quality alone, dedicated voice cloning models (ElevenLabs-class) edge it. For workflow simplicity — generating video with the cloned voice and lipsync in a single model call — Wan 2.7 is unique. The trade-off depends on whether voice quality or pipeline simplicity matters more for your brief.
What's the break-even versus closed APIs?
Roughly: at ~30,000-50,000 generated seconds per month, rented H100 running Wan 2.7 or LTXV2 crosses below VEO 3.1's $0.12/sec API rate. Below that volume, closed APIs are usually cheaper once you factor infra ops time. Above it, open-source wins on economics alone.
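The arithmetic behind that range is a fixed-versus-variable cost crossover. A sketch, assuming a $3,000/month infra-plus-ops overhead (an illustrative figure, not a measured one) and the per-second compute costs from the comparison above:

```python
# Back-of-envelope break-even behind the 30-50K seconds/month figure.
# The $3,000/month ops + infra overhead is an assumed illustration.

def breakeven_seconds(api_rate: float,
                      self_host_rate: float,
                      fixed_monthly: float) -> float:
    """Monthly generated seconds above which self-hosting is cheaper."""
    return fixed_monthly / (api_rate - self_host_rate)

VEO_RATE = 0.12       # $/generated second, closed API
OPS_OVERHEAD = 3_000  # assumed $/month for infra + ops time

ltxv_be = breakeven_seconds(VEO_RATE, 0.008, OPS_OVERHEAD)  # ~26,800 s/month
wan_be = breakeven_seconds(VEO_RATE, 0.018, OPS_OVERHEAD)   # ~29,400 s/month

print(f"LTXV2 break-even: ~{ltxv_be:,.0f} s/month")
print(f"Wan 2.7 break-even: ~{wan_be:,.0f} s/month")
```

Heavier ops overhead pushes the break-even toward the top of the 30-50K range; a leaner setup pulls it down. Run it with your own overhead figure before committing to self-hosting.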
Can I fine-tune Wan 2.7 on my own data?
Yes: the Apache 2.0 licence permits fine-tuning, distribution of fine-tunes, and commercial use of fine-tuned outputs. This is the structural advantage open weights provide that closed models cannot match.
Closing takeaway
Wan 2.7 and LTXV2 split the open-source video tier on the same axis the broader market splits on: capability versus economics. Wan 2.7 brings audio, voice cloning, multi-reference and a broader feature surface at higher compute cost. LTXV2 brings the cheapest finished-second on the market at narrower capability. The teams running mature production infra on Versely in 2026 don't pick one — they route LTXV2 for the bulk silent volume and Wan 2.7 for the talking and audio moments, with closed premium models reserved for the 5-10% of shots where the cinematic or lipsync ceiling actually matters. The open-source video tier in 2026 is no longer a compromise — it's a real production tool with structural advantages closed models can't match. Treat it that way and the economics of high-volume premium-adjacent video finally work.