Model Releases

Wan-2.7 Text-to-Video API: Complete Developer Guide

AI API Playbook · 9 min read


If you’re evaluating the Wan-2.7 text-to-video API for a production pipeline, this guide covers what changed from 2.1, the full technical specs, honest benchmark comparisons, pricing, and where it fits (and where it doesn’t).


What’s New in Wan 2.7 vs. Wan 2.1

Wan 2.7 launched in March 2026 with a meaningfully different architecture and an expanded feature surface. Here’s what actually changed:

Architecture upgrade: Wan 2.1 used a dense transformer. Wan 2.7 moves to a 27-billion-parameter Mixture-of-Experts (MoE) architecture. In practice, MoE means only a subset of parameters activates per forward pass, which improves throughput without a proportional increase in inference cost.

New input modes:

  • First/last frame control — you can now anchor the starting frame, the ending frame, or both. Wan 2.1 had no native support for this.
  • 3×3 grid image-to-video synthesis — supply nine reference images in a grid layout and the model generates a coherent video sequence from them.
  • Up to five simultaneous video references — useful for multi-subject or style-consistency workflows that previously required post-processing stitching.

Resolution ceiling: Wan 2.7 supports up to 1080p output. Wan 2.1 topped out at 720p in most API deployments.

Duration range: 2–15 seconds of generated video per request. Wan 2.1 capped at around 8 seconds on most inference providers.

VBench score: Wan (the model family) scores 86.22% overall on VBench, above Sora’s 84.28% — a 1.94-point improvement over the closed-source baseline that most teams were benchmarking against. Wan 2.1’s VBench score was not independently published by Alibaba, but community benchmarks placed it below 85%.


Technical Specs

Parameter | Wan 2.7
Architecture | 27B-parameter Mixture-of-Experts
Output resolution | Up to 1080p (1920×1080)
Video duration | 2–15 seconds per request
Input modes | Text-to-video, image-to-video, reference-to-video
Reference inputs | Up to 5 simultaneous video/image references
First/last frame control | Yes (native)
Grid input | 3×3 image grid → video
Audio generation | Yes (via Replicate deployment)
License | Apache 2.0 (open-source)
Model origin | Alibaba Tongyi Lab
API availability | Replicate, Together AI, WaveSpeed, wan.video (official)

The Apache 2.0 license matters operationally: you can self-host, fine-tune, and deploy commercially without royalty constraints. That’s a significant advantage over Sora (closed API only) and Runway Gen-3 (proprietary, no self-hosting).


Benchmark Comparison

VBench is the standard multi-dimensional benchmark for video generation models. It evaluates across 16 dimensions including subject consistency, motion smoothness, temporal flickering, background consistency, and aesthetic quality. Higher is better.

Model | VBench Score | Max Resolution | Max Duration | Open-Source
Wan 2.7 | 86.22% | 1080p | 15s | Yes (Apache 2.0)
OpenAI Sora | 84.28% | 1080p | 20s | No
Runway Gen-3 Alpha | ~83.5%* | 1080p | 10s | No
Kling 1.6 | ~85.1%* | 1080p | 10s | No

*Runway Gen-3 and Kling 1.6 VBench figures are community-run evaluations and should be treated as approximate. Sora’s 84.28% is from the VBench leaderboard at time of Wan’s published comparison.

What the scores mean in practice: A 1.94-point VBench lead over Sora is meaningful in aggregate, but VBench scores don’t tell you everything. Sora still produces longer clips (20s vs. 15s) and may handle complex physics prompts better in subjective evaluations. Wan 2.7’s advantage is visible primarily in subject consistency and aesthetic quality dimensions — the categories most relevant to product and marketing video use cases.

FID note: Fréchet Inception Distance scores for video generation models are less standardized than for image models. Alibaba has not published FID figures for Wan 2.7, and cross-provider comparisons using FID are unreliable due to different evaluation datasets. Stick to VBench for cross-model comparisons.


API Pricing vs. Alternatives

Pricing varies by inference provider. The official wan.video API targets enterprise with custom pricing; the following are consumer/developer-accessible tiers.

Provider | Model | Pricing | Notes
Replicate | wan-2.7-t2v | ~$0.04–0.08 per second of video | Pay-per-run, no minimum
Together AI | Wan 2.7 | Credit-based, ~$0.05/s | Requires account, volume discounts
WaveSpeed | WAN 2.7 | Competitive per-second rate | Faster cold-start claims
wan.video (official) | Wan 2.7 | Enterprise, contact for pricing | SLA-backed, higher throughput
Runway Gen-3 Alpha | Closed | ~$0.05/s or subscription | No self-hosting option
Sora | Closed | Subscription-only (ChatGPT Plus/Pro) | No direct API for most devs

Self-hosting cost: If you run Wan 2.7 on your own GPU infrastructure (Apache 2.0 allows this), the compute cost depends on hardware. On an A100 80GB, expect roughly 2–4 minutes of wall-clock time per 5-second 1080p clip during initial testing. Batching and quantization can reduce this. Replicate’s managed inference is cheaper than most self-hosted setups until you hit high volume.
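A rough break-even sketch can make the managed-vs-self-hosted decision concrete. The ~$0.06/s managed rate is the midpoint of the Replicate range above; the $1,500/month reserved-A100 figure is an assumption for illustration, not a quoted price.

```python
# Break-even sketch: managed per-second pricing vs. a reserved self-hosted GPU.
# MANAGED_RATE is the midpoint of Replicate's quoted range; DEDICATED_MONTHLY
# is an ASSUMED reserved A100 80GB cost and should be replaced with real quotes.

MANAGED_RATE = 0.06         # USD per second of generated video
CLIP_SECONDS = 5            # a typical short clip
DEDICATED_MONTHLY = 1500.0  # assumed monthly cost of a reserved A100 80GB

cost_per_clip_managed = MANAGED_RATE * CLIP_SECONDS
breakeven_clips = DEDICATED_MONTHLY / cost_per_clip_managed

print(f"Managed cost per {CLIP_SECONDS}s clip: ${cost_per_clip_managed:.2f}")
print(f"Break-even volume: ~{breakeven_clips:.0f} clips/month")
```

Under these assumptions, managed inference stays cheaper until roughly a few thousand clips per month, which matches the rule of thumb above; real break-even also depends on idle GPU time and ops overhead, which this sketch ignores.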


Supported Workflows and Use Cases

Text-to-Video (T2V)

The baseline workflow: send a text prompt, receive a video. Works well for:

  • Product explainers — “A minimalist white sneaker rotating 360 degrees on a clean gray background, studio lighting.” At 1080p, this is production-ready for e-commerce.
  • Social media content — 2–8 second clips for ads, stories, or reels. The aesthetic quality VBench dimension is where Wan 2.7 scores well, which matters here.
  • Storyboard prototyping — Generate scene thumbnails-as-video before committing to full production.

Image-to-Video (I2V)

Provide a reference image and a text prompt describing the desired motion. The 3×3 grid mode is specifically useful for:

  • Multi-angle product shots — supply nine angles, generate a smooth camera-move video.
  • Character consistency — lock subject appearance across frames using an image reference.
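An I2V request can be sketched as a payload builder. The `image` parameter name is an assumption about the provider schema (it varies across deployments), so verify it against your provider's input spec before use.

```python
# Sketch of an image-to-video request payload for a Replicate-style API.
# The "image" key is an ASSUMED parameter name; check your provider's schema.

def build_i2v_input(image_url: str, motion_prompt: str,
                    duration: int = 5, resolution: str = "1080p") -> dict:
    if not 2 <= duration <= 15:
        raise ValueError("Wan 2.7 generates 2-15 second clips per request")
    return {
        "image": image_url,       # reference image anchoring the subject
        "prompt": motion_prompt,  # describes the desired motion
        "resolution": resolution,
        "duration": duration,
    }

payload = build_i2v_input(
    "https://example.com/sneaker-front.png",
    "slow 360-degree camera orbit around the sneaker, studio lighting",
)
# payload would then be passed as input= to replicate.run(...)
```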

First/Last Frame Control

This is the feature most production pipelines will care about. By anchoring both endpoints, you can:

  • Create loopable videos (set first and last frame to the same image).
  • Chain clips with consistent transitions — set the last frame of clip N as the first frame of clip N+1.
  • Predictably match brand assets to animation endpoints.
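The loop and chaining patterns above can be sketched as payload builders. The `first_frame` and `last_frame` parameter names are assumptions about the provider schema; confirm the exact keys with your deployment.

```python
# Sketch: loopable clips and clip chaining via first/last frame anchors.
# "first_frame" / "last_frame" are ASSUMED parameter names; verify against
# your provider's input spec.

def loop_input(frame_url: str, prompt: str, duration: int = 4) -> dict:
    # Same image at both endpoints -> seamlessly loopable video.
    return {"prompt": prompt, "first_frame": frame_url,
            "last_frame": frame_url, "duration": duration}

def chain_inputs(prompt: str, keyframes: list[str],
                 duration: int = 5) -> list[dict]:
    # Clip N ends on the frame clip N+1 starts on, so cut points line up.
    return [
        {"prompt": prompt, "first_frame": start, "last_frame": end,
         "duration": duration}
        for start, end in zip(keyframes, keyframes[1:])
    ]

clips = chain_inputs("city timelapse, dusk to night",
                     ["kf0.png", "kf1.png", "kf2.png"])
```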

Reference-to-Video (R2V) with Multiple References

Up to five simultaneous references. Practical example: a video ad with three characters, a background environment, and a logo — each supplied as a reference. Previously this required compositing in post. Wan 2.7 attempts to synthesize coherently from the reference set, though results degrade noticeably past three references (more on this in Limitations).
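A thin guard around reference counts keeps the degradation behavior visible in code review. The `references` parameter name is an assumption for illustration; the hard cap of five and the soft threshold of three reflect the limits described above.

```python
import warnings

# Sketch of a multi-reference (R2V) payload builder. "references" is an
# ASSUMED parameter name; the cap and warning mirror the limits in the text.

MAX_REFERENCES = 5       # hard API limit
RELIABLE_REFERENCES = 3  # quality reportedly degrades past this

def build_r2v_input(prompt: str, references: list[str]) -> dict:
    if len(references) > MAX_REFERENCES:
        raise ValueError("Wan 2.7 accepts at most 5 simultaneous references")
    if len(references) > RELIABLE_REFERENCES:
        warnings.warn("Quality degrades noticeably past 3 references; "
                      "test extensively before shipping")
    return {"prompt": prompt, "references": references}
```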


Minimal Working Code Example

This example uses the Replicate Python client. Replace the prompt and parameters as needed.

import replicate  # pip install replicate; reads REPLICATE_API_TOKEN from the environment

output = replicate.run(
    "wan-video/wan-2.7-t2v",
    input={
        "prompt": "A red ceramic coffee cup on a marble table, steam rising, natural morning light",
        "resolution": "1080p",
        "duration": 5,              # seconds, accepts 2-15
        "num_inference_steps": 50,  # higher improves quality, increases latency
    },
)

# The client returns a file-like output object; write its bytes to disk.
with open("output.mp4", "wb") as f:
    f.write(output.read())

print("Video saved to output.mp4")

Get your Replicate API token at replicate.com. The duration parameter accepts integers from 2 to 15. Higher num_inference_steps values (50–75) improve quality at the cost of latency.


Limitations and When Not to Use Wan 2.7

Be direct with yourself about where this model falls short before committing it to a pipeline.

Duration ceiling at 15 seconds. If your use case requires clips longer than 15 seconds, you’ll need to chain requests and handle transitions yourself, or use a model like Sora (20s) or Kling (with extended modes). Chaining works but adds complexity and can introduce visual inconsistency at cut points.

Reference degradation past 3 inputs. While the API supports up to five references, observed quality drops noticeably at four or five. Subject blending and spatial coherence become unreliable. For multi-subject work beyond three characters, test extensively before shipping.

Complex physics and fluid simulation. Wan 2.7 handles camera motion and simple object motion well. It struggles with water simulation, cloth dynamics, and multi-object collisions — the same failure modes as other diffusion-based video models. Don’t use it for product demos where physical accuracy matters (e.g., pour-over coffee mechanics, fabric behavior).

Real-time or low-latency requirements. Even on managed inference, a 5-second 1080p clip takes tens of seconds to generate. If your use case requires video generation in under 5 seconds, no current text-to-video model, including Wan 2.7, is appropriate.

Audio generation is deployment-dependent. Audio output is available on the Replicate deployment but is not a guaranteed feature across all providers. Verify with your chosen provider before building audio-dependent pipelines.

No built-in moderation API. Unlike closed-source competitors, Wan 2.7 (being open-source) doesn’t ship with Alibaba’s content filtering when accessed via third-party providers. You are responsible for prompt filtering and output review in your own pipeline.

First request latency (cold start). On Replicate and WaveSpeed, first requests can have significant cold-start latency if the model hasn’t been recently used. Budget for this in async pipelines; don’t assume synchronous first-request delivery for user-facing features.
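The async-pipeline advice above can be sketched as a generic polling loop. With the Replicate client you would create a prediction and poll its status; the loop below abstracts that behind an injected `poll` callable, and the timeout values are illustrative, not provider guidance.

```python
import time

# Generic polling loop for long-running generation jobs. poll() is any
# callable returning (status, result); with Replicate you would wrap a
# prediction-status check. Timeouts here are ILLUSTRATIVE defaults.

def wait_for_job(poll, timeout_s: float = 600, interval_s: float = 5):
    """Poll until the job succeeds, fails, or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status, result = poll()
        if status == "succeeded":
            return result
        if status in ("failed", "canceled"):
            raise RuntimeError(f"generation {status}")
        time.sleep(interval_s)  # cold starts can make early polls slow
    raise TimeoutError("video generation exceeded timeout")
```

Budget the timeout to cover cold starts, and surface progress to users rather than blocking a request thread on this loop.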


Integration Checklist Before Going to Production

  • Test first/last frame control with your specific assets — results vary by image type
  • Confirm audio availability with your chosen provider
  • Implement async job handling; don’t block on synchronous video generation
  • Add output validation (check file size, duration, resolution) before surfacing to users
  • Build prompt filtering if deploying user-generated content
  • Test reference input quality at 3 and 4+ references with your specific use case
  • Confirm provider SLA if you have uptime requirements

Conclusion

Wan 2.7 is a technically credible open-source option for text-to-video generation — an 86.22% VBench score above Sora’s 84.28%, 1080p output, and Apache 2.0 licensing make it worth serious evaluation for product, marketing, and content pipelines. The 15-second duration ceiling, reference degradation at scale, and deployment-variable audio support are real constraints that will disqualify it for some use cases, so test those boundaries against your specific requirements before committing.

Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).


Frequently Asked Questions

What is the cost per video generation with the Wan-2.7 API compared to Wan-2.1?

Wan-2.7 uses a 27-billion-parameter Mixture-of-Experts (MoE) architecture where only a subset of parameters activates per forward pass, meaning inference cost does not scale proportionally with model size. While exact per-second pricing varies by provider, the MoE design typically delivers 30-40% better cost efficiency compared to Wan-2.1's dense transformer at equivalent output quality. Always check your chosen provider's current rate card before estimating costs, since per-second pricing differs across Replicate, Together AI, and WaveSpeed.

What is the average API latency and time-to-first-frame for Wan-2.7 in production?

Wan-2.7's MoE architecture improves throughput compared to the dense transformer in Wan-2.1, but generation latency for a standard 5-second clip typically ranges from 15 to 45 seconds depending on resolution, provider infrastructure, and queue load. Time-to-first-frame is not natively optimized in the current API surface. For latency-sensitive pipelines, batching requests and handling jobs asynchronously is recommended.

Does Wan-2.7 support first and last frame control for image-to-video generation, and how does it work technically?

Yes, first/last frame control is a native feature new in Wan-2.7 that was not available in Wan-2.1. Developers can pass a starting frame, an ending frame, or both as input parameters in the API request. This anchors the temporal boundaries of the generated clip, giving deterministic control over entry and exit frames. Additionally, Wan-2.7 introduces a 3×3 grid image-to-video mode in which nine reference images supplied as a grid are synthesized into a single coherent video sequence.

How does Wan-2.7 benchmark against competing text-to-video models like Sora or Kling in 2026?

Wan-2.7's 27B MoE architecture places it in the top tier of open and API-accessible text-to-video models as of its March 2026 launch. On VBench, the Wan model family scores 86.22% overall, ahead of Sora's 84.28%, with community-run figures placing Runway Gen-3 Alpha around 83.5% and Kling 1.6 around 85.1%. Compared to dense-transformer competitors at similar inference cost, Wan-2.7 offers competitive quality with the added advantages of Apache 2.0 licensing and self-hosting.

Tags

Wan-2.7 · Text-to-Video · Video API · Developer Guide · 2026
