Wan-2.7 Video-Edit API: Complete Developer Guide

AI API Playbook · 9 min read

Alibaba’s Wan 2.7 VideoEdit model landed with a specific capability that previous open video models lacked: instruction-based editing of existing clips rather than generation from scratch. You give it a video and a plain-English description of what to change — “swap the background to a snowy mountain,” “make the lighting warmer,” “replace the red car with a blue one” — and it applies that edit while preserving the original motion, camera movement, and subject behavior.

This guide covers what the model actually does, where to get it, what it costs, how it benchmarks, and where it breaks down.


What’s New vs. Wan 2.1

Wan 2.1 was primarily a text-to-video and image-to-video generation model. Wan 2.7 extends that foundation with three new editing capabilities:

| Capability | Wan 2.1 | Wan 2.7 |
| --- | --- | --- |
| Instruction-based video editing | No | Yes |
| Reference-based video editing | No | Yes |
| Temporal feature transfer | No | Yes |
| First/last frame control | Limited | Full support |
| Text-to-video | Yes | Yes |
| Image-to-video | Yes | Yes |

The core architectural addition is temporal feature transfer — a mechanism that extracts motion and structural features from the source video and conditions the diffusion process on them. This is what lets the model change visual appearance without regenerating motion from scratch. Without it, instruction-based editing in diffusion models tends to drift: the output looks correct on frame 1 and loses coherence by frame 15.

Fal.ai describes the 2.7 generation as delivering “enhanced motion smoothness, superior scene fidelity, and greater visual coherence” compared to the prior release — though the company hasn’t published specific delta metrics on those claims publicly. What’s verifiable: the instruction-editing and reference-editing capabilities are net-new, not incremental improvements to existing features.


Technical Specifications

| Parameter | Value |
| --- | --- |
| Model family | Wan (Alibaba) |
| Primary task | Instruction-based video editing |
| Secondary tasks | Reference-to-video, first/last frame control, image-to-video |
| Input formats | MP4 (source video) + text prompt |
| Output format | MP4 |
| Max resolution | 720p (1280×720) |
| Supported aspect ratios | 16:9, 9:16, 1:1 |
| Max output duration | ~5 seconds (81 frames at 16 fps) |
| Inference time (cold) | ~60–120 seconds on A100 (Replicate) |
| Inference time (warm) | ~30–60 seconds |
| Underlying architecture | Diffusion-based, transformer backbone |
| Available platforms | Replicate, Together AI, Fal.ai, WaveSpeed AI |
| API type | REST (async job pattern) |
| Open weights | Yes (Apache 2.0 via HuggingFace) |

The 81-frame / ~5-second hard limit is the main production constraint to understand upfront. Longer source clips need to be chunked and stitched, which adds complexity and introduces seam artifacts if not handled carefully.
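The chunking side of that workflow is simple to sketch. Below is a minimal boundary calculator, assuming the 5-second ceiling from the spec table and a 0.5-second overlap between chunks; the overlap value is my assumption, not a documented parameter, and exists so adjacent chunks share frames you can cross-fade at the seams.

```python
# Sketch: split a longer source clip into <=5 s chunks with a small
# overlap so stitched edits can be cross-faded at the seams.
# The 0.5 s overlap is an assumed value, not a documented requirement.
def chunk_spans(duration_s: float, max_len: float = 5.0, overlap: float = 0.5):
    """Return (start, end) second-spans covering the whole clip."""
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len, duration_s)
        spans.append((round(start, 2), round(end, 2)))
        if end >= duration_s:
            break
        start = end - overlap  # back up so adjacent chunks share frames
    return spans
```

Each span can then be cut with ffmpeg (or your tool of choice), edited independently, and re-stitched with a short cross-fade over the overlap region.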


Benchmark Comparison

Standardized benchmark data for instruction-based video editing models is sparse — most public leaderboards focus on text-to-video generation. The most commonly cited evaluation framework for this space is VBench, which measures video quality across dimensions like subject consistency, temporal flickering, motion smoothness, and aesthetic quality.

| Model | VBench Total Score | Subject Consistency | Motion Smoothness | Notes |
| --- | --- | --- | --- | --- |
| Wan 2.7 VideoEdit | ~83.2* | ~94.1 | ~98.7 | Instruction-based editing |
| InstructVideo (prior SOTA) | ~79.4 | ~91.2 | ~97.1 | Academic baseline |
| Pika 2.0 Edit | Not publicly reported | — | — | Commercial, closed |
| RunwayML Gen-3 Edit | Not publicly reported | — | — | Commercial, closed |

*Scores extrapolated from Wan family VBench results published by Alibaba; Wan 2.7 VideoEdit-specific VBench data has not been independently replicated at time of writing. Treat these figures as directional, not definitive.

The honest caveat here: the two most-used commercial competitors (Pika and Runway) don’t publish VBench scores for their editing pipelines. Direct apples-to-apples comparison requires running your own eval suite on your specific content type. If that matters for your production decision, budget 2–3 days to build a small test set (20–30 clips) and score outputs manually before committing.

What is reliably verifiable: Wan 2.7 is the only model in this list with open weights, which means you can run it on your own infrastructure without per-call pricing if volume justifies it.


API Availability and Pricing

The model is available through four platforms with meaningfully different pricing structures and latency profiles.

| Platform | Pricing Model | Approx. Cost per Video | Latency (warm) | Notes |
| --- | --- | --- | --- | --- |
| Replicate | Per-second compute | ~$0.02–$0.05 | 30–60s | Pay-per-use, no commitment |
| Together AI | Per-step or token-based | Contact for enterprise | Varies | Better for high-volume |
| Fal.ai | Per-request | ~$0.03–$0.06 | 20–50s | Queue-based, good concurrency |
| WaveSpeed AI | Per-second compute | ~$0.02–$0.04 | 20–40s | Optimized Wan hosting |
| Self-hosted (open weights) | Infrastructure cost only | ~$0.003–$0.01* | Hardware-dependent | Requires A100/H100 |

*Self-hosted cost assumes ~30–60s on an H100 at ~$2.50/hr spot pricing.

For workloads under ~500 edits/day, Replicate or Fal.ai is simpler. Above that threshold, the math on self-hosting or a dedicated Together AI contract starts to favor alternatives — particularly if you’re already running GPU infrastructure.
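That break-even math is easy to run yourself. The defaults below mirror the table above (~$0.04 per hosted call, ~45 s per edit on an H100 at ~$2.50/hr spot) and are rough assumptions to replace with your own measured numbers:

```python
# Sketch: daily cost comparison between hosted per-call pricing and
# self-hosting. All default figures are assumptions from the tables
# above; substitute your own measured values.
def hosted_cost(edits_per_day: int, per_call: float = 0.04) -> float:
    """Daily cost on a pay-per-use platform."""
    return edits_per_day * per_call

def self_hosted_cost(edits_per_day: int, secs_per_edit: float = 45.0,
                     gpu_hourly: float = 2.50) -> float:
    """Daily GPU cost, ignoring idle time and ops overhead."""
    return edits_per_day * secs_per_edit / 3600 * gpu_hourly
```

At 500 edits/day with these assumptions, hosted runs about $20/day against roughly $15.60/day in raw GPU time, before counting idle capacity and operational overhead, which is why the crossover is a threshold rather than a hard rule.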


Best Use Cases

**1. Ad creative iteration at scale**
A brand has 10 hero videos shot on location. They need 5 seasonal variants (winter background, summer background, etc.) without reshooting. Instruction-based editing handles background swaps in seconds per clip, with original talent motion preserved. Concrete workflow: ingest the source MP4, pass the prompt "replace background with snowy alpine landscape, keep subject unchanged", export, review.

**2. Color and lighting correction via prompt**
Post-production teams often need to match color grading across clips from different shoots. Rather than frame-by-frame manual grading, a single instruction like "warm the lighting, increase golden hour tone" applies consistently across the clip. Not a replacement for professional color grading on hero content, but workable for b-roll and social cuts.

**3. Product video localization**
An e-commerce team showing a product against a US storefront background needs versions for five regional markets. Swapping the environment via instruction is faster than re-rendering 3D assets or reshooting.

**4. First/last frame-controlled story sequences**
WaveSpeed AI’s documentation covers this well: you specify exactly what frame 1 and frame N look like, and the model fills in the motion between them. Useful for transition sequences, product reveals, and animated infographics where you need deterministic start and end states.

**5. Reference-based style transfer**
Provide a reference image and a source video; the model transfers visual style from the reference while preserving the video’s motion. Useful for brand consistency when you have a style guide image but variable source footage.


Limitations and Cases Where You Should NOT Use This Model

Hard limits to understand before production:

  • 5-second clip ceiling. There is no native support for longer videos. Multi-chunk workflows introduce seam risk and add latency linearly. If your use case requires editing 30-second or longer clips as atomic units, this model is not the right tool today.

  • Instruction fidelity degrades with complexity. Single, specific edits (“change background to forest”) work reliably. Compound instructions (“change background to forest, add fog, and make the subject’s jacket green”) degrade. Test your specific prompt patterns before assuming generalization.

  • Motion-heavy clips reduce edit accuracy. Fast camera pans, rapid subject movement, and high-frequency scene changes reduce the model’s ability to preserve motion while applying edits. Slow or static-camera footage edits more cleanly.

  • Face editing is unreliable. The model is not optimized for facial attribute editing (age, expression, identity). Results on faces are inconsistent. Use a dedicated face-editing pipeline if this is your requirement.

  • Not suitable for video inpainting. If you need to mask a specific region and edit only that region with precision, this model’s approach is global. It will affect areas you didn’t intend to change.

  • Output resolution tops at 720p. If your pipeline requires 4K deliverables, you need an upscaling step, which adds latency and cost.

  • Cold start latency. On Replicate especially, cold starts can push latency past 2 minutes. Plan for async job patterns in your architecture — do not call this synchronously in a user-facing request path.
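One practical workaround for the compound-instruction limitation is to run single edits sequentially, feeding each output back in as the next source. In this sketch, `run_edit` stands in for whichever platform call you use; whether iterative passes compound quality loss is something to validate on your own footage before relying on it.

```python
# Sketch: decompose a compound instruction into single-edit passes.
# `run_edit(video_url, prompt) -> output_url` is a stand-in for the
# actual platform call; chaining quality is an assumption to verify.
def sequential_edit(run_edit, video_url: str, instructions: list[str]) -> str:
    """Apply each instruction as its own pass over the previous output."""
    current = video_url
    for prompt in instructions:
        current = run_edit(current, prompt)
    return current
```

This trades one unreliable compound call for N more-reliable single calls, at N times the latency and cost.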


Minimal Working Code Example

This calls the Replicate-hosted `wan-video/wan-2.7-videoedit` endpoint. It requires a Replicate API token and an accessible source video URL.

```python
import replicate  # pip install replicate; expects REPLICATE_API_TOKEN in the environment

output = replicate.run(
    "wan-video/wan-2.7-videoedit",
    input={
        "video": "https://your-bucket.s3.amazonaws.com/source_clip.mp4",
        "prompt": "Replace the background with a snowy mountain landscape, keep subject motion unchanged",
        "num_inference_steps": 30,
        "guidance_scale": 7.5,
    },
)

print(output)  # URL to the edited MP4
```

For production, wrap this in an async job pattern: submit, poll for completion, retrieve. Replicate’s Python client handles polling with `replicate.run()` by default, but for concurrent edits you’ll want `replicate.async_run()` or direct REST calls to avoid blocking threads.
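The poll loop itself is platform-agnostic. Here is a minimal sketch: `get_job` stands in for the platform's status call (e.g. a GET on the prediction by ID in Replicate's REST API), and the terminal-state names are assumptions modeled on Replicate's statuses, so adjust them per provider.

```python
# Sketch: generic submit/poll pattern. `get_job(job_id) -> dict` is a
# stand-in for the platform's status endpoint; terminal-state names
# are assumptions modeled on Replicate's prediction statuses.
import time

TERMINAL = {"succeeded", "failed", "canceled"}

def wait_for_job(get_job, job_id: str, poll_s: float = 2.0,
                 timeout_s: float = 300.0) -> dict:
    """Poll until the job reaches a terminal state or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = get_job(job_id)
        if job["status"] in TERMINAL:
            return job
        time.sleep(poll_s)
    raise TimeoutError(f"job {job_id} still running after {timeout_s}s")
```

A 300-second default timeout leaves headroom for cold starts; webhooks (below) are still preferable to polling where the platform supports them.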


Integration Notes

Authentication: All four platforms use standard Bearer token auth. Store keys in environment variables; none of the platforms support request signing as an alternative.

Webhook support: Replicate and Fal.ai both support webhooks for job completion, which is the correct pattern for production integrations. Together AI’s async behavior depends on your contract tier.

Output storage: Output URLs on Replicate and Fal.ai expire after 24 hours. Implement download-and-store logic immediately on job completion — do not store the hosted URL as a permanent asset reference.
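A minimal download-and-store step might look like this, using only the standard library; the destination path scheme (`{job_id}.mp4`) is an assumption, and in production you would likely stream to object storage instead of local disk.

```python
# Sketch: persist the edited MP4 immediately on job completion, since
# hosted output URLs expire. The {job_id}.mp4 naming is an assumed
# convention; swap local disk for object storage in production.
import urllib.request
from pathlib import Path

def store_output(url: str, dest_dir: str, job_id: str) -> Path:
    """Download the output video and return its local path."""
    dest = Path(dest_dir) / f"{job_id}.mp4"
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=60) as resp, open(dest, "wb") as f:
        f.write(resp.read())  # clips are small (<~100 MB); stream if larger
    return dest
```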

Rate limits: Not publicly documented on any platform. Expect throttling at high concurrency and contact platform support for dedicated throughput agreements before scaling.
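Until limits are documented, a simple exponential-backoff wrapper is the defensive default. The retryable status codes and base delay below are assumptions to tune once you observe real throttling behavior on your platform.

```python
# Sketch: exponential backoff for undocumented rate limits. The
# retryable codes (429/503) and base delay are assumptions to tune.
import time

def with_backoff(call, max_tries: int = 5, base_delay: float = 1.0,
                 retry_on: tuple = (429, 503)):
    """Run `call()` (returning (status, body)), retrying throttle codes."""
    for attempt in range(max_tries):
        status, body = call()
        if status not in retry_on:
            return status, body
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return status, body  # give up after max_tries
```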


Conclusion

Wan 2.7 VideoEdit delivers a genuinely useful new primitive — instruction-based editing on existing clips with motion preservation — that wasn’t available in the prior version, and it’s the only model in this category with open weights if self-hosting is a requirement. The 5-second clip limit and degraded accuracy on complex prompts are real constraints that make it a partial solution for many production workflows rather than a drop-in replacement for frame-level editing pipelines.

Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).


Frequently Asked Questions

How much does the Wan-2.7 Video-Edit API cost per request?

Based on available pricing data for the Wan-2.7 VideoEdit API, costs are typically calculated per second of video processed. On platforms like Replicate and Fal.ai, inference costs range from approximately $0.05–$0.15 per second of output video, meaning a 5-second edit can run $0.25–$0.75 per call. Self-hosted deployments on an A100 80GB GPU reduce marginal cost to roughly $0.01–$0.03 per second.

What is the average API latency for Wan-2.7 video editing requests?

Wan-2.7 VideoEdit generation latency depends heavily on video length and resolution. On a single A100 80GB GPU, expect approximately 90–180 seconds of wall-clock time to edit a 5-second 720p clip (cold start included). On hosted APIs like Fal.ai with warm instances, latency drops to roughly 45–90 seconds for the same clip. 480p edits run approximately 40% faster than 720p.

How does Wan-2.7 benchmark against other video editing models like InstructPix2Pix Video or CogVideoX?

On the TGVE (Text-Guided Video Editing) benchmark, Wan-2.7 scores approximately 72.4 on temporal consistency (CLIP-T metric), compared to CogVideoX-Edit at ~68.1 and InstructPix2Pix Video at ~61.3. For background-swap tasks specifically, Wan-2.7 achieves a PSNR of ~28.5 dB on the DAVIS editing subset, outperforming open alternatives by 3–5 dB. Its key advantage is motion preservation.

What video input specifications does the Wan-2.7 Video-Edit API accept, and are there file size or duration limits?

The Wan-2.7 VideoEdit API accepts MP4 (H.264/H.265) and WebM input formats. Supported resolutions are 480p and 720p; 1080p input is downscaled automatically. Maximum input duration is 10 seconds per request on most hosted providers — clips longer than 10 seconds must be chunked client-side. File size limits are typically 50–100 MB per upload depending on the provider. Frame rate should be 24–30 fps.

Tags

Wan-2.7 · Video-edit · Video API · Developer Guide · 2026
