Wan-2.7 Video-Edit API: Complete Developer Guide
Alibaba’s Wan 2.7 VideoEdit model landed with a specific capability that previous open video models lacked: instruction-based editing of existing clips rather than generation from scratch. You give it a video and a plain-English description of what to change — “swap the background to a snowy mountain,” “make the lighting warmer,” “replace the red car with a blue one” — and it applies that edit while preserving the original motion, camera movement, and subject behavior.
This guide covers what the model actually does, where to get it, what it costs, how it benchmarks, and where it breaks down.
What’s New vs. Wan 2.1
Wan 2.1 was primarily a text-to-video and image-to-video generation model. Wan 2.7 extends that foundation with three new editing capabilities:
| Capability | Wan 2.1 | Wan 2.7 |
|---|---|---|
| Instruction-based video editing | ❌ | ✅ |
| Reference-based video editing | ❌ | ✅ |
| Temporal feature transfer | ❌ | ✅ |
| First/last frame control | Limited | Full support |
| Text-to-video | ✅ | ✅ |
| Image-to-video | ✅ | ✅ |
The core architectural addition is temporal feature transfer — a mechanism that extracts motion and structural features from the source video and conditions the diffusion process on them. This is what lets the model change visual appearance without regenerating motion from scratch. Without it, instruction-based editing in diffusion models tends to drift: the output looks correct on frame 1 and loses coherence by frame 15.
Fal.ai describes the 2.7 generation as delivering “enhanced motion smoothness, superior scene fidelity, and greater visual coherence” compared to the prior release — though the company hasn’t published specific delta metrics on those claims publicly. What’s verifiable: the instruction-editing and reference-editing capabilities are net-new, not incremental improvements to existing features.
Technical Specifications
| Parameter | Value |
|---|---|
| Model family | Wan (Alibaba) |
| Primary task | Instruction-based video editing |
| Secondary tasks | Reference-to-video, first/last frame control, image-to-video |
| Input formats | MP4 (source video) + text prompt |
| Output format | MP4 |
| Max resolution | 720p (1280×720) |
| Supported aspect ratios | 16:9, 9:16, 1:1 |
| Max output duration | ~5 seconds (81 frames at 16 fps) |
| Inference time (cold) | ~60–120 seconds on A100 (Replicate) |
| Inference time (warm) | ~30–60 seconds |
| Underlying architecture | Diffusion-based, transformer backbone |
| Available platforms | Replicate, Together AI, Fal.ai, WaveSpeed AI |
| API type | REST (async job pattern) |
| Open weights | Yes (Apache 2.0 via HuggingFace) |
The 81-frame / ~5-second hard limit is the main production constraint to understand upfront. Longer source clips need to be chunked and stitched, which adds complexity and introduces seam artifacts if not handled carefully.
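The chunk boundary arithmetic can be sketched directly. This is an assumption-level sketch, not part of the official API: the overlap size and the overlap-then-crossfade approach are choices you would tune, and actual frame decode/encode (e.g. via ffmpeg) is out of scope here.

```python
# Sketch: split a longer source clip into <=81-frame chunks for sequential
# editing, overlapping adjacent chunks by a few frames so the seams can be
# cross-faded at stitch time. Frame-count math only; decoding/encoding the
# actual video is handled elsewhere in your pipeline.
MAX_FRAMES = 81  # model's per-request ceiling (81 frames @ 16 fps)
OVERLAP = 8      # frames shared between adjacent chunks for blending

def chunk_ranges(total_frames: int,
                 max_frames: int = MAX_FRAMES,
                 overlap: int = OVERLAP) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges, end-exclusive."""
    if total_frames <= max_frames:
        return [(0, total_frames)]
    step = max_frames - overlap
    ranges = []
    start = 0
    while start + max_frames < total_frames:
        ranges.append((start, start + max_frames))
        start += step
    ranges.append((start, total_frames))
    return ranges
```

Each range is edited as its own API call; the shared frames give you material to blend across the seam when reassembling.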
Benchmark Comparison
Standardized benchmark data for instruction-based video editing models is sparse — most public leaderboards focus on text-to-video generation. The most commonly cited evaluation framework for this space is VBench, which measures video quality across dimensions like subject consistency, temporal flickering, motion smoothness, and aesthetic quality.
| Model | VBench Total Score | Subject Consistency | Motion Smoothness | Notes |
|---|---|---|---|---|
| Wan 2.7 VideoEdit | ~83.2* | ~94.1 | ~98.7 | Instruction-based editing |
| InstructVideo (prior SOTA) | ~79.4 | ~91.2 | ~97.1 | Academic baseline |
| Pika 2.0 Edit | Not publicly reported | — | — | Commercial, closed |
| RunwayML Gen-3 Edit | Not publicly reported | — | — | Commercial, closed |
*Scores extrapolated from Wan family VBench results published by Alibaba; Wan 2.7 VideoEdit-specific VBench data has not been independently replicated at time of writing. Treat these figures as directional, not definitive.
The honest caveat here: the two most-used commercial competitors (Pika and Runway) don’t publish VBench scores for their editing pipelines. Direct apples-to-apples comparison requires running your own eval suite on your specific content type. If that matters for your production decision, budget 2–3 days to build a small test set (20–30 clips) and score outputs manually before committing.
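A manual eval can be as simple as averaging blind 1-to-5 ratings per model across the test set. The model names and clip IDs below are placeholders, not real results:

```python
# Sketch: aggregate manual 1-5 ratings from a small eval set into a
# per-model mean, so hosted models without published VBench scores can
# still be compared on your own content type.
from collections import defaultdict
from statistics import mean

def score_models(ratings: list[tuple[str, str, int]]) -> dict[str, float]:
    """ratings: (model, clip_id, score 1-5) -> mapping of model -> mean."""
    by_model = defaultdict(list)
    for model, _clip, score in ratings:
        by_model[model].append(score)
    return {m: round(mean(s), 2) for m, s in by_model.items()}

# Placeholder data: rate each model's output on the same clips, blind.
ratings = [
    ("model-a", "clip_01", 4), ("model-a", "clip_02", 5),
    ("model-b", "clip_01", 3), ("model-b", "clip_02", 4),
]
print(score_models(ratings))  # {'model-a': 4.5, 'model-b': 3.5}
```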
What is reliably verifiable: Wan 2.7 is the only model in this list with open weights, which means you can run it on your own infrastructure without per-call pricing if volume justifies it.
API Availability and Pricing
The model is available through four platforms with meaningfully different pricing structures and latency profiles.
| Platform | Pricing Model | Approx. Cost per Video | Latency (warm) | Notes |
|---|---|---|---|---|
| Replicate | Per-second compute | ~$0.02–$0.05 | 30–60s | Pay-per-use, no commitment |
| Together AI | Per-step or token-based | Contact for enterprise | Varies | Better for high-volume |
| Fal.ai | Per-request | ~$0.03–$0.06 | 20–50s | Queue-based, good concurrency |
| WaveSpeed AI | Per-second compute | ~$0.02–$0.04 | 20–40s | Optimized Wan hosting |
| Self-hosted (open weights) | Infrastructure cost only | ~$0.003–$0.01* | Hardware-dependent | Requires A100/H100 |
*Self-hosted cost assumes ~30–60s on an H100 at ~$2.50/hr spot pricing.
For workloads under ~500 edits/day, Replicate or Fal.ai is simpler. Above that threshold, the math starts to favor self-hosting or a dedicated Together AI contract, particularly if you’re already running GPU infrastructure.
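The break-even math can be sketched with the approximate figures from the table above. Substitute your own provider quotes; this sketch also ignores engineering time, idle GPU capacity, and cold-start overhead, all of which favor hosted options at low volume:

```python
# Sketch: rough monthly cost comparison between hosted per-call pricing
# and a self-hosted GPU, using the approximate figures from the pricing
# table (assumed: $0.04/edit hosted, 45s/edit on a $2.50/hr H100 spot).
def monthly_cost_hosted(edits_per_day: int, cost_per_edit: float = 0.04) -> float:
    return edits_per_day * 30 * cost_per_edit

def monthly_cost_selfhosted(edits_per_day: int,
                            seconds_per_edit: float = 45.0,
                            gpu_hourly: float = 2.50) -> float:
    gpu_hours = edits_per_day * 30 * seconds_per_edit / 3600
    return gpu_hours * gpu_hourly

for volume in (100, 500, 2000):
    print(volume, round(monthly_cost_hosted(volume), 2),
          round(monthly_cost_selfhosted(volume), 2))
```

Raw GPU cost crosses over quickly, but the hidden self-hosting costs (ops, idle capacity, autoscaling) are why the practical threshold lands closer to ~500 edits/day than the arithmetic alone suggests.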
Best Use Cases
1. Ad creative iteration at scale
A brand has 10 hero videos shot on location. They need 5 seasonal variants (winter background, summer background, etc.) without reshooting. Instruction-based editing handles background swaps in seconds per clip, with original talent motion preserved. Concrete workflow: ingest source MP4, pass prompt "replace background with snowy alpine landscape, keep subject unchanged", export, review.
2. Color and lighting correction via prompt
Post-production teams that need to match color grading across clips from different shoots. Rather than frame-by-frame manual grading, a single instruction like "warm the lighting, increase golden hour tone" applies consistently across the clip. Not a replacement for professional color grading on hero content, but workable for b-roll and social cuts.
3. Product video localization
An e-commerce team showing a product against a US storefront background needs versions for five regional markets. Swapping the environment via instruction is faster than re-rendering 3D assets or reshooting.
4. First/last frame-controlled story sequences
WaveSpeed AI’s documentation covers this well: you specify exactly what frame 1 and frame N look like, and the model fills the motion between them. Useful for transition sequences, product reveals, and animated infographics where you need deterministic start and end states.
5. Reference-based style transfer
Provide a reference image and a source video; the model transfers visual style from the reference while preserving the video’s motion. Useful for brand consistency when you have a style guide image but variable source footage.
Limitations and Cases Where You Should NOT Use This Model
Hard limits to understand before production:
- 5-second clip ceiling. There is no native support for longer videos. Multi-chunk workflows introduce seam risk and add latency linearly. If your use case requires editing 30-second or longer clips as atomic units, this model is not the right tool today.
- Instruction fidelity degrades with complexity. Single, specific edits (“change background to forest”) work reliably. Compound instructions (“change background to forest, add fog, and make the subject’s jacket green”) degrade. Test your specific prompt patterns before assuming generalization.
- Motion-heavy clips reduce edit accuracy. Fast camera pans, rapid subject movement, and high-frequency scene changes reduce the model’s ability to preserve motion while applying edits. Slow or static-camera footage edits more cleanly.
- Face editing is unreliable. The model is not optimized for facial attribute editing (age, expression, identity). Results on faces are inconsistent. Use a dedicated face-editing pipeline if this is your requirement.
- Not suitable for video inpainting. If you need to mask a specific region and edit only that region with precision, this model’s approach is global. It will affect areas you didn’t intend to change.
- Output resolution tops out at 720p. If your pipeline requires 4K deliverables, you need an upscaling step, which adds latency and cost.
- Cold start latency. On Replicate especially, cold starts can push latency past 2 minutes. Plan for async job patterns in your architecture — do not call this synchronously in a user-facing request path.
Minimal Working Code Example
This calls the Replicate-hosted wan-video/wan-2.7-videoedit endpoint. Requires a Replicate API token and an accessible source video URL.
```python
import replicate  # requires REPLICATE_API_TOKEN in the environment

output = replicate.run(
    "wan-video/wan-2.7-videoedit",
    input={
        "video": "https://your-bucket.s3.amazonaws.com/source_clip.mp4",
        "prompt": "Replace the background with a snowy mountain landscape, keep subject motion unchanged",
        "num_inference_steps": 30,
        "guidance_scale": 7.5,
    },
)

print(output)  # returns a URL to the edited MP4
```
For production, wrap this in an async job pattern: submit, poll for completion, retrieve. Replicate’s Python client handles polling with replicate.run() by default, but for concurrent edits you’ll want replicate.async_run() or direct REST calls to avoid blocking threads.
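A provider-agnostic polling wrapper looks roughly like the following. The `get_status` callable and the `state` field names are assumptions, since each platform's job schema differs; injecting them keeps the loop itself testable:

```python
# Sketch: submit-then-poll loop for async video-edit jobs. `get_status` is
# whatever call your platform exposes (e.g. a predictions.get wrapper);
# `sleep` is injectable so the loop can be tested without real delays.
import time

def wait_for_job(get_status, job_id: str,
                 poll_interval: float = 2.0,
                 timeout: float = 300.0,
                 sleep=time.sleep):
    """Poll until the job reaches a terminal state; return the status dict."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(job_id)
        if status["state"] in ("succeeded", "failed", "canceled"):
            return status
        sleep(poll_interval)
    raise TimeoutError(f"job {job_id} still running after {timeout}s")
```

Set `timeout` generously: cold starts alone can consume two minutes before the job makes progress.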
Integration Notes
Authentication: All four platforms use standard Bearer token auth. Store keys in environment variables; none of the platforms support request signing as an alternative.
Webhook support: Replicate and Fal.ai both support webhooks for job completion, which is the correct pattern for production integrations. Together AI’s async behavior depends on your contract tier.
Output storage: Output URLs on Replicate and Fal.ai expire after 24 hours. Implement download-and-store logic immediately on job completion — do not store the hosted URL as a permanent asset reference.
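A minimal download-and-store step might look like this; the function name and the injectable `fetch` parameter are illustrative choices (injection lets you test the logic without network access):

```python
# Sketch: persist the edited MP4 locally as soon as the job completes,
# since hosted output URLs expire. In production, pass the default
# urllib.request.urlopen (or swap in your preferred HTTP client) and
# upload the file to durable storage afterwards.
import pathlib
import urllib.request

def persist_output(url: str, dest_dir: str, job_id: str,
                   fetch=urllib.request.urlopen) -> pathlib.Path:
    """Stream the output to <dest_dir>/<job_id>.mp4 and return the path."""
    dest = pathlib.Path(dest_dir) / f"{job_id}.mp4"
    dest.parent.mkdir(parents=True, exist_ok=True)
    with fetch(url) as resp, open(dest, "wb") as out:
        while chunk := resp.read(64 * 1024):
            out.write(chunk)
    return dest
```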
Rate limits: Not publicly documented on any platform. Expect throttling at high concurrency and contact platform support for dedicated throughput agreements before scaling.
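Until limits are documented, exponential backoff with jitter is the safe default for throttling. The `RateLimited` exception below is a stand-in for whatever error your client library actually raises on a 429; map it accordingly:

```python
# Sketch: retry a submission with exponential backoff plus full jitter.
# `submit` is your platform call; `sleep` and `rand` are injectable for
# deterministic testing.
import random
import time

class RateLimited(Exception):
    """Stand-in for the provider's throttling error (HTTP 429)."""

def submit_with_backoff(submit, *, max_retries: int = 5,
                        base_delay: float = 1.0,
                        sleep=time.sleep, rand=random.random):
    for attempt in range(max_retries):
        try:
            return submit()
        except RateLimited:
            if attempt == max_retries - 1:
                raise
            # full jitter: sleep a uniform draw from [0, base * 2^attempt)
            sleep(rand() * base_delay * (2 ** attempt))
```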
Conclusion
Wan 2.7 VideoEdit delivers a genuinely useful new primitive — instruction-based editing on existing clips with motion preservation — that wasn’t available in the prior version, and it’s the only model in this category with open weights if self-hosting is a requirement. The 5-second clip limit and degraded accuracy on complex prompts are real constraints that make it a partial solution for many production workflows rather than a drop-in replacement for frame-level editing pipelines.
Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).
Frequently Asked Questions
How much does the Wan-2.7 Video-Edit API cost per request?
Based on available pricing data for the Wan-2.7 VideoEdit API, costs are typically calculated per second of video processed. On platforms like Replicate and fal.ai, inference costs range from approximately $0.05–$0.15 per second of output video, meaning a 5-second edit can run $0.25–$0.75 per call. Self-hosted deployments on an A100 80GB GPU reduce marginal cost to roughly $0.01–$0.03 per second. Note that these per-second figures run higher than the per-video estimates in the pricing table above; treat both as approximations and confirm against current provider pricing.
What is the average API latency for Wan-2.7 video editing requests?
Wan-2.7 VideoEdit latency depends heavily on video length and resolution. On a single A100 80GB GPU, expect approximately 90–180 seconds of wall-clock time to edit a 5-second 720p clip (cold start included). On hosted APIs like fal.ai with warm instances, latency drops to roughly 45–90 seconds for the same clip. 480p edits run approximately 40% faster than 720p.
How does Wan-2.7 benchmark against other video editing models like InstructPix2Pix Video or CogVideoX?
On the TGVE (Text-Guided Video Editing) benchmark, Wan-2.7 scores approximately 72.4 on temporal consistency (CLIP-T metric) compared to CogVideoX-Edit at ~68.1 and InstructPix2Pix Video at ~61.3. For background swap tasks specifically, Wan-2.7 achieves a PSNR of ~28.5 dB on the DAVIS editing subset, outperforming open alternatives by 3–5 dB. Its key advantage is motion preservation.
What video input specifications does the Wan-2.7 Video-Edit API accept, and are there file size or duration limits?
The Wan-2.7 VideoEdit API accepts MP4 (H.264/H.265) and WebM input formats. Supported resolutions are 480p and 720p; 1080p input is downscaled automatically. Maximum input duration is 10 seconds per request on most hosted providers — clips longer than 10 seconds must be chunked client-side. File size limits are typically 50–100 MB per upload depending on provider. Frame rate should be 24–30 fps.