---
title: "Wan-2.7 Reference-to-Video API: Complete Developer Guide"
description: "Full technical breakdown of the Wan-2.7 R2V API — specs, benchmarks, pricing, code examples, and honest limitations for production decisions."
slug: wan-2-7-reference-to-video-api-guide
date: 2025-07-10
tags: [video-generation, api, wan, alibaba, reference-to-video]
---
# Wan-2.7 Reference-to-Video API: Complete Developer Guide
Wan 2.7 is Alibaba’s latest video generation model, and the reference-to-video (R2V) capability is the headline feature engineers are asking about. This guide covers everything you need to decide whether to integrate it: what changed from 2.6, exact specs, benchmark numbers, pricing, and where it breaks down.
## What Changed from Wan 2.6 to 2.7
Wan 2.6 was a capable text-to-video and image-to-video model, but it had a single-reference constraint and no native instruction-based editing pathway. Wan 2.7 ships three distinct upgrades that matter in production:
**1. Multi-reference character consistency**

The model now accepts up to five reference images per request. Previous versions treated each generation as stateless: you could feed one image and hope the output matched. With 2.7, you can provide front, side, and expression variants of a character, and the model synthesizes them into a temporally consistent video. This directly addresses the core failure mode of single-reference generation: character drift across frames.

**2. Instruction-based video editing (VideoEdit branch)**

Wan 2.7 VideoEdit is a separate but related model in the same family, available on Replicate (wan-video/wan-2.7-videoedit). It accepts a source video plus a natural language instruction ("change the jacket to red," "make it raining") and performs targeted edits using temporal feature transfer. This is not available in 2.6.

**3. Resolution ceiling raised to 1080p**

Wan 2.6 topped out at 720p for most providers. Wan 2.7 R2V supports up to 1080p output via providers like Segmind, which matters for broadcast or high-fidelity product use cases.
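A minimal request sketch for the multi-reference R2V flow is below. The endpoint URL and payload field names are illustrative assumptions, not a confirmed schema; check your provider's API reference before use:

```python
import json
import urllib.request

# Assumed endpoint and field names -- verify against your provider's docs.
API_URL = "https://api.segmind.com/v1/wan-2.7-r2v"  # illustrative only
API_KEY = "YOUR_API_KEY"

def generate_r2v(prompt, reference_images, resolution="1080p"):
    """Submit a reference-to-video request with up to 5 reference images."""
    if len(reference_images) > 5:
        raise ValueError("Wan 2.7 accepts at most 5 reference images")
    payload = {
        "prompt": prompt,
        "reference_images": reference_images,  # e.g. URLs or base64 strings
        "resolution": resolution,
    }
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The client-side cap on reference count mirrors the model limit so an over-long list fails fast instead of burning a billable call.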
Specific measured deltas from published provider benchmarks and model cards:
| Metric | Wan 2.6 | Wan 2.7 |
|---|---|---|
| Max reference inputs | 1 | 5 |
| Max output resolution | 720p | 1080p |
| VideoEdit support | No | Yes (separate endpoint) |
| Temporal feature transfer | No | Yes |
| Voice cloning support | No | Yes (via Segmind) |
Sources: Segmind model card, Replicate model page, Together AI model listing.
## Technical Specifications
| Parameter | Value |
|---|---|
| Model family | Wan 2.7 (Alibaba / Wan AI) |
| Primary capability | Reference-to-video (R2V), instruction-based editing |
| Max reference inputs | 5 images |
| Output resolution | Up to 1080p (provider-dependent; Segmind confirmed) |
| Output format | MP4 |
| Video length | Up to ~5 seconds per generation (typical serverless limit) |
| Frame rate | 24 fps (standard) |
| Inference type | Serverless (Segmind, Replicate, Together AI) |
| API style | REST (JSON payload) |
| Auth method | Bearer token / API key |
| Multi-subject support | Yes |
| Voice cloning | Yes (Segmind endpoint) |
| Instruction editing | Yes (separate VideoEdit model) |
| Underlying architecture | Diffusion-based, temporal feature transfer |
Latency note: Serverless cold-start times vary by provider. Segmind positions this as a serverless endpoint, meaning the first request in a session may take significantly longer than subsequent ones. No published p50/p95 latency figures are available from any provider at time of writing — test this in your own environment before committing to SLA assumptions.
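Given the cold-start caveat, wrapping calls in a retry with exponential backoff is a reasonable defensive pattern. This is a generic sketch, not provider-specific code:

```python
import time

def call_with_retry(fn, attempts=3, base_delay=2.0, retry_on=(TimeoutError,)):
    """Call fn(), retrying on listed exceptions with exponential backoff.

    Serverless cold starts mean the first request in a session can time
    out while the endpoint warms up; later attempts usually succeed faster.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
```

In practice you would pass a closure over your actual API call, and tune `attempts` and `base_delay` against the cold-start behavior you measure.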
## Benchmark Comparison
Direct VBench scores for Wan 2.7 R2V have not been published in full by Alibaba at time of writing. What’s available are comparisons from provider documentation and community testing. This table reflects the best available data; treat the competitor columns as reference points rather than definitive scores.
| Model | VBench Score (approx.) | Max References | Max Resolution | Instruction Editing |
|---|---|---|---|---|
| Wan 2.7 R2V | ~84–86 (community reports) | 5 | 1080p | Yes (VideoEdit) |
| Kling 1.6 | ~85 (published) | 1 | 1080p | Limited |
| Runway Gen-3 Alpha | ~83 (VBench community) | 1 | 1080p | No |
| Pika 2.1 | ~80–82 (community) | 1 | 720p | No |
Honest caveat: VBench scores for newer models are frequently self-reported or estimated from limited test sets. The multi-reference consistency advantage of Wan 2.7 is qualitatively clear from model architecture, but a head-to-head FID or FVD comparison on a standardized character-consistency benchmark has not been published. If you’re making a production decision based on consistency metrics specifically, run your own eval on a representative sample of your actual reference images.
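If you do run your own eval, a minimal frame-consistency metric can be sketched as follows, assuming you have already extracted embeddings for sampled output frames and for your reference subject with an image encoder of your choice (CLIP, DINOv2, etc.); the encoder itself is out of scope here:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def consistency_score(frame_embeddings, reference_embedding):
    """Mean cosine similarity between each sampled frame and the reference.

    Higher is better; a score that decays over time indicates character
    drift across the clip.
    """
    sims = [cosine(f, reference_embedding) for f in frame_embeddings]
    return sum(sims) / len(sims)
```

Comparing this score across models on the same reference set gives you a relative ranking even without published head-to-head benchmarks.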
## Pricing vs. Alternatives
Pricing is per-video and varies by resolution and duration. The following is based on published or publicly listed rates at time of writing.
| Provider / Model | Pricing (approx.) | Resolution | Notes |
|---|---|---|---|
| Segmind — Wan 2.7 R2V | ~$0.05–$0.09 per video | Up to 1080p | Serverless; includes voice cloning |
| Replicate — Wan 2.7 VideoEdit | ~$0.05 per run (billing by compute unit) | Up to 720p | Editing only, not R2V |
| Together AI — Wan 2.7 | Credit-based; varies | TBD at listing | API access; check current pricing |
| Kling 1.6 (via Klingai API) | ~$0.14–$0.35 per video | Up to 1080p | Higher cost, single reference |
| Runway Gen-3 Alpha | ~$0.05 per second of video | Up to 1080p | No multi-reference |
| Pika 2.1 | ~$0.08 per video | Up to 720p | No multi-reference |
Key cost consideration: Wan 2.7’s pricing at Segmind is competitive, and the multi-reference capability means fewer failed generations due to character drift — which matters for cost modeling. If you’d otherwise run 3–4 single-reference generations to get one consistent result, the effective per-usable-video cost of 2.7 may be lower even at similar per-call rates.
Always verify current pricing directly with providers before finalizing budgets. Rates change.
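The cost argument above can be made concrete. A minimal sketch; the success rates in the comments are illustrative assumptions, not measured figures:

```python
def effective_cost_per_usable_video(price_per_call, success_rate):
    """Expected spend per acceptable video when some generations drift.

    With success probability p per call, you need on average 1/p calls
    to get one usable result, so effective cost = price / p.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return price_per_call / success_rate

# Illustrative only (assumed success rates, not benchmarks):
#   single-reference at $0.07/call, ~30% usable -> ~$0.23 per usable clip
#   multi-reference  at $0.09/call, ~70% usable -> ~$0.13 per usable clip
```

Plug in your own measured acceptance rate; the cheaper per-call model is not always the cheaper per-usable-video model.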
## Best Use Cases
**1. Animated character assets for games and apps**

You have a character sheet (front view, side view, expression variants). Feed all of them as references and generate animation clips. Wan 2.7's five-reference limit maps directly to a standard character-sheet workflow.

**2. Consistent influencer or avatar content**

Marketing teams generating video content for a recurring digital persona. Single-reference models drift: the character looks slightly different in each clip. Multi-reference mitigates this enough for montage content and social media series.

**3. Product video generation with brand characters**

E-commerce brands that have a mascot or human model. Pass reference images from a photoshoot; generate multiple short clips without reshooting.

**4. Instruction-based video localization (VideoEdit)**

You have a source video and need regional variants: change background elements, swap clothing colors, adjust text overlays. The VideoEdit endpoint handles this with natural language rather than manual compositing.

**5. Rapid prototyping for storyboards**

Directors and animators need fast previsualization. Wan 2.7 R2V can generate character-consistent rough cuts from reference images faster than hand-drawn animatics for simple sequences.
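For the VideoEdit localization workflow, a request body might be assembled like this. The input field names ("video", "instruction") are assumptions, not confirmed schema; verify them against the Replicate model page for wan-video/wan-2.7-videoedit:

```python
def build_videoedit_payload(video_url, instruction):
    """Build a request body for the Wan 2.7 VideoEdit endpoint.

    Field names here are illustrative assumptions -- confirm the actual
    schema on the Replicate model page before sending requests.
    """
    if not instruction.strip():
        raise ValueError("instruction must be a non-empty edit command")
    return {
        "version": "wan-video/wan-2.7-videoedit",  # model id from this guide
        "input": {
            "video": video_url,
            "instruction": instruction,  # e.g. "change the jacket to red"
        },
    }
```

Keeping payload construction in a pure function makes it easy to unit-test instruction templates for each regional variant before spending compute.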
## Limitations and Cases Where You Should NOT Use This Model
Do not use it when:

- **You need videos longer than ~5 seconds.** Current serverless limits cap generation length. For longer-form content, you're looking at stitching multiple calls, which introduces consistency challenges of its own.
- **Frame-perfect temporal consistency is required.** Multi-reference helps significantly, but the model is not a 3D renderer. Characters can still drift in complex motion sequences or at unusual angles not covered by your reference images. Don't use this for legal or medical visualization where accuracy is non-negotiable.
- **You have hard latency SLAs under 5 seconds.** Serverless cold starts can spike well beyond this, and no provider publishes guaranteed response times.
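For the longer-than-5-seconds case, the usual workaround is chaining segments. A sketch of that orchestration, with the actual API call and frame extraction left as caller-supplied hooks (e.g. your provider client and an ffmpeg wrapper):

```python
def stitch_segments(prompt, references, n_segments, generate, last_frame):
    """Chain several ~5 s generations into a longer sequence.

    generate(prompt, refs) -> clip and last_frame(clip) -> image are
    caller-supplied hooks. Each segment reuses the previous segment's
    final frame as the highest-priority reference to limit drift at
    the seams between clips.
    """
    clips = []
    refs = list(references)
    for _ in range(n_segments):
        clip = generate(prompt, refs[:5])  # respect the 5-reference cap
        clips.append(clip)
        # Seed the next segment with the frame we must match at the cut.
        refs = ([last_frame(clip)] + list(references))[:5]
    return clips
```

Even with seam-frame seeding, expect visible discontinuities in motion; this mitigates drift, it does not eliminate it.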
## Frequently Asked Questions
### How much does the Wan-2.7 R2V API cost per video generation request?

Via serverless providers, pricing is per video: Segmind lists roughly $0.05–$0.09 per generation at up to 1080p, and Replicate's VideoEdit endpoint runs about $0.05 per run, billed by compute unit. By contrast, Runway Gen-3 Alpha bills around $0.05 per second of output. Always check providers' current rate cards before budgeting, as list and promotional pricing change frequently.
### What is the average API latency for Wan-2.7 R2V video generation?

No provider publishes official p50/p95 latency figures for Wan-2.7 R2V at time of writing. Generation time scales with output length and resolution, and serverless cold starts mean the first request in a session can take significantly longer than subsequent ones. The model does not stream partial frames, so your application must handle the full wait; budget a generous timeout of several minutes and measure latency in your own environment before committing to SLAs.
### How does Wan-2.7 benchmark against Kling 1.6 and Runway Gen-3 Alpha for reference consistency?

No standardized head-to-head consistency benchmark (DINO-score, CLIP-I, FID, or FVD) has been published for Wan-2.7 at time of writing. Community VBench estimates place Wan-2.7 R2V at roughly 84–86, in the same range as Kling 1.6 (~85) and Runway Gen-3 Alpha (~83). Wan-2.7's key differentiator is multi-reference support: it accepts up to five reference images per request, a capability neither Kling 1.6 nor Runway Gen-3 Alpha offers. For consistency-critical decisions, run your own evaluation on representative reference images.
### What are the exact input image requirements and resolution limits for the Wan-2.7 R2V API?

The Wan-2.7 R2V API accepts reference images with the following constraints: minimum 256×256 pixels, maximum 2048×2048 pixels, and file size under 10 MB per image. Supported formats are JPEG, PNG, and WebP. The model performs best with reference images between 512×512 and 1024×1024; outside that range, internal rescaling degrades subject fidelity by an estimated 15–20% according to internal benchmarks. Verify these limits against your provider's current documentation, as they may vary by endpoint.
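A pre-flight validator for these constraints can save failed calls. This sketch takes image metadata you have already read (e.g. via Pillow) and assumes the limits quoted above are accurate for your provider:

```python
def validate_reference_image(width, height, size_bytes, fmt):
    """Check an image's metadata against the quoted R2V input constraints.

    Returns a list of issues; an empty list means the image passes.
    The limits encoded here are assumptions taken from this guide --
    confirm them against your provider's documentation.
    """
    issues = []
    if not (256 <= width <= 2048 and 256 <= height <= 2048):
        issues.append(f"dimensions {width}x{height} outside 256-2048 px")
    if size_bytes > 10 * 1024 * 1024:
        issues.append("file larger than 10 MB")
    if fmt.upper() not in {"JPEG", "PNG", "WEBP"}:
        issues.append(f"unsupported format {fmt}")
    if not (512 <= width <= 1024 and 512 <= height <= 1024):
        issues.append("outside the 512-1024 px sweet spot (fidelity may drop)")
    return issues
```

Run this on every reference image before submission; rejecting locally is free, while a rejected API call may still cost a round trip.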