---
title: "Wan-2.7 Reference-to-Video API: Complete Developer Guide"
description: "Full technical breakdown of the Wan-2.7 R2V API — specs, benchmarks, pricing, code examples, and honest limitations for production decisions."
slug: wan-2-7-reference-to-video-api-guide
date: 2025-07-10
tags: [video-generation, api, wan, alibaba, reference-to-video]
---

Wan-2.7 Reference-to-Video API: Complete Developer Guide

Wan 2.7 is Alibaba’s latest video generation model, and the reference-to-video (R2V) capability is the headline feature engineers are asking about. This guide covers everything you need to decide whether to integrate it: what changed from 2.6, exact specs, benchmark numbers, pricing, and where it breaks down.


What Changed from Wan 2.6 to 2.7

Wan 2.6 was a capable text-to-video and image-to-video model, but it had a single-reference constraint and no native instruction-based editing pathway. Wan 2.7 ships three distinct upgrades that matter in production:

1. Multi-reference character consistency: The model now accepts up to five reference images per request. Previous versions treated each generation as stateless — you could feed one image and hope the output matched. With 2.7, you can provide front, side, and expression variants of a character and the model synthesizes them into a temporally consistent video. This directly addresses the core failure mode of single-reference generation: character drift across frames.

2. Instruction-based video editing (VideoEdit branch): Wan 2.7 VideoEdit is a separate but related model in the same family, available on Replicate (wan-video/wan-2.7-videoedit). It accepts a source video plus a natural language instruction (“change the jacket to red,” “make it raining”) and performs targeted edits using temporal feature transfer. This is not available in 2.6.

3. Resolution ceiling raised to 1080p: Wan 2.6 topped out at 720p for most providers. Wan 2.7 R2V supports up to 1080p output via providers like Segmind, which matters for broadcast or high-fidelity product use cases.
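In practice, a multi-reference call is a single REST request carrying all reference images. The sketch below assumes a Segmind-style endpoint; the URL and field names (`references`, `prompt`, `resolution`) are illustrative placeholders, not the provider's documented schema — confirm the real request shape against the Segmind model card before integrating.

```python
import json
import urllib.request

# Illustrative endpoint -- NOT the official URL; check the provider's
# model card for the real path and request schema.
API_URL = "https://api.segmind.com/v1/wan-2.7-r2v"

def build_r2v_payload(reference_images, prompt, resolution="1080p"):
    """Assemble a multi-reference request body (Wan 2.7 takes 1-5 refs)."""
    if not 1 <= len(reference_images) <= 5:
        raise ValueError("Wan 2.7 accepts between 1 and 5 reference images")
    return {
        "references": reference_images,  # URLs or base64-encoded images
        "prompt": prompt,
        "resolution": resolution,
    }

def generate_video(payload, api_key):
    """POST the payload with bearer-token auth and return the JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    # Generous timeout: serverless cold starts can be slow.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())
```

A typical call would pass front, side, and expression variants of the same character as `reference_images`, which is exactly the character-sheet workflow the five-reference limit is built around.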

Specific measured deltas from published provider benchmarks and model cards:

| Metric | Wan 2.6 | Wan 2.7 |
|---|---|---|
| Max reference inputs | 1 | 5 |
| Max output resolution | 720p | 1080p |
| VideoEdit support | No | Yes (separate endpoint) |
| Temporal feature transfer | No | Yes |
| Voice cloning support | No | Yes (via Segmind) |

Sources: Segmind model card, Replicate model page, Together AI model listing.


Technical Specifications

| Parameter | Value |
|---|---|
| Model family | Wan 2.7 (Alibaba / Wan AI) |
| Primary capability | Reference-to-video (R2V), instruction-based editing |
| Max reference inputs | 5 images |
| Output resolution | Up to 1080p (provider-dependent; Segmind confirmed) |
| Output format | MP4 |
| Video length | Up to ~5 seconds per generation (typical serverless limit) |
| Frame rate | 24 fps (standard) |
| Inference type | Serverless (Segmind, Replicate, Together AI) |
| API style | REST (JSON payload) |
| Auth method | Bearer token / API key |
| Multi-subject support | Yes |
| Voice cloning | Yes (Segmind endpoint) |
| Instruction editing | Yes (separate VideoEdit model) |
| Underlying architecture | Diffusion-based, temporal feature transfer |

Latency note: Serverless cold-start times vary by provider. Segmind positions this as a serverless endpoint, meaning the first request in a session may take significantly longer than subsequent ones. No published p50/p95 latency figures are available from any provider at time of writing — test this in your own environment before committing to SLA assumptions.
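Since no provider publishes latency figures, a quick way to characterize cold-start behavior yourself is to time a few successive requests and compare the first against the warm ones. The helper below is provider-agnostic; swap the stand-in workload for your actual API call.

```python
import time

def time_calls(fn, n=3):
    """Time n successive invocations of fn. The first call in a session
    typically includes any serverless cold-start penalty; later calls
    approximate warm latency."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

if __name__ == "__main__":
    # Stand-in workload; replace the lambda with your real generation call.
    timings = time_calls(lambda: time.sleep(0.01), n=3)
    print(f"first: {timings[0]:.2f}s, warm: {min(timings[1:]):.2f}s")
```

Run this against the real endpoint at different times of day before committing to SLA numbers — serverless queueing varies with load, not just cold starts.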


Benchmark Comparison

Direct VBench scores for Wan 2.7 R2V have not been published in full by Alibaba at time of writing. What’s available are comparisons from provider documentation and community testing. This table reflects the best available data; treat the competitor columns as reference points rather than definitive scores.

| Model | VBench Score (approx.) | Max References | Max Resolution | Instruction Editing |
|---|---|---|---|---|
| Wan 2.7 R2V | ~84–86 (community reports) | 5 | 1080p | Yes (VideoEdit) |
| Kling 1.6 | ~85 (published) | 1 | 1080p | Limited |
| Runway Gen-3 Alpha | ~83 (VBench community) | 1 | 1080p | No |
| Pika 2.1 | ~80–82 (community) | 1 | 720p | No |

Honest caveat: VBench scores for newer models are frequently self-reported or estimated from limited test sets. The multi-reference consistency advantage of Wan 2.7 is qualitatively clear from model architecture, but a head-to-head FID or FVD comparison on a standardized character-consistency benchmark has not been published. If you’re making a production decision based on consistency metrics specifically, run your own eval on a representative sample of your actual reference images.
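If you do run your own consistency eval, a minimal version is: embed your reference images and a sample of generated frames with any image encoder (CLIP is the usual choice), then score each frame by its best cosine similarity to any reference. The `embed` step is left out below — it is whatever encoder you choose, not something the Wan API provides.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def consistency_score(reference_embeddings, frame_embeddings):
    """Mean best-match similarity of each generated frame to any
    reference image. Higher means less character drift across the clip."""
    return sum(
        max(cosine(f, r) for r in reference_embeddings)
        for f in frame_embeddings
    ) / len(frame_embeddings)
```

Compare scores across candidate models on the same reference set rather than reading any absolute value as meaningful — embedding similarity is a relative signal, not a calibrated benchmark.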


Pricing vs. Alternatives

Pricing is per-video and varies by resolution and duration. The following is based on published or publicly listed rates at time of writing.

| Provider / Model | Pricing (approx.) | Resolution | Notes |
|---|---|---|---|
| Segmind — Wan 2.7 R2V | ~$0.05–$0.09 per video | Up to 1080p | Serverless; includes voice cloning |
| Replicate — Wan 2.7 VideoEdit | ~$0.05 per run (billing by compute unit) | Up to 720p | Editing only, not R2V |
| Together AI — Wan 2.7 | Credit-based; varies | TBD at listing | API access; check current pricing |
| Kling 1.6 (via Klingai API) | ~$0.14–$0.35 per video | Up to 1080p | Higher cost, single reference |
| Runway Gen-3 Alpha | ~$0.05 per second of video | Up to 1080p | No multi-reference |
| Pika 2.1 | ~$0.08 per video | Up to 720p | No multi-reference |

Key cost consideration: Wan 2.7’s pricing at Segmind is competitive, and the multi-reference capability means fewer failed generations due to character drift — which matters for cost modeling. If you’d otherwise run 3–4 single-reference generations to get one consistent result, the effective per-usable-video cost of 2.7 may be lower even at similar per-call rates.
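The per-usable-video arithmetic above is worth making explicit. If each call independently produces an acceptable clip with some probability, expected spend per keeper is simply price divided by that rate. The usable rates in the example are illustrative assumptions, not measured figures — plug in your own acceptance rate from a pilot run.

```python
def effective_cost_per_usable_video(price_per_call, usable_rate):
    """Expected spend to obtain one acceptable clip, assuming each call
    independently succeeds with probability usable_rate."""
    if not 0 < usable_rate <= 1:
        raise ValueError("usable_rate must be in (0, 1]")
    return price_per_call / usable_rate

# Illustrative rates (assumptions, not benchmarks):
# single-reference competitor at $0.14/call, ~30% clips stay on-character;
# multi-reference Wan 2.7 at $0.09/call, ~80% usable.
single = effective_cost_per_usable_video(0.14, 0.30)  # ~$0.47 per keeper
multi = effective_cost_per_usable_video(0.09, 0.80)   # ~$0.11 per keeper
```

Under these assumed rates the cheaper-per-call model is also roughly 4x cheaper per usable clip, which is the comparison that actually belongs in a budget.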

Always verify current pricing directly with providers before finalizing budgets. Rates change.


Best Use Cases

1. Animated character assets for games and apps: You have a character sheet (front view, side view, expression variants). Feed all of them as references and generate animation clips. Wan 2.7’s five-reference limit maps directly to a standard character sheet workflow.

2. Consistent influencer or avatar content: Marketing teams generating video content for a recurring digital persona. Single-reference models drift — the character looks slightly different in each clip. Multi-reference mitigates this enough for montage content and social media series.

3. Product video generation with brand characters: E-commerce brands that have a mascot or human model. Pass reference images from a photoshoot; generate multiple short clips without reshooting.

4. Instruction-based video localization (VideoEdit): You have a source video and need regional variants — change background elements, swap clothing colors, adjust text overlays. The VideoEdit endpoint handles this with natural language rather than manual compositing.

5. Rapid prototyping for storyboards: Directors and animators need fast previsualization. Wan 2.7 R2V can generate character-consistent rough cuts from reference images faster than hand-drawn animatics for simple sequences.
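The localization workflow in use case 4 is naturally a fan-out: one source video, one edit request per regional variant. The field names below (`video`, `instruction`) are illustrative — confirm the actual input schema on the Replicate model page for wan-video/wan-2.7-videoedit before wiring this up.

```python
def build_videoedit_requests(source_video_url, regional_instructions):
    """One edit request per regional variant of the same source video.
    Field names are illustrative placeholders, not Replicate's
    documented schema -- check the wan-video/wan-2.7-videoedit page."""
    return [
        {
            "video": source_video_url,
            "instruction": instruction,
        }
        for instruction in regional_instructions
    ]

variants = build_videoedit_requests(
    "https://example.com/hero-clip.mp4",
    [
        "change the storefront sign text to French",
        "swap the jacket color to red",
    ],
)
```

Because each variant is an independent serverless run, the requests can be submitted concurrently; billing scales linearly with the number of variants.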


Limitations and Cases Where You Should NOT Use This Model

Do not use it when:

  • You need videos longer than ~5 seconds. Current serverless limits cap generation length. For longer-form content, you’re looking at stitching multiple calls, which introduces consistency challenges of its own.

  • Frame-perfect temporal consistency is required. Multi-reference helps significantly, but the model is not a 3D renderer. Characters can still drift in complex motion sequences or unusual angles not covered by your reference images. Don’t use this for legal/medical visualization where accuracy is non-negotiable.

  • You have hard latency SLAs under 5 seconds. Serverless cold starts can spike well beyond this. No provider publishes guaranteed response times.
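If you do attempt the stitching approach for longer content, one common mitigation for seam drift is to feed the last frame of each generated clip back in as an extra reference for the next segment. The sketch below assumes that pattern; `generate_clip` and `last_frame` are caller-supplied stand-ins for your R2V API call and a frame extractor (e.g. ffmpeg), not real library functions.

```python
def generate_long_video(base_references, prompts, generate_clip, last_frame):
    """Chain short generations into a longer sequence, carrying the last
    frame of each clip forward as an additional reference to limit drift
    at the seams. Drift still accumulates -- inspect the joins manually."""
    clips, carry = [], None
    for prompt in prompts:
        refs = list(base_references)
        if carry is not None and len(refs) < 5:
            refs.append(carry)  # stay within the 5-reference limit
        clip = generate_clip(refs, prompt)
        clips.append(clip)
        carry = last_frame(clip)
    return clips
```

Note that this multiplies both cost and latency by the number of segments, so the per-usable-video arithmetic from the pricing section applies per segment, not per finished video.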

Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).


Frequently Asked Questions

How much does the Wan-2.7 R2V API cost per video generation request?

Pricing is per-video and provider-dependent. At time of writing, Segmind lists Wan 2.7 R2V at roughly $0.05–$0.09 per video (up to 1080p), and Replicate's VideoEdit endpoint bills by compute unit at around $0.05 per run. Kling 1.6 ($0.14–$0.35 per video) and Runway Gen-3 Alpha ($0.05 per second) are generally more expensive per usable clip. Verify current rates directly with providers before budgeting, as listed and promotional pricing change frequently.

What is the average API latency for Wan-2.7 R2V video generation?

No provider publishes p50/p95 latency figures for Wan 2.7 R2V at time of writing. The model runs on serverless infrastructure at Segmind, Replicate, and Together AI, so the first request in a session can take significantly longer than subsequent warm requests, and generation time scales with resolution and clip length. Benchmark in your own environment under realistic load before building latency-sensitive workflows around it.

How does Wan-2.7 benchmark against Kling 1.6 and Runway Gen-3 Alpha for reference consistency?

Alibaba has not published full VBench scores for Wan 2.7 R2V at time of writing. Community reports place it at roughly 84–86 on VBench, in the same range as Kling 1.6 (~85, published) and Runway Gen-3 Alpha (~83, community). Its key differentiator is multi-reference support: up to five reference images per request versus one for both competitors, which qualitatively reduces character drift. No standardized head-to-head consistency benchmark has been published, so run your own evaluation on representative reference images before deciding.

What are the exact input image requirements and resolution limits for the Wan-2.7 R2V API?

Exact input constraints are provider-dependent and not uniformly published. In general, supply clear, well-lit reference images of the same subject (up to five per request), and check the Segmind and Replicate model cards for current format, dimension, and file-size limits before integrating. Treat specific pixel or file-size figures quoted in third-party guides as unverified unless they appear in official provider documentation.
