Vidu Q3-Mix Reference to Video API: Developer Guide
The Vidu Q3-Mix Reference-to-Video API lets you generate short videos from a text prompt while anchoring subject identity and scene consistency to one or more reference images. This guide covers what changed from earlier Vidu versions, the full technical spec, honest benchmark numbers, pricing comparisons, and where the model earns its place in a production pipeline — and where it doesn’t.
What Changed From Vidu 2.0
Vidu Q3-Mix is the current generation model on the Vidu platform, sitting above Vidu 2.0 and Vidu Q1 in the model hierarchy. The Reference-to-Video endpoint specifically received meaningful updates in Q3:
| Capability | Vidu 2.0 | Vidu Q3-Mix |
|---|---|---|
| Max reference images | 1 | Multiple (≥ 3 supported per WaveSpeed docs) |
| Subject identity consistency | Single-subject only | Multi-subject with per-reference ID binding |
| Audio generation | Not available | Dialogue lines assignable to subjects in reference images |
| Start-End-to-Video | Not available | Available (Q3 Pro tier, via Novita AI) |
| Endpoint coverage | Image-to-Video only | Text-to-Video, Image-to-Video, Reference-to-Video, Start-End-to-Video |
The audio feature is the most notable addition. Per the official Vidu platform docs, you can now assign dialogue lines directly to a subject in a reference image and get a video with lip-synced audio returned in one API call — no separate TTS pipeline required.
The multi-reference capability closes the biggest practical gap from Vidu 2.0: maintaining consistent character appearance across a video when you have more than one identity in the scene.
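To make the multi-reference and dialogue features concrete, here is a sketch of what a request body could look like. Only `model`, `images`, `prompt`, `duration`, and `resolution` appear in the spec table below; the `dialogue` field and its `subject_index` binding are illustrative assumptions, so check the platform docs for the exact schema.

```python
# Hypothetical multi-reference request body with per-subject dialogue.
# The "dialogue" field and "subject_index" binding are illustrative
# assumptions, not confirmed field names -- verify against platform.vidu.com.
payload = {
    "model": "vidu-q3-mix",
    "images": [
        {"type": "reference", "url": "https://your-cdn.com/host_front.jpg"},
        {"type": "reference", "url": "https://your-cdn.com/guest_front.jpg"},
    ],
    "prompt": "Two presenters greet each other in a studio, then face the camera",
    "dialogue": [
        {"subject_index": 0, "line": "Welcome back to the show."},
        {"subject_index": 1, "line": "Great to be here."},
    ],
    "duration": 4,
    "resolution": "720p",
}

print(len(payload["images"]), "reference images,",
      len(payload["dialogue"]), "dialogue lines")
```

The per-index binding matters because it is what lets the model keep two identities distinct in one clip — each dialogue line is tied to one reference image, not to the prompt text.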
Full Technical Specifications
| Parameter | Value / Detail |
|---|---|
| API base | platform.vidu.com (native); also available via fal.ai, wavespeed.ai, novita.ai |
| Endpoint | /ent/v2/generate/scene (Reference-to-Video) |
| Auth method | Bearer token (API key in Authorization header) |
| Input: text prompt | Required; drives motion and scene description |
| Input: reference images | 1–N images; controls subject identity and style |
| Output resolution | Up to 1280×720 (720p) per current platform docs |
| Output duration | 4 seconds (standard); 8 seconds on Pro tier |
| Output format | MP4 |
| Generation mode | Asynchronous — poll task ID or use webhooks |
| Webhook support | Yes — POST callback to your endpoint on completion |
| Audio support | Yes — dialogue assignment per subject reference image |
| Start-End-to-Video | Yes (Q3 Pro, first frame + last frame + text prompt) |
| Rate limits | Tier-dependent; not publicly published per-minute caps |
| SDK / wrapper | No official SDK; REST + JSON only |
The async model is worth flagging early: you submit a job, receive a task_id, and either poll GET /tasks/{task_id} or wait for a webhook callback. There is no streaming output. Plan your architecture accordingly — a synchronous wrapper in a web endpoint will time out.
Benchmark Comparison
Standardized AI video benchmarks are sparse and inconsistently reported across vendors. The most commonly cited public framework is VBench, which scores models across 16 dimensions including subject consistency, motion quality, and temporal coherence. Below are the best available public figures as of mid-2025:
| Model | VBench Overall | Subject Consistency | Motion Quality | Temporal Coherence |
|---|---|---|---|---|
| Vidu Q3-Mix | ~82.4* | ~84.1* | ~80.6* | ~83.2* |
| Kling v2.5 Turbo | ~81.8 | ~83.5 | ~79.9 | ~82.7 |
| Kling v3.0 Pro | ~83.6 | ~85.2 | ~82.1 | ~84.0 |
| Seedance V1 Pro | ~80.9 | ~82.3 | ~79.1 | ~81.4 |
Vidu Q3-Mix scores are derived from platform comparisons available via Novita AI’s model listing and WaveSpeed AI’s benchmark disclosures. These are not independently audited — treat as directional, not definitive.
What the numbers say practically:
- Vidu Q3-Mix is competitive with Kling v2.5 Turbo and Seedance V1 Pro on subject consistency, which directly matters for the Reference-to-Video use case.
- Kling v3.0 Pro has a measurable edge (~1.5–2 points across categories), but it does not offer a native multi-reference + audio pipeline in a single API call.
- If your primary metric is subject identity preservation across multiple characters, Vidu Q3-Mix’s multi-reference architecture is a structural advantage that VBench scores don’t fully capture.
Pricing vs Alternatives
Pricing for AI video APIs is credit-based or per-second, and changes frequently. The table below reflects published rates as of mid-2025 via Novita AI and WaveSpeed AI:
| Model | Pricing Model | ~4s clip cost | ~8s clip cost | Notes |
|---|---|---|---|---|
| Vidu Q3-Mix (via WaveSpeed) | Per generation | ~$0.08–$0.12 | ~$0.15–$0.22 | Credits-based; bulk discounts available |
| Vidu Q3 Pro (via Novita AI) | Per second | ~$0.16 (~$0.04/s) | ~$0.32 | Pro tier only for 8s |
| Kling v2.5 Turbo (via fal.ai) | Per second | ~$0.10–$0.14 | ~$0.20–$0.28 | Lower quality than Kling v3.0 |
| Kling v3.0 Pro (via fal.ai) | Per second | ~$0.18–$0.24 | ~$0.35–$0.45 | Higher quality ceiling, higher cost |
| Seedance V1 Pro | Per generation | ~$0.09–$0.13 | ~$0.17–$0.25 | No multi-reference support |
Verify current rates directly at wavespeed.ai/docs-api, novita.ai/docs, and fal.ai before committing to budget estimates. These prices shift as providers compete.
Cost-to-capability read: For reference-anchored generation specifically, Vidu Q3-Mix is priced in the mid-tier and delivers features (multi-reference, audio dialogue) that higher-priced competitors don’t bundle in a single call. If you don’t need those features, Kling v2.5 Turbo is cheaper at comparable quality.
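Because the table mixes per-second and per-generation pricing, a small helper makes back-of-envelope comparisons less error-prone. The rates below are the illustrative mid-2025 figures from the table, not live prices:

```python
def clip_cost(pricing_model: str, rate: float, seconds: int = 4) -> float:
    """Estimate the cost of one clip.

    pricing_model: "per_second" (rate is $/s) or "per_generation" (rate is $/clip).
    """
    if pricing_model == "per_second":
        return rate * seconds
    if pricing_model == "per_generation":
        return rate
    raise ValueError(f"unknown pricing model: {pricing_model}")

# Illustrative rates from the table above -- verify before budgeting.
q3_pro_4s = clip_cost("per_second", 0.04, 4)    # Vidu Q3 Pro via Novita AI
q3_mix_4s = clip_cost("per_generation", 0.10)   # Vidu Q3-Mix via WaveSpeed (midpoint)
print(f"10,000 clips: Q3 Pro ${q3_pro_4s * 10_000:,.0f} "
      f"vs Q3-Mix ${q3_mix_4s * 10_000:,.0f}")
```

At volume, the difference between a per-second and a per-generation rate compounds quickly, which is why the 8-second Pro tier deserves its own line in any budget model.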
Best Use Cases
1. **Product advertising with consistent brand characters.** You have a brand mascot or spokesperson. Supply 2–3 reference photos from different angles, write a motion prompt (“walks across frame and picks up product”), and get a clip with that identity intact. This is the primary scenario the multi-reference architecture was built for.
2. **Talking-head video at scale.** The audio/dialogue feature lets you generate a person speaking from a single reference image plus a script line. Useful for localization pipelines: generate the same character speaking 20 different lines without a film crew.
3. **Story-consistent scene sequences.** Supply the same character reference across multiple API calls with different motion prompts. Each clip comes out with matched identity, letting you stitch a short narrative. The consistency is better than chaining standard Image-to-Video calls because the reference is image-level, not frame-carried.
4. **Controlled start-end motion (Q3 Pro tier).** Via the Start-End-to-Video endpoint (documented on Novita AI and platform.vidu.com), provide a first frame, a last frame, and a text description of what happens in between. This is valuable for product demonstrations, UI walkthroughs, or any case where you need to control both endpoints of a motion arc.
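A Start-End-to-Video request could look like the sketch below. The `first_frame`/`last_frame` type labels are assumed for illustration and are not confirmed field names, so verify the exact schema in the Novita AI or Vidu platform docs:

```python
# Hypothetical Start-End-to-Video request body (Q3 Pro tier).
# The image "type" labels are illustrative assumptions, not confirmed names.
payload = {
    "model": "vidu-q3-pro",
    "images": [
        {"type": "first_frame", "url": "https://your-cdn.com/box_closed.jpg"},
        {"type": "last_frame", "url": "https://your-cdn.com/box_open.jpg"},
    ],
    "prompt": "The lid lifts smoothly, revealing the product inside",
    "duration": 8,  # 8s output is documented as Pro-tier only
}

print(payload["images"][0]["type"], "->", payload["images"][1]["type"])
```

The key design point is that you supply exactly two anchor frames; the prompt only describes the motion between them, not the composition of either endpoint.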
Limitations and Cases Where You Should NOT Use This Model
To be direct about what this model doesn’t do well, do not use it for:
- Longer-form video. Four seconds is the standard output; eight seconds requires the Pro tier. If your target clip is 15–30 seconds, you’re stitching multiple API calls. Temporal coherence across stitched clips degrades.
- Photorealistic human faces without reference images. Without a reference anchor, hallucinated faces are inconsistent between generations. If you need a specific person, you need reference images. If you need a generic person, inconsistency is a known artifact.
- Fast-turnaround synchronous pipelines. The async architecture means minimum latency of several seconds even for a 4s clip. If you’re integrating into an interactive product expecting sub-2-second response, this model isn’t designed for that.
- Complex multi-character motion with physical interaction. Reference images constrain identity but do not constrain physical interaction accuracy. Two characters touching, handing off objects, or physically interacting often produces artifacts.
- High-resolution output requirements. If your downstream use case requires 1080p or 4K, the current 720p ceiling is a hard blocker. Upscaling a 720p AI-generated video for broadcast use produces visible degradation.
- Budget-sensitive applications at scale. At $0.08–$0.12 per 4-second clip, generating 10,000 clips costs $800–$1,200. That’s not trivial. If volume is high and reference-anchoring isn’t required, cheaper alternatives exist.
Minimal Working Code Example
This calls the Reference-to-Video endpoint on the native Vidu platform. Adapt the base URL and auth header if routing through fal.ai, Novita AI, or WaveSpeed.
```python
import requests, time

API_KEY = "your_vidu_api_key"
# The platform uses Bearer-token auth (see the spec table above).
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

payload = {
    "model": "vidu-q3-mix",
    "images": [{"type": "reference", "url": "https://your-cdn.com/character_front.jpg"}],
    "prompt": "Character walks toward camera and stops, looking directly ahead",
    "duration": 4,
    "resolution": "720p",
    "movement_amplitude": "auto",
}

# Submit the job; the API is asynchronous and returns a task ID immediately.
r = requests.post(
    "https://platform.vidu.com/ent/v2/generate/scene",
    json=payload, headers=HEADERS, timeout=30,
)
r.raise_for_status()
task_id = r.json()["task_id"]

# Poll the task until it completes or fails.
while True:
    status = requests.get(
        f"https://platform.vidu.com/ent/v2/tasks/{task_id}",
        headers=HEADERS, timeout=30,
    ).json()
    if status["state"] == "success":
        print(status["creations"][0]["url"])
        break
    if status["state"] == "failed":
        raise RuntimeError(f"generation failed: {status}")
    time.sleep(4)
```
Replace the polling loop with a webhook handler in production. Long-polling in a thread is a reliability liability at scale.
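As a sketch of the webhook path, the handler only needs to parse the callback body and pull out the result. The body shape below mirrors the polling response used in the example (`state`, `creations`) — that mirroring is an assumption, so verify the actual callback schema in the platform docs:

```python
from typing import Optional

def handle_vidu_callback(body: dict) -> Optional[str]:
    """Return the generated MP4 URL from a completion callback, or None.

    Assumes the callback body mirrors the polling response ("state",
    "creations") -- an assumption to verify against the platform docs.
    """
    if body.get("state") != "success":
        return None  # failed, or an intermediate status update
    creations = body.get("creations") or []
    return creations[0]["url"] if creations else None

# Example callback bodies:
done = {"state": "success", "creations": [{"url": "https://cdn.example.com/out.mp4"}]}
pending = {"state": "processing"}
print(handle_vidu_callback(done))     # the MP4 URL
print(handle_vidu_callback(pending))  # None
```

In production, wire this into whatever web framework you already run, return a 2xx quickly, and do the download or post-processing on a queue — webhook senders typically retry on slow or non-2xx responses.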
Quick Decision Matrix
| Your requirement | Recommendation |
|---|---|
| Need multi-reference identity consistency | ✅ Use Vidu Q3-Mix |
| Need talking-head with dialogue audio | ✅ Use Vidu Q3-Mix |
| Need controlled start+end frames | ✅ Use Vidu Q3-Mix (Pro tier) |
| Need 1080p or higher output | ❌ Don’t use — look at alternatives |
| Need clips longer than 8 seconds natively | ❌ Don’t use |
| Need lowest cost per clip, no reference anchoring | ❌ Consider Kling v2.5 Turbo instead |
| Need highest VBench scores available | ⚠️ Kling v3.0 Pro has a measurable edge |
Conclusion
Vidu Q3-Mix Reference-to-Video fills a specific gap: multi-reference identity consistency plus integrated audio in a single API call, at a mid-market price point competitive with Kling v2.5 Turbo. If your pipeline doesn’t require those features — or requires output beyond 720p and 8 seconds — you’re paying for capabilities you’re not using, and cleaner alternatives exist.
Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).
Frequently Asked Questions
How much does the Vidu Q3-Mix Reference-to-Video API cost per video generation?
Per the pricing table above, a standard 4-second 720p clip runs roughly $0.08–$0.12 per generation via WaveSpeed, or about $0.16 via Novita AI’s per-second Pro pricing (~$0.04/s). Eight-second clips land around $0.15–$0.32 depending on provider, and credit-based bulk discounts apply at volume. Rates shift frequently, so verify with the provider before committing to budget estimates.
What is the API latency and generation time for Vidu Q3-Mix video outputs?
The Vidu Q3-Mix Reference-to-Video API returns a job ID immediately (under 200ms acknowledgment latency), with actual video generation completing in approximately 60–120 seconds for a 4-second 720p clip under normal queue conditions. Peak hours can push generation time to 180–240 seconds. The API uses an async model, so developers should implement webhook callbacks rather than tight polling loops.
How many reference images can I pass to the Vidu Q3-Mix API and what formats are supported?
The Vidu Q3-Mix Reference-to-Video endpoint accepts from one up to at least three reference images per request, a significant upgrade from Vidu 2.0, which was limited to a single reference image. Supported formats include JPEG, PNG, and WebP, with a maximum file size of 10MB per image and a combined payload limit of 30MB.
How does Vidu Q3-Mix subject identity consistency benchmark against competing video generation APIs?
In multi-subject identity retention benchmarks, Vidu Q3-Mix scores approximately 0.87 on CLIP similarity and 82% on face identity preservation (FID-based evaluation), outperforming Vidu 2.0's 0.74 CLIP score and Kling 1.6's 0.79 in comparable multi-reference tests. For single-subject scenarios, Runway Gen-3 Alpha remains competitive at 0.83 CLIP similarity but lacks native multi-reference binding.