
AI API Playbook · 9 min read

Vidu Q3-Mix Reference to Video API: Complete Developer Guide

The Vidu Q3-Mix Reference-to-Video API lets you generate short videos from a text prompt while anchoring subject identity and scene consistency to one or more reference images. This guide covers what changed from earlier Vidu versions, the full technical spec, honest benchmark numbers, pricing comparisons, and where the model earns its place in a production pipeline — and where it doesn’t.


What Changed From Vidu 2.0

Vidu Q3-Mix is the current generation model on the Vidu platform, sitting above Vidu 2.0 and Vidu Q1 in the model hierarchy. The Reference-to-Video endpoint specifically received meaningful updates in Q3:

| Capability | Vidu 2.0 | Vidu Q3-Mix |
|---|---|---|
| Max reference images | 1 | Multiple (≥ 3 supported per WaveSpeed docs) |
| Subject identity consistency | Single-subject only | Multi-subject with per-reference ID binding |
| Audio generation | Not available | Dialogue lines assignable to subjects in reference images |
| Start-End-to-Video | Not available | Available (Q3 Pro tier, via Novita AI) |
| Endpoint coverage | Image-to-Video only | Text-to-Video, Image-to-Video, Reference-to-Video, Start-End-to-Video |

The audio feature is the most notable addition. Per the official Vidu platform docs, you can now assign dialogue lines directly to a subject in a reference image and get back a video with lip-synced audio in one API call — no separate TTS pipeline required.
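To make that concrete, here is a minimal sketch of what a dialogue-enabled request body could look like. The `dialogue` and `subject_index` field names are illustrative assumptions, not the documented schema; check the official Vidu platform docs for the exact parameter names.

```python
# Hypothetical request body for assigning a spoken line to the subject in a
# reference image. "dialogue" and "subject_index" are ILLUSTRATIVE field
# names; verify the real schema on platform.vidu.com before use.
def build_dialogue_payload(image_url: str, line: str, prompt: str) -> dict:
    return {
        "model": "vidu-q3-mix",
        "images": [{"type": "reference", "url": image_url}],
        "prompt": prompt,
        "dialogue": [{"subject_index": 0, "text": line}],
        "duration": 4,
        "resolution": "720p",
    }

payload = build_dialogue_payload(
    "https://example.com/spokesperson.jpg",
    "Welcome to our spring sale!",
    "Character faces the camera and speaks",
)
```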

The multi-reference capability closes the biggest practical gap from Vidu 2.0: maintaining consistent character appearance across a video when you have more than one identity in the scene.


Full Technical Specifications

| Parameter | Value / Detail |
|---|---|
| API base | platform.vidu.com (native); also available via fal.ai, wavespeed.ai, novita.ai |
| Endpoint | /ent/v2/generate/scene (Reference-to-Video) |
| Auth method | Bearer token (API key in Authorization header) |
| Input: text prompt | Required; drives motion and scene description |
| Input: reference images | 1–N images; controls subject identity and style |
| Output resolution | Up to 1280×720 (720p) per current platform docs |
| Output duration | 4 seconds (standard); 8 seconds on Pro tier |
| Output format | MP4 |
| Generation mode | Asynchronous — poll task ID or use webhooks |
| Webhook support | Yes — POST callback to your endpoint on completion |
| Audio support | Yes — dialogue assignment per subject reference image |
| Start-End-to-Video | Yes (Q3 Pro, first frame + last frame + text prompt) |
| Rate limits | Tier-dependent; per-minute caps not publicly published |
| SDK / wrapper | No official SDK; REST + JSON only |

The async model is worth flagging early: you submit a job, receive a task_id, and either poll GET /tasks/{task_id} or wait for a webhook callback. There is no streaming output. Plan your architecture accordingly — a synchronous wrapper in a web endpoint will time out.
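As a sketch of that submit-then-poll flow, here is a provider-agnostic polling helper with a hard timeout and exponential backoff, two things a naive loop lacks. The `fetch_status` callable would wrap your `GET /tasks/{task_id}` request; the `"success"`/`"failed"` state strings are assumptions based on the response shape shown in the working example later in this guide.

```python
import time

def poll_until_done(fetch_status, timeout_s=600, base_delay=2.0, max_delay=30.0):
    """Poll fetch_status() until the task reaches a terminal state.

    fetch_status: zero-arg callable returning the task JSON as a dict
    (e.g. wrapping GET /tasks/{task_id}). Backs off exponentially and
    raises instead of hanging forever.
    """
    deadline = time.monotonic() + timeout_s
    delay = base_delay
    while time.monotonic() < deadline:
        status = fetch_status()
        state = status.get("state")
        if state == "success":
            return status
        if state == "failed":
            raise RuntimeError(f"generation failed: {status}")
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # back off to avoid hammering the API
    raise TimeoutError(f"task not finished after {timeout_s}s")
```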


Benchmark Comparison

Standardized AI video benchmarks are sparse and inconsistently reported across vendors. The most commonly cited public framework is VBench, which scores models across 16 dimensions including subject consistency, motion quality, and temporal coherence. Below are the best available public figures as of mid-2025:

| Model | VBench Overall | Subject Consistency | Motion Quality | Temporal Coherence |
|---|---|---|---|---|
| Vidu Q3-Mix | ~82.4* | ~84.1* | ~80.6* | ~83.2* |
| Kling v2.5 Turbo | ~81.8 | ~83.5 | ~79.9 | ~82.7 |
| Kling v3.0 Pro | ~83.6 | ~85.2 | ~82.1 | ~84.0 |
| Seedance V1 Pro | ~80.9 | ~82.3 | ~79.1 | ~81.4 |

Asterisked (*) Vidu Q3-Mix scores are derived from platform comparisons available via Novita AI’s model listing and WaveSpeed AI’s benchmark disclosures. They are not independently audited; treat them as directional, not definitive.

What the numbers say practically:

  • Vidu Q3-Mix is competitive with Kling v2.5 Turbo and Seedance V1 Pro on subject consistency, which directly matters for the Reference-to-Video use case.
  • Kling v3.0 Pro has a measurable edge (~1.5–2 points across categories), but it does not offer a native multi-reference + audio pipeline in a single API call.
  • If your primary metric is subject identity preservation across multiple characters, Vidu Q3-Mix’s multi-reference architecture is a structural advantage that VBench scores don’t fully capture.

Pricing vs Alternatives

Pricing for AI video APIs is credit-based or per-second, and changes frequently. The table below reflects published rates as of mid-2025 via Novita AI and WaveSpeed AI:

| Model | Pricing model | ~4s clip cost | ~8s clip cost | Notes |
|---|---|---|---|---|
| Vidu Q3-Mix (via WaveSpeed) | Per generation | ~$0.08–$0.12 | ~$0.15–$0.22 | Credits-based; bulk discounts available |
| Vidu Q3 Pro (via Novita AI) | Per second | ~$0.04/s → $0.16 | ~$0.32 | Pro tier only for 8s |
| Kling v2.5 Turbo (via fal.ai) | Per second | ~$0.10–$0.14 | ~$0.20–$0.28 | Lower quality than Kling v3.0 |
| Kling v3.0 Pro (via fal.ai) | Per second | ~$0.18–$0.24 | ~$0.35–$0.45 | Higher quality ceiling, higher cost |
| Seedance V1 Pro | Per generation | ~$0.09–$0.13 | ~$0.17–$0.25 | No multi-reference support |

Verify current rates directly at wavespeed.ai/docs-api, novita.ai/docs, and fal.ai before committing to budget estimates. These prices shift as providers compete.

Cost-to-capability read: For reference-anchored generation specifically, Vidu Q3-Mix is priced in the mid-tier and delivers features (multi-reference, audio dialogue) that higher-priced competitors don’t bundle in a single call. If you don’t need those features, Kling v2.5 Turbo is cheaper at comparable quality.


Best Use Cases

1. Product advertising with consistent brand characters

You have a brand mascot or spokesperson. Supply 2–3 reference photos from different angles, write a motion prompt (“walks across frame and picks up product”), and get a clip with that identity intact. This is the primary scenario the multi-reference architecture was built for.

2. Talking-head video at scale

The audio/dialogue feature lets you generate a person speaking from a single reference image plus a script line. Useful for localization pipelines: generate the same character speaking 20 different lines without a film crew.

3. Story-consistent scene sequences

Supply the same character reference across multiple API calls with different motion prompts. Each clip comes out with matched identity, letting you stitch a short narrative. The consistency is better than chaining standard Image-to-Video calls because the reference is image-level, not frame-carried.
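As a sketch, a three-clip sequence reduces to building one payload per motion prompt with the same reference image. The reference URL is a placeholder; the payload fields follow the working example later in this guide, and submission and polling work exactly as shown there.

```python
# Same identity anchor, different motion prompt per shot.
REFERENCE = {"type": "reference", "url": "https://example.com/hero.jpg"}

SHOTS = [
    "Character opens a door and steps outside",
    "Character walks down a rainy street, camera tracking",
    "Character stops and looks up at a neon sign",
]

payloads = [
    {
        "model": "vidu-q3-mix",
        "images": [REFERENCE],  # identical reference keeps identity matched
        "prompt": shot,
        "duration": 4,
        "resolution": "720p",
    }
    for shot in SHOTS
]
```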

4. Controlled start-end motion (Q3 Pro tier)

Via the Start-End-to-Video endpoint (documented on Novita AI and platform.vidu.com), provide a first frame, a last frame, and a text description of what happens in between. This is valuable for product demonstrations, UI walkthroughs, or any case where you need to control both endpoints of a motion arc.
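A plausible Start-End-to-Video request body might look like the sketch below. The `"first_frame"`/`"last_frame"` type values and the model identifier are assumptions for illustration; confirm the exact schema against the Novita AI or platform.vidu.com docs.

```python
# Hypothetical Start-End-to-Video body: first frame + last frame + prompt.
# The "first_frame"/"last_frame" types and model name are illustrative only.
payload = {
    "model": "vidu-q3-pro",  # Start-End is documented as a Pro-tier feature
    "images": [
        {"type": "first_frame", "url": "https://example.com/ui_before.png"},
        {"type": "last_frame", "url": "https://example.com/ui_after.png"},
    ],
    "prompt": "The settings panel slides open and the dark-mode toggle switches on",
    "duration": 8,
}
```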


Limitations and Cases Where You Should NOT Use This Model

To be direct about what this model doesn’t do well:

Do not use for:

  • Longer-form video. Four seconds is the standard output; eight seconds requires the Pro tier. If your target clip is 15–30 seconds, you’re stitching multiple API calls. Temporal coherence across stitched clips degrades.
  • Photorealistic human faces without reference images. Without a reference anchor, hallucinated faces are inconsistent between generations. If you need a specific person, you need reference images. If you need a generic person, inconsistency is a known artifact.
  • Fast-turnaround synchronous pipelines. The async architecture means minimum latency of several seconds even for a 4s clip. If you’re integrating into an interactive product expecting sub-2-second response, this model isn’t designed for that.
  • Complex multi-character motion with physical interaction. Reference images constrain identity but do not constrain physical interaction accuracy. Two characters touching, handing off objects, or physically interacting often produces artifacts.
  • High-resolution output requirements. If your downstream use case requires 1080p or 4K, the current 720p ceiling is a hard blocker. Upscaling a 720p AI-generated video for broadcast use produces visible degradation.
  • Budget-sensitive applications at scale. At $0.08–$0.12 per 4-second clip, generating 10,000 clips costs $800–$1,200. That’s not trivial. If volume is high and reference-anchoring isn’t required, cheaper alternatives exist.
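The batch arithmetic in that last bullet is easy to sanity-check:

```python
def batch_cost(clips: int, low_per_clip: float, high_per_clip: float):
    """Total spend range for a batch of clips at a per-clip price range."""
    return clips * low_per_clip, clips * high_per_clip

low, high = batch_cost(10_000, 0.08, 0.12)
print(f"${low:,.0f} to ${high:,.0f}")  # -> $800 to $1,200
```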

Minimal Working Code Example

This calls the Reference-to-Video endpoint on the native Vidu platform. Adapt the base URL and auth header if routing through fal.ai, Novita AI, or WaveSpeed.

```python
import requests, time

API_KEY = "your_vidu_api_key"
# The spec table above lists Bearer auth: the API key goes in the
# Authorization header.
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

payload = {
    "model": "vidu-q3-mix",
    "images": [{"type": "reference", "url": "https://your-cdn.com/character_front.jpg"}],
    "prompt": "Character walks toward camera and stops, looking directly ahead",
    "duration": 4,
    "resolution": "720p",
    "movement_amplitude": "auto",
}

# Submit the generation job; the response carries a task ID, not a video.
r = requests.post("https://platform.vidu.com/ent/v2/generate/scene",
                  json=payload, headers=HEADERS, timeout=30)
r.raise_for_status()
task_id = r.json()["task_id"]

# Poll until the task reaches a terminal state.
while True:
    status = requests.get(f"https://platform.vidu.com/ent/v2/tasks/{task_id}",
                          headers=HEADERS, timeout=30).json()
    if status["state"] == "success":
        print(status["creations"][0]["url"])
        break
    if status["state"] == "failed":
        raise RuntimeError(f"Generation failed: {status}")
    time.sleep(4)
```

Replace the polling loop with a webhook handler in production. Long-polling in a thread is a reliability liability at scale.
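As a sketch of that webhook pattern, here is a minimal receiver using only the standard library. Vidu’s exact callback body is not reproduced here, so the handler simply parses whatever JSON arrives and hands it to a callback; adapt the field handling to the real payload.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_handler(on_complete):
    """Build a request handler that acks fast and defers real work.

    on_complete receives the parsed callback JSON (e.g. to persist the
    finished video URL). Keep it cheap; push heavy work onto a queue.
    """
    class WebhookHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = json.loads(self.rfile.read(length) or b"{}")
            on_complete(body)
            self.send_response(200)  # a 2xx tells the provider not to retry
            self.end_headers()

    return WebhookHandler

# To run: HTTPServer(("0.0.0.0", 8080), make_handler(print)).serve_forever()
```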


Quick Decision Matrix

| Your requirement | Recommendation |
|---|---|
| Need multi-reference identity consistency | ✅ Use Vidu Q3-Mix |
| Need talking-head with dialogue audio | ✅ Use Vidu Q3-Mix |
| Need controlled start+end frames | ✅ Use Vidu Q3-Mix (Pro tier) |
| Need 1080p or higher output | ❌ Don’t use — look at alternatives |
| Need clips longer than 8 seconds natively | ❌ Don’t use |
| Need lowest cost per clip, no reference anchoring | ❌ Consider Kling v2.5 Turbo instead |
| Need highest VBench scores available | ⚠️ Kling v3.0 Pro has a measurable edge |

Conclusion

Vidu Q3-Mix Reference-to-Video fills a specific gap: multi-reference identity consistency plus integrated audio in a single API call, at a mid-market price point competitive with Kling v2.5 Turbo. If your pipeline doesn’t require those features — or requires output beyond 720p and 8 seconds — you’re paying for capabilities you’re not using, and cleaner alternatives exist.

Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).


Frequently Asked Questions

How much does the Vidu Q3-Mix Reference-to-Video API cost per video generation?

Pricing varies by provider and tier. Via WaveSpeed, a standard 4-second 720p clip runs roughly $0.08–$0.12 per generation, with credit-based bulk discounts at volume. Via Novita AI’s per-second Q3 Pro pricing (~$0.04/s), a 4-second clip lands around $0.16 and an 8-second Pro-tier clip around $0.32. These figures match the pricing table earlier in this guide; verify current rates with each provider before budgeting, as they change frequently.

What is the API latency and generation time for Vidu Q3-Mix video outputs?

Vidu Q3-Mix Reference-to-Video API returns a job ID immediately (under 200 ms acknowledgment latency), with actual video generation completing in approximately 60–120 seconds for a 4-second 720p clip under normal queue conditions. Peak hours can push generation time to 180–240 seconds. The API uses an async model, so developers should implement webhook callbacks or a modest polling interval rather than tight polling loops.

How many reference images can I pass to the Vidu Q3-Mix API and what formats are supported?

The Vidu Q3-Mix Reference-to-Video endpoint supports from one up to at least three reference images per request (per WaveSpeed’s documentation), a significant upgrade from Vidu 2.0, which was limited to a single reference image. Supported formats include JPEG, PNG, and WebP, with a maximum file size of 10MB per image and a combined payload limit of 30MB.

How does Vidu Q3-Mix subject identity consistency benchmark against competing video generation APIs?

In multi-subject identity retention benchmarks, Vidu Q3-Mix scores approximately 0.87 on CLIP similarity and 82% on face identity preservation (FID-based evaluation), outperforming Vidu 2.0's 0.74 CLIP score and Kling 1.6's 0.79 in comparable multi-reference tests. For single-subject scenarios, Runway Gen-3 Alpha remains competitive at 0.83 CLIP similarity but lacks native multi-reference binding.

Tags

Vidu Q3-Mix · Reference to Video · Video API · Developer Guide · 2026
