Vidu Q3 Reference to Video API: Complete Developer Guide
The Vidu Q3 Reference to Video API lets you anchor video generation to one or more reference images — controlling character appearance, scene composition, or object identity — while still driving motion and narrative through a text prompt. If you’ve been evaluating text-to-video APIs and found pure prompt-based generation too unpredictable for production, reference-based generation is the workflow worth understanding.
This guide covers the technical spec, how reference-to-video differs from Vidu’s earlier models, where it sits against competitors, and what it actually costs.
What’s New in Vidu Q3 vs. Q1
Vidu has shipped two public API generations before Q3: Q1 and Q1 Turbo. The jump to Q3 is not cosmetic.
| Capability | Vidu Q1 | Vidu Q3 |
|---|---|---|
| Max resolution | 720p | 1080p |
| Resolution options | 720p only | 540p, 720p, 1080p |
| Anime-style generation | Not supported | Supported natively |
| Background music generation | Not supported | Supported |
| Reference image input | Single image | Multiple reference images |
| Turbo variant | Q1 Turbo available | Q3 Turbo available |
Three of those changes matter practically:
1080p output closes the gap with Kling and Runway for use cases requiring broadcast or high-DPI display output. At 720p you’re borderline; at 1080p you have room to crop or reframe in post.
Multiple reference images is the bigger architectural shift. Q1 took one image and extrapolated. Q3 lets you pass a character reference and a background reference simultaneously, which is the difference between “loosely inspired by” and “consistent with.” For product demos, game cinematics, and character-driven content, multi-reference is a production feature, not a nice-to-have.
Anime-style flag means you’re not fighting the model’s default photorealism when generating stylized content. Without a dedicated mode, anime prompts on generalist models often produce uncanny hybrid outputs.
Full Technical Specification
| Parameter | Value |
|---|---|
| Model identifier | vidu-q3 |
| Generation type | Text-to-Video, Reference-to-Video (Image+Text) |
| Input modalities | Text prompt, reference image URLs |
| Output resolutions | 540p, 720p, 1080p |
| Supported styles | Realistic, Anime |
| Audio support | Background music generation (optional) |
| API pattern | Async: POST to submit → GET to poll results |
| Request format | JSON over HTTPS |
| Authentication | Bearer token (API key) |
| Output format | Video URL (MP4) |
| Task lifecycle | Submitted → Processing → Completed / Failed |
| Turbo variant | Vidu Q3 Turbo (faster inference, same resolution ceiling) |
| Reference image input | Single or multiple URLs |
| Platforms | Vidu native (platform.vidu.com), WaveSpeed AI, fal.ai, Novita AI |
The async pattern is standard for video generation — you POST a task and poll for completion. Unlike some APIs that return a synchronous stream, Vidu Q3 gives you a task ID immediately and expects you to GET status. Plan your integration around webhook or polling logic accordingly.
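The submit-then-poll lifecycle can be sketched as a small provider-agnostic helper. Apart from the documented Submitted → Processing → Completed / Failed states, everything here is an assumption: `fetch_status` stands in for whatever GET call your provider exposes, and the response field names are illustrative.

```python
import time

def poll_task(fetch_status, task_id, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll an async video task until it reaches a terminal state.

    fetch_status(task_id) -> dict with a "state" key, e.g.
    {"state": "Processing"} or {"state": "Completed", "video_url": "..."}.
    The state names mirror the documented lifecycle:
    Submitted -> Processing -> Completed / Failed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(task_id)
        state = status.get("state", "").lower()
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(f"task {task_id} failed: {status}")
        sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
```

Injecting `fetch_status` and `sleep` keeps the control flow testable without network access; in production, `fetch_status` wraps your authenticated GET request.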
Reference to Video: How It Actually Works
The Reference to Video endpoint (documented at platform.vidu.com/docs/reference-to-video) accepts two sub-modes:
- Video Direct Generation — uses reference images to constrain visual identity, outputs video only
- Audio & Video Direct Generation — same visual pipeline, but also generates background music synchronized to the output
Both sub-modes share the same request address and request headers. The distinction is a flag in the request body, not a different endpoint. This is a clean design choice: you’re not maintaining two separate integration paths.
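As a sketch of that design, the two sub-modes might differ by a single boolean in an otherwise identical request body. The field names below (`model`, `prompt`, `reference_image_urls`, `generate_audio`) are illustrative placeholders, not confirmed schema; check them against platform.vidu.com/docs/reference-to-video.

```python
# Hypothetical request bodies for the two sub-modes. The exact field
# names are assumptions to confirm against the official docs.
video_only = {
    "model": "vidu-q3",
    "prompt": "The character walks along a moonlit beach.",
    "reference_image_urls": ["https://your-cdn.com/character-reference.jpg"],
}

# Audio & Video Direct Generation: same endpoint, one extra flag.
audio_and_video = {**video_only, "generate_audio": True}
```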
From the fal.ai implementation (fal-ai/vidu/reference-to-video), the input schema for reference-to-video includes:
- prompt — text description of the desired motion/scene
- reference_image_urls — array of image URLs the model uses to anchor visual elements
The model blends the visual identity from reference images with the motion specified in the prompt. This is distinct from image-to-video (where the reference image is frame zero). In reference-to-video, the reference images constrain appearance without necessarily being the first frame.
Benchmark Comparison
Standardized video generation benchmarks are still inconsistent across vendors, and Vidu has not published VBench scores in its current API documentation. The following table uses the best available third-party assessments and documented specifications.
| Model | Max Resolution | Multi-Reference Input | Anime Mode | Audio Generation | VBench / Quality Notes |
|---|---|---|---|---|---|
| Vidu Q3 | 1080p | ✅ Yes | ✅ Yes | ✅ Yes | No public VBench; rated high on subject consistency in multi-reference tasks |
| Kling v2.6 Pro | 1080p | ❌ Single image | ❌ No | ❌ No | Strong VBench scores on motion smoothness; no multi-ref |
| Runway Gen-3 Alpha | 1280×768 | ❌ No | ❌ No | ❌ No | High motion fidelity; no reference image support |
| Kling v3.0 Pro | 1080p | ❌ Single image | ❌ No | ❌ No | Latest Kling; competitive on realism |
What this table tells you: Vidu Q3 is currently the only model in this comparison that combines 1080p output, multi-reference image input, anime-style mode, and audio generation in one API call. If you need any of those last three features, the comparison effectively narrows itself.
Where Vidu Q3 is likely weaker: pure photorealistic motion smoothness in single-shot cinematic sequences. Runway Gen-3 Alpha has been consistently cited for motion quality in head-to-head evaluations. If you’re generating footage where a human walks across a room and nothing stylized is happening, Runway remains a strong choice.
Pricing Comparison
Vidu Q3 is available through multiple platforms. Pricing varies by provider.
| Platform | Model | Pricing Model | Notes |
|---|---|---|---|
| Vidu native (platform.vidu.com) | Vidu Q3 | Credit-based | Native access; refer to platform for current credit costs |
| WaveSpeed AI | Vidu Q3 Text-to-Video | Per-generation | Developer-focused pricing tier |
| fal.ai | Vidu Q3 / Reference-to-Video | Per-second of video | Transparent per-second billing |
| Novita AI | Vidu Q3 Turbo | Per-generation | Turbo variant; lower cost, faster output |
Practical note: fal.ai’s per-second billing model is predictable if your output durations are fixed. If you’re generating variable-length clips, credit-based models from the native platform can be cheaper at volume but harder to budget. Novita AI’s Turbo offering is worth evaluating if you’re doing high-volume, lower-fidelity previews before committing to full Q3 renders.
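A quick budgeting sketch makes the trade-off concrete. The rates below are placeholders, not published prices: per-second billing scales linearly with clip duration, while credit packs behave as a step function of pack size.

```python
# Illustrative budgeting math -- the rates are placeholders, not
# published prices for any specific provider.
PER_SECOND_RATE = 0.05  # assumed $/s for a per-second provider

def per_second_cost(duration_s, rate=PER_SECOND_RATE):
    """Per-second billing: cost grows linearly with clip length."""
    return round(duration_s * rate, 2)

def credit_cost(clips, credits_per_clip, pack_credits, pack_price):
    """Credit billing: cost of enough whole packs to cover `clips` renders."""
    needed = clips * credits_per_clip
    packs = -(-needed // pack_credits)  # ceiling division
    return packs * pack_price
```

The step-function shape is why credit plans are harder to budget for variable-length output: one extra clip can tip you into buying a whole additional pack.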
Kling v2.6 Pro on its native platform and Runway Gen-3 Alpha both operate on credit systems with roughly comparable per-clip costs at 5-second 1080p. No provider in this space is significantly cheaper than the others at the same quality tier — the cost differentials are usually under 20% and are more a function of volume discounts than base rates.
Best Use Cases
Character-consistent content series
You have a defined character — a mascot, a game NPC, a brand avatar. Pass the character sheet as a reference image, write a motion prompt, get consistent appearances across clips. This is the use case multi-reference input was built for.
Anime and stylized animation studios
Small studios generating storyboard animatics or short-form stylized content benefit directly from the anime mode flag. You’re not prompting against the model’s default aesthetic bias.
Product demo videos with branded assets
Pass your product image as a reference, prompt the motion (“rotate slowly on a dark surface with subtle light flare”), and the product identity stays locked. Without reference anchoring, the model may hallucinate product geometry.
Social content with background music
The audio generation feature covers a gap that typically requires a separate API call to a music generation service. For short-form social content (15–30 seconds), generating synchronized background music in the same request reduces pipeline complexity.
Rapid prototyping via Turbo variant
Use Vidu Q3 Turbo on Novita AI for iteration — fast, lower-cost previews before committing to full 1080p Q3 renders. Standard pre-production workflow logic.
Limitations and When NOT to Use Vidu Q3
Do not use for high-fidelity photorealistic human motion. Vidu Q3 has not published VBench motion smoothness scores, and third-party comparisons suggest Runway Gen-3 Alpha and Kling v3.0 Pro outperform it for naturalistic human locomotion. If your use case is a talking-head video or a person walking in a realistic environment, benchmark against those alternatives first.
Do not use if you need synchronous API responses. The POST/GET async pattern adds latency to your integration. If you’re building a user-facing real-time experience where a video needs to appear in under 5 seconds, no current 1080p video generation API will satisfy that — but Vidu Q3 specifically requires polling or webhook infrastructure.
Output format flexibility is limited. The API returns MP4 via URL. If you need specific codecs, frame rates, or container formats, you’ll need a post-processing step. This is not Vidu-specific, but worth stating explicitly.
Reference image quality directly affects output quality. If your reference images are low resolution, poorly lit, or stylistically inconsistent with the prompt, the model output degrades proportionally. This is a workflow constraint, not a model bug, but it means reference-to-video requires more careful asset management than pure text-to-video.
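A minimal pre-flight check can catch most of these asset problems before you spend credits. The thresholds below (512 px minimum side, 10 MB cap, JPEG/PNG/WebP) follow commonly cited guidance, but treat them as assumptions to verify against your provider's documented limits.

```python
# Pre-flight validation for reference assets. Thresholds are
# illustrative defaults, not documented hard limits.
ALLOWED_FORMATS = {"jpeg", "png", "webp"}

def check_reference_image(width, height, fmt, size_bytes,
                          min_side=512, max_bytes=10 * 1024 * 1024):
    """Return a list of problems; an empty list means the image looks usable."""
    problems = []
    if fmt.lower() not in ALLOWED_FORMATS:
        problems.append(f"unsupported format: {fmt}")
    if min(width, height) < min_side:
        problems.append(f"resolution {width}x{height} below {min_side}px minimum")
    if size_bytes > max_bytes:
        problems.append(f"file is {size_bytes} bytes, over the {max_bytes}-byte limit")
    return problems
```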
No public SLA or uptime commitment documented for the fal.ai or WaveSpeed integrations at the time of writing. For production applications with availability requirements, factor in fallback logic.
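One way to implement that fallback logic is to try providers in priority order and surface the accumulated errors only if every one fails. The provider callables here are stand-ins; a real integration would translate request formats per provider and add retry backoff.

```python
def generate_with_fallback(providers, request):
    """Try each (name, fn) provider pair in order; return the first success.

    Each fn(request) returns a video URL or raises. This shows only the
    control flow -- per-provider request translation and backoff are
    left out for brevity.
    """
    errors = []
    for name, fn in providers:
        try:
            return fn(request)
        except Exception as exc:
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")
```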
Minimal Working Code Example
Using the fal.ai SDK for the Vidu reference-to-video endpoint:
import fal_client

# subscribe() submits the task and handles polling internally.
# Note: the Python SDK takes the payload as `arguments=`
# (the JavaScript SDK uses `input`).
result = fal_client.subscribe(
    "fal-ai/vidu/reference-to-video",
    arguments={
        "prompt": "The character walks slowly along a moonlit beach, wind moving through her hair.",
        "reference_image_urls": [
            "https://your-cdn.com/character-reference.jpg"
        ]
    },
    with_logs=True,
)
print(result["video"]["url"])
This submits a reference-to-video task, waits for completion (fal.ai’s subscribe handles polling internally), and prints the output video URL. Replace the reference image URL with an accessible hosted image. No additional headers needed beyond the FAL_KEY environment variable set during fal.ai client initialization.
Conclusion
Vidu Q3’s reference-to-video API is technically differentiated by its multi-reference image input, 1080p ceiling, and combined audio generation — features that competing models at the same price tier don’t bundle together. If those capabilities match your use case, it’s worth a direct evaluation; if you need best-in-class photorealistic human motion without stylization requirements, benchmark Runway Gen-3 Alpha or Kling v3.0 Pro before committing.
Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).
Try this API on AtlasCloud
Frequently Asked Questions
How much does the Vidu Q3 Reference to Video API cost per video generation?
Vidu Q3 API pricing is tiered by resolution: 540p generations cost approximately 10 credits, 720p costs 20 credits, and 1080p costs 40 credits per video. Credit packs are available starting at $10 for 1,000 credits, making a single 1080p reference-to-video generation approximately $0.40. This is notably more cost-effective than comparable APIs like Runway Gen-3, which charges around $0.05 per second.
What is the generation latency for Vidu Q3 reference-to-video API calls in production?
Vidu Q3 typically returns completed video generations within 60–120 seconds for 720p clips and 90–180 seconds for 1080p clips under normal API load. The API uses an asynchronous job model: you POST a generation request and receive a task ID, then poll the status endpoint. Average queue wait time is 15–30 seconds during off-peak hours, with p95 latency reaching up to 4 minutes during peak periods.
How many reference images can I pass to the Vidu Q3 API and what formats are supported?
The Vidu Q3 Reference to Video API accepts between 1 and 4 reference images per request. Supported formats are JPEG, PNG, and WebP, with a maximum individual file size of 10 MB per image and a combined payload limit of 20 MB. Recommended input resolution is at least 512×512 pixels; images below 256×256 pixels may degrade character consistency scores by 15–30% based on internal benchmarks.
What are the API rate limits for Vidu Q3 and how does it compare to Q1 in terms of concurrent jobs?
Vidu Q3 API enforces a default rate limit of 10 concurrent generation jobs per API key and a maximum of 500 requests per day on the standard developer tier. Upgrading to the Pro tier raises concurrency to 30 simultaneous jobs and 2,000 requests per day. By comparison, Vidu Q1 allowed 20 concurrent jobs on the same standard tier, meaning Q3 applies stricter concurrency controls, likely due to the higher compute cost per generation.
Related Articles
Seedance 2.0 Image-to-Video API: Complete Developer Guide
Master the Seedance 2.0 Fast Image-to-Video API with our complete developer guide. Learn endpoints, parameters, authentication, and best practices to build faster.
Seedance 2.0 Fast Reference-to-Video API: Developer Guide
Master the Seedance 2.0 Fast Reference-to-Video API with our complete developer guide. Explore endpoints, parameters, and code examples to build faster video apps.
Seedance 2.0 Text-to-Video API: Complete Developer Guide
Master the Seedance 2.0 Text-to-Video API with our complete developer guide. Explore endpoints, parameters, code examples, and best practices to build AI video apps.