Vidu Q2-Pro-Fast Reference to Video with Audio API: Complete Developer Guide
The Vidu Q2-Pro-Fast model handles one specific workflow: take a reference image (or text prompt) and return a cinematic video clip, with audio, via a single REST call. This guide covers the full API surface, benchmarks against direct competitors, pricing math, and honest trade-offs so you can decide whether it belongs in your production stack.
What Changed: Q2-Pro-Fast vs. Previous Vidu Versions
Before integrating any new model, you need to know whether the upgrade is meaningful or marketing. Here’s what shifted across the Vidu lineage:
| Version | Generation Speed | Max Resolution | Audio Support | Reference Image Control |
|---|---|---|---|---|
| Vidu 1.0 | ~90s per clip | 720p | No | No |
| Vidu 2.0 | ~60s per clip | 1080p | No | Basic |
| Vidu Q1 | ~45s per clip | 1080p | No | Moderate |
| Vidu Q2 Turbo | ~20s per clip | 1080p | No | Moderate |
| Vidu Q2 Pro | ~30s per clip | 1080p | No | High |
| Vidu Q2-Pro-Fast | ~15s per clip | 1080p | Yes | High (multi-reference) |
Key deltas from Q2 Pro to Q2-Pro-Fast:
- ~50% faster generation compared to Q2 Pro (30s → ~15s average)
- Native audio generation added — prior versions required a separate audio pipeline
- Multi-reference control retained from Q2 Pro, not sacrificed for speed
- Start-end frame interpolation is supported, enabling smooth scene-to-scene transitions (documented at Wavespeed and Novita)
The audio addition is the structural change. Every prior Vidu version was silent output. If your pipeline was stitching audio separately, that step is now optional.
Full Technical Specifications
| Parameter | Value |
|---|---|
| API Endpoint | https://pollo.ai/api/platform/generation/vidu/viduq2-pro |
| Method | POST |
| Auth | x-api-key header |
| Content-Type | application/json |
| Output duration | 2–8 seconds |
| Output resolution | Up to 1080p |
| Frame rate | 24 fps (standard) |
| Input modes | Text prompt, image-to-video, reference-image-to-video, start-end frame |
| Audio | Yes — generated natively alongside video |
| Generation speed | ~15s per clip (Pro-Fast tier) |
| Output format | MP4 |
| Multi-reference support | Yes |
| Async/sync | Async (poll or webhook) |
| Providers | pollo.ai, fal.ai, runware.ai, wavespeed.ai, novita.ai |
The endpoint lives under the Pollo.ai platform, but the model is accessible through multiple inference providers: fal.ai, Runware, Novita, and Wavespeed all expose the same underlying model with slightly different SDK wrappers. Pricing differs by provider (covered below).
Benchmark Comparison
Standardized video generation benchmarks are messier than LLM evals, but VBench scores and FID (Fréchet Inception Distance) are the closest thing to apples-to-apples comparison available. Lower FID = better visual fidelity. Higher VBench = better adherence to prompt + temporal consistency.
| Model | VBench Score | FID (lower=better) | Native Audio | Max Duration | Avg. Generation Time |
|---|---|---|---|---|---|
| Vidu Q2-Pro-Fast | ~83.2 | ~28 | ✅ Yes | 8s | ~15s |
| Kling v2.6 Pro | ~84.1 | ~26 | ❌ No | 10s | ~45s |
| Kling v3.0 Pro | ~85.3 | ~24 | ❌ No | 10s | ~60s |
| Runway Gen-3 Alpha | ~81.7 | ~31 | ❌ No | 10s | ~30s |
VBench and FID estimates based on publicly available evals and provider documentation as of mid-2025. Kling v2.6 and v3.0 data sourced from Novita model listings.
Reading the table honestly:
- Kling v3.0 Pro edges out Q2-Pro-Fast on raw visual quality (~85.3 vs ~83.2 VBench)
- Q2-Pro-Fast wins on generation speed (~15s vs ~60s for Kling v3.0 Pro)
- Q2-Pro-Fast is the only model in this comparison with native audio output
- Runway Gen-3 Alpha scores lower on VBench and lacks audio, but has a larger ecosystem
If audio is irrelevant to your pipeline, Kling v3.0 Pro produces slightly better visual output. If audio matters, or if throughput/latency is your bottleneck, Q2-Pro-Fast is the more practical choice.
Pricing vs. Alternatives
Vidu Q2 Pro (and Pro-Fast) uses variable pricing based on resolution and duration. Per fal.ai’s documentation, pricing ranges from $0.10 to $0.80 per video depending on resolution and clip length.
| Provider / Model | 2s @ 720p | 4s @ 1080p | 8s @ 1080p | Audio included |
|---|---|---|---|---|
| Vidu Q2-Pro-Fast (fal.ai) | ~$0.10 | ~$0.40 | ~$0.80 | ✅ Yes |
| Kling v2.6 Pro (Novita) | ~$0.14 | ~$0.28 | ~$0.56 | ❌ No |
| Kling v3.0 Pro (Novita) | ~$0.20 | ~$0.40 | ~$0.80 | ❌ No |
| Runway Gen-3 Alpha | ~$0.25 | ~$0.50 | ~$1.00 | ❌ No |
Prices are approximate based on per-second or per-clip rates published by each provider. Always verify against current provider pricing pages before budgeting.
Cost math for a production batch: If you’re generating 500 × 4-second clips at 1080p per month, Vidu Q2-Pro-Fast comes to roughly $200/month on fal.ai — comparable to Kling v3.0 Pro but with audio included. If you’d otherwise need a separate audio generation step (ElevenLabs sound effects, for example, run ~$0.002–0.008/second), the all-in cost tilts further in Vidu’s favor for audio-required workflows.
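The batch math above can be reproduced in a few lines. The per-clip rates and the ElevenLabs per-second figure are the approximate estimates quoted in this guide, not published prices:

```python
# Monthly cost sketch: 500 x 4-second 1080p clips, using the estimated rates above.
CLIPS_PER_MONTH = 500
VIDU_PER_CLIP = 0.40       # ~$0.40 per 4s @ 1080p on fal.ai (audio included)
KLING_PER_CLIP = 0.40      # ~$0.40 per 4s @ 1080p for Kling v3.0 Pro (no audio)
AUDIO_PER_SECOND = 0.005   # midpoint of the ~$0.002-0.008/s sound-effects estimate
CLIP_SECONDS = 4

vidu_total = CLIPS_PER_MONTH * VIDU_PER_CLIP
kling_total = CLIPS_PER_MONTH * (KLING_PER_CLIP + AUDIO_PER_SECOND * CLIP_SECONDS)

print(f"Vidu (audio included):  ${vidu_total:.2f}/month")   # $200.00/month
print(f"Kling + separate audio: ${kling_total:.2f}/month")  # $210.00/month
```

The gap widens with longer clips, since the separate audio cost scales per second while Vidu's audio rides along at no extra charge.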
Best Use Cases with Concrete Examples
1. Social content with ambient audio. Short-form content (Reels, TikToks, YouTube Shorts) benefits from clips that already carry generated ambient sound. A cooking app generating recipe preview videos can pass a food photography reference image and receive a 4-second clip of sizzling and steam, with no separate audio pipeline.
2. E-commerce product animation. Reference-image-to-video with multi-reference control is documented on Runware (runware.ai/models/vidu-q2-pro). A retailer can pass two product images (front/back) and receive a smooth rotating product clip. The ~15s generation time makes this fast enough for near-realtime catalog updates.
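A request for the two-image product case might look like the sketch below. Note that `reference_images` is an assumed field name based on the multi-reference capability described above, not a documented parameter; check your provider's request schema before using it.

```python
# Hypothetical multi-reference payload -- the "reference_images" field name
# is illustrative only; each provider documents its own schema.
payload = {
    "prompt": "Smooth 360-degree rotation of the product on a white background",
    "reference_images": [
        "https://example.com/product-front.jpg",  # front view
        "https://example.com/product-back.jpg",   # back view
    ],
    "duration": 4,
    "resolution": "1080p",
    "with_audio": False,  # a product spin rarely needs ambient audio
}
# POST this with the same x-api-key header and async polling flow shown in
# the minimal code example later in this guide.
```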
3. Start-end frame interpolation for scene transitions. The start-end-to-video-fast variant (Wavespeed, Novita) takes a start frame and an end frame and generates the interpolated motion between them. Useful for visual storytelling apps where users define keyframes and the model fills in the motion.
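Under the same caveat (field names assumed, not documented), a start-end request would carry the two keyframes instead of reference images:

```python
# Hypothetical start-end frame payload -- "start_frame"/"end_frame" keys are
# illustrative; Wavespeed and Novita each document their own field names.
payload = {
    "start_frame": "https://example.com/keyframe-open.jpg",  # user-defined first frame
    "end_frame": "https://example.com/keyframe-close.jpg",   # user-defined last frame
    "prompt": "Camera pushes in as daylight fades to dusk",
    "duration": 4,
    "resolution": "1080p",
}
```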
4. Rapid prototyping / storyboarding. At ~15s per generation, you can run four variations in the time it takes Kling v3.0 Pro to finish one. For creative iteration cycles (prompt tuning, checking reference image fidelity) the speed differential compounds quickly.
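At ~15s per job, firing variations concurrently is the natural pattern. A sketch of the fan-out with a thread pool; the `generate` function is a stub standing in for the submit-and-poll flow from the minimal code example in this guide:

```python
import concurrent.futures

def generate(prompt: str) -> str:
    """Placeholder: submit a job and poll until done, returning the video URL.

    In a real integration this wraps the POST + status-poll flow; it is
    stubbed here so the fan-out pattern itself is clear and runnable.
    """
    return f"https://example.com/videos/{abs(hash(prompt)) % 10000}.mp4"

variants = [
    "A coastal cliff at dusk, waves crashing, cinematic wide shot",
    "A coastal cliff at dusk, waves crashing, drone orbit",
    "A coastal cliff at dawn, calm sea, slow push-in",
    "A coastal cliff in a storm, handheld look",
]

# Four jobs in flight at once: wall-clock time stays near one generation
# (~15s), not four -- assuming the provider permits this concurrency level.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    urls = list(pool.map(generate, variants))
```

Check your provider's rate limits before scaling `max_workers`; concurrent-job caps vary by plan.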
5. Cinematic b-roll generation. Text-to-video with cinematic framing is a documented use case per Runware's model page. Documentary-style productions can generate background b-roll without stock footage licensing costs.
Limitations and Cases Where You Should NOT Use This Model
Maximum 8 seconds per clip. If your use case needs 30 or 60 seconds of continuous video, you'll be stitching clips. Kling v2.6 Pro caps at 10s, and some providers (Runway) are moving toward longer durations. Stitch artifacts at clip boundaries require post-processing.
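If you do need longer continuous video, the usual workaround is generating 8-second segments and concatenating them with ffmpeg. A sketch, assuming ffmpeg is installed and all clips come from the same model (so codec, resolution, and frame rate match, which `-c copy` requires):

```python
import os
import subprocess
import tempfile

def build_concat_command(list_file: str, output_path: str) -> list:
    # "-c copy" concatenates without re-encoding; this assumes all clips
    # share codec, resolution, and frame rate.
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output_path]

def stitch(clip_paths, output_path):
    """Concatenate MP4 clips with ffmpeg's concat demuxer.

    Note: concatenation alone does not hide boundary artifacts (lighting
    or motion jumps between generated clips); crossfades or start-end
    frame continuity between segments help mask them.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for path in clip_paths:
            f.write(f"file '{os.path.abspath(path)}'\n")
        list_file = f.name
    try:
        subprocess.run(build_concat_command(list_file, output_path), check=True)
    finally:
        os.unlink(list_file)
```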
Audio is generated, not controlled. There’s no documented parameter for specifying audio content — the model infers ambient sound from visual content. You cannot specify dialogue, a specific sound effect, or a music track through this API. For any precise audio requirement, you still need a separate audio pipeline.
No 4K output. Maximum documented resolution is 1080p. If your delivery spec requires 4K, this model doesn’t qualify today.
Visual quality ceiling below Kling v3.0 Pro. The ~2-point VBench gap isn’t dramatic, but in fine detail shots (facial close-ups, product macro shots), it shows. If your use case is heavily detail-sensitive and you can tolerate the slower generation time, Kling v3.0 Pro is worth the comparison test.
Async-only generation. There’s no synchronous response. Every job requires polling or a webhook callback. If you’re building a low-latency UX where users expect video in under 5 seconds, the architecture overhead matters — you need a job queue and a status endpoint regardless of the ~15s model speed.
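Given the async-only design, the polling loop deserves a timeout and failure handling rather than an unconditional `while True`. A hedged sketch; the `state` values mirror the minimal example in this guide, but your provider's status schema may differ:

```python
import time

def poll_until_done(fetch_status, timeout_s=120, interval_s=3):
    """Poll a status-fetching callable until the job completes or fails.

    fetch_status: zero-arg callable returning a dict such as
    {"state": "...", "video_url": "..."} (assumed shape).
    Raises TimeoutError if the job does not finish within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("state") == "completed":
            return status["video_url"]
        if status.get("state") == "failed":
            raise RuntimeError(f"generation failed: {status}")
        time.sleep(interval_s)
    raise TimeoutError(f"job did not complete within {timeout_s}s")
```

With ~15s typical generation, a 120s timeout leaves generous headroom for queueing delays while still surfacing stuck jobs.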
Provider fragmentation. The same model runs across pollo.ai, fal.ai, Runware, Novita, and Wavespeed with different SDKs, different pricing, and different SLA commitments. There’s no single canonical provider with guaranteed uptime SLAs published. Evaluate your provider’s reliability independently.
Minimal Working Code Example
```python
import requests
import time

API_KEY = "your_api_key_here"
ENDPOINT = "https://pollo.ai/api/platform/generation/vidu/viduq2-pro"

payload = {
    "prompt": "A coastal cliff at dusk, waves crashing, cinematic wide shot",
    "duration": 4,
    "resolution": "1080p",
    "with_audio": True,
}
headers = {"Content-Type": "application/json", "x-api-key": API_KEY}

response = requests.post(ENDPOINT, json=payload, headers=headers)
response.raise_for_status()
job = response.json()

# Generation is async: poll the job status until it completes or fails.
while True:
    status = requests.get(f"{ENDPOINT}/status/{job['id']}", headers=headers).json()
    if status["state"] == "completed":
        print(status["video_url"])
        break
    if status["state"] == "failed":
        raise RuntimeError(f"generation failed: {status}")
    time.sleep(3)
```
This is a polling pattern. Replace the status endpoint path with whatever your provider documents; fal.ai and Runware use slightly different job ID structures. The `with_audio: True` flag is what triggers native audio generation.
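The webhook alternative avoids polling entirely: register a callback URL at submission time and handle the provider's POST when the job finishes. A minimal sketch using only the standard library; the event payload shape shown is an assumption, and each provider documents its own callback format:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_event(event):
    """Return the video URL if the job completed, else None.

    Assumed event shape: {"id": "...", "state": "...", "video_url": "..."}.
    """
    if event.get("state") == "completed":
        return event.get("video_url")
    return None

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        url = handle_event(json.loads(self.rfile.read(length)))
        if url:
            print("video ready:", url)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Point the job's callback/webhook URL at http://<host>:8080/ when submitting.
    HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```

In production you would put this behind TLS and verify a signature or shared secret on incoming callbacks, since the endpoint is otherwise open to anyone.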
Conclusion
Vidu Q2-Pro-Fast is the right choice when native audio output and generation speed (~15s) matter more than pushing visual quality to its absolute ceiling. If you need the best-looking frames and can absorb 4× slower generation, Kling v3.0 Pro still holds the quality edge — but it won’t give you audio.
Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).
Frequently Asked Questions
How fast is Vidu Q2-Pro-Fast compared to previous Vidu models?
Vidu Q2-Pro-Fast is significantly faster than its predecessors. For reference: Vidu 1.0 took ~90s per clip, Vidu 2.0 took ~60s, Vidu Q1 took ~45s, and Vidu Q2 Turbo took ~20s per clip. The Q2-Pro-Fast variant is optimized for speed while retaining Pro-level reference image control and adding audio support, a feature absent in all previous versions. This makes it the first model in the Vidu lineage to combine this level of speed with native audio output.
Does Vidu Q2-Pro-Fast support audio in generated videos and how is it delivered via API?
Yes — Vidu Q2-Pro-Fast is the first model in the Vidu lineup to include native audio support. All previous versions (Vidu 1.0, 2.0, Q1, Q2 Turbo, and Q2 Pro) had no audio capability. Audio is returned as part of the video clip in a single REST API response, meaning no separate audio generation call is required. The reference-to-video endpoint accepts either a reference image or a text prompt and returns the finished clip, audio included, in one response.
What is the maximum resolution supported by Vidu Q2-Pro-Fast?
Vidu Q2-Pro-Fast supports a maximum resolution of 1080p, consistent with the Vidu 2.0, Q1, Q2 Turbo, and Q2 Pro models. This is an upgrade over Vidu 1.0, which was capped at 720p. For production stacks requiring full HD output with reference image control and audio, Q2-Pro-Fast is currently the only Vidu model that combines 1080p resolution, audio support, and advanced reference image control in a single model.
How does Vidu Q2-Pro-Fast handle reference image input in the API request?
Vidu Q2-Pro-Fast offers the strongest reference image control in the Vidu model family, rated 'High (multi-reference)' versus no support in Vidu 1.0, 'Basic' in Vidu 2.0, and 'Moderate' in Q1 and Q2 Turbo. The API accepts either a reference image or a text prompt as input via a single REST call. This means developers can anchor video generation to a specific visual asset, which is useful for brand consistency across generated clips.