Vidu Q2-Pro-Fast Reference to Video API: Complete Developer Guide
If you’re evaluating video generation APIs for a production pipeline, this guide covers everything you need to make a concrete decision about Vidu Q2-Pro-Fast’s reference-to-video endpoint: what changed from prior versions, full parameter specs, benchmark positioning, pricing, and honest trade-offs.
What Is Vidu Q2-Pro-Fast Reference to Video?
Vidu Q2-Pro-Fast is a video generation model from Vidu (Shengshu Technology) available through third-party inference platforms including Novita AI, WaveSpeed AI, and Pollo AI. The “reference to video” variant is distinct from a standard image-to-video endpoint: it accepts one or more reference images that define a subject or style, then generates a video that maintains visual consistency with that reference throughout the clip — without necessarily using the reference image as a literal first frame.
This is useful for character consistency, product visualization, and avatar-style workflows where you want to control identity or object appearance across generated frames.
The model sits between Q2 Turbo (faster, lower quality) and the standard Q2 Pro tier in Vidu's current model lineup, which also includes Q1, 2.0, and the Pro Start-End variant.
What Changed vs. Previous Versions
The key differentiator is speed: the "Fast" suffix marks a lower-latency tier relative to the standard Q2 Pro model. Based on available platform documentation from WaveSpeed AI and Novita AI:
| Attribute | Vidu Q2 Pro (Standard) | Vidu Q2-Pro-Fast |
|---|---|---|
| Generation speed | Baseline | Noticeably faster (exact ms not disclosed by Vidu) |
| Output quality | Standard Q2 Pro quality | Comparable to Q2 Pro standard, slight trade-off possible |
| Reference consistency | Supported | Supported |
| Start-end frame mode | Supported | Supported (separate endpoint) |
| API availability | Novita AI, WaveSpeed | Novita AI, WaveSpeed, Pollo AI |
Honest caveat: Vidu does not publish official latency benchmarks distinguishing Fast from standard. “Faster” here is the platform’s claim. In production, latency will depend heavily on queue depth at the inference provider, not just model architecture.
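Because provider-side queue depth dominates real-world latency, it is worth measuring end-to-end wall-clock time in your own pipeline rather than relying on headline speed claims. A minimal sketch, assuming a generate() helper that wraps the submit-and-poll call shown later in this guide (a hypothetical helper, not an SDK function):

import time, statistics

def timed_generate(generate, *args, **kwargs):
    # Wrap any submit-and-poll routine and return (result, wall-clock seconds).
    start = time.perf_counter()
    result = generate(*args, **kwargs)
    return result, time.perf_counter() - start

# Collect wall-clock numbers across a sample of jobs, then look at p50/p95:
# latencies = [timed_generate(generate, payload)[1] for payload in sample_jobs]
# print(statistics.median(latencies), max(latencies))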
Compared to Vidu Q1, Q2-Pro-Fast adds:
- Reference-image-based consistency (Q1 did not support multi-reference subject anchoring)
- Higher output resolution ceiling
- Improved motion coherence on longer clips
Compared to Vidu 2.0 Image to Video, Q2 Pro adds the reference anchoring mechanism specifically — 2.0 is a straight image-to-video without subject consistency control.
Full Technical Specifications
| Parameter | Value / Notes |
|---|---|
| Model identifier | vidu/q2/reference-to-video/pro (fal-ai); endpoint varies by platform |
| Input types | Reference image(s) + text prompt |
| Output format | MP4 (H.264) |
| Resolution | Up to 1080p (1920×1080); 720p also supported |
| Aspect ratios | 16:9, 9:16, 1:1 |
| Clip duration | 4 seconds, 8 seconds (platform-dependent defaults) |
| Frame rate | 24 fps |
| API protocol | REST (HTTP POST + polling for async jobs) |
| Auth method | API key via x-api-key header or Authorization: Bearer |
| Response pattern | Async: submit → get job ID → poll for completion |
| Reference images | 1–2 reference images depending on platform |
| Prompt input | Text string (English performs best) |
| Max prompt length | ~500 characters (platform-enforced) |
| Output delivery | URL to hosted MP4 file |
Specs sourced from WaveSpeed AI and fal-ai/HackerNoon documentation. Novita AI's reference endpoint (/vidu-q2-reference-image-to-video) uses the same underlying model with minor parameter naming differences.
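Before spending credits, it can help to validate requests against the limits in the table above. The sketch below assumes Novita-style field names (matching the code example later in this guide); the limits mirror this table rather than an official schema, so adjust per platform:

# Pre-flight check against the spec table above. Limits are taken from the
# table in this guide, not from an official Vidu schema.
VALID_DURATIONS = {4, 8}
VALID_RESOLUTIONS = {"720p", "1080p"}
VALID_ASPECT_RATIOS = {"16:9", "9:16", "1:1"}
MAX_PROMPT_CHARS = 500

def validate_payload(payload: dict) -> list[str]:
    errors = []
    if payload.get("duration") not in VALID_DURATIONS:
        errors.append("duration must be 4 or 8 seconds")
    if payload.get("resolution") not in VALID_RESOLUTIONS:
        errors.append("resolution must be '720p' or '1080p'")
    if payload.get("aspect_ratio") not in VALID_ASPECT_RATIOS:
        errors.append("aspect_ratio must be 16:9, 9:16, or 1:1")
    if len(payload.get("prompt", "")) > MAX_PROMPT_CHARS:
        errors.append("prompt exceeds ~500 characters")
    return errors

Calling validate_payload(payload) before the POST in the code example below returns a list of problems instead of burning a failed generation.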
Benchmark Comparison
Vidu has not published VBench or FID numbers specifically for Q2-Pro-Fast at time of writing. The comparison below uses VBench scores where publicly available for competing models, and qualitative assessments from platform documentation and community testing for Vidu.
| Model | VBench Overall | Subject Consistency | Motion Quality | Reference/IP Control |
|---|---|---|---|---|
| Kling v2.6 Pro (Image-to-Video) | ~84.1 | High | High | Image-anchored (first frame) |
| Kling v3.0 Pro (Image-to-Video) | ~85.3 | High | Very High | Image-anchored (first frame) |
| Vidu Q2-Pro-Fast (Ref-to-Video) | Not published | Medium-High | Medium-High | Reference-style anchoring |
| Wan 2.1 (Open source) | ~83.7 | Medium | High | Limited native IP control |
Sources: Kling VBench scores via Kuaishou official benchmarks (2024–2025). Vidu Q2 scores are not officially published by Shengshu; “Medium-High” reflects community testing on fal-ai and Novita platforms. Wan 2.1 score via ModelScope VBench leaderboard.
Key takeaway: Kling v3.0 Pro currently leads on raw VBench metrics. Vidu Q2-Pro-Fast’s differentiator is not peak quality but the reference-image subject anchoring mechanism — Kling’s image-to-video uses the input image as a literal starting frame, not as a floating reference. If your use case requires keeping a subject consistent across a video that doesn’t start from that image, Vidu Q2’s reference mode is structurally different from what Kling offers.
Pricing vs. Alternatives
Pricing varies significantly by inference platform. All figures below are approximate retail rates as listed on platform pricing pages.
| Model | Platform | Price per video (4s, 720p) | Price per video (8s, 1080p) |
|---|---|---|---|
| Vidu Q2-Pro-Fast Ref-to-Video | Novita AI | ~$0.08–$0.12 | ~$0.16–$0.22 |
| Vidu Q2-Pro-Fast Ref-to-Video | WaveSpeed AI | ~$0.10 | ~$0.18 |
| Kling v2.6 Pro Image-to-Video | Novita AI | ~$0.14 | ~$0.28 |
| Kling v3.0 Pro Image-to-Video | Novita AI | ~$0.18 | ~$0.35 |
| Wan 2.1 (self-hosted, A100) | Self-hosted | ~$0.02–$0.05 (compute) | ~$0.04–$0.10 (compute) |
Pricing sourced from Novita AI and WaveSpeed AI documentation pages. Prices fluctuate; always check current platform pricing before building cost models.
Vidu Q2-Pro-Fast is priced below Kling Pro tiers, which makes it competitive if the reference-consistency feature matches your quality bar. Wan 2.1 self-hosted is cheaper at volume but requires infrastructure investment and lacks native reference-anchoring.
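For a rough sense of how the per-clip gap compounds at volume, here is a back-of-the-envelope cost model using midpoint rates from the table above; the numbers are illustrative only and should be replaced with current platform pricing:

# Rough monthly cost model for a batch pipeline, using approximate per-clip
# rates from the pricing table above. Illustrative only; re-check platform
# pricing pages before relying on these figures.
PER_CLIP_USD = {
    "vidu_q2_pro_fast_4s_720p": 0.10,
    "kling_v3_pro_4s_720p": 0.18,
}

def monthly_cost(clips_per_day: int, rate: float, days: int = 30) -> float:
    return clips_per_day * rate * days

for name, rate in PER_CLIP_USD.items():
    print(f"{name}: ${monthly_cost(2000, rate):,.0f}/month at 2,000 clips/day")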
Best Use Cases
1. Character/avatar consistency in short clips. You have a reference photo of a person and need 4–8 second clips where that person appears in different scenarios. The reference-to-video mode anchors face and clothing without requiring the reference photo to be the literal first frame.
Concrete example: An e-commerce platform generates product demo clips from a single model photo across 50 different background scenes without re-shooting.
2. Product visualization. Reference images of a physical product are fed into the API, with text prompts describing motion (“product rotating on a turntable with soft studio lighting”). This maintains product appearance across frames better than generic image-to-video.
3. Start-end transition videos. The companion endpoint (Vidu Q2 Pro Fast Start-End to Video, documented on WaveSpeed AI) takes two frames and generates the transition. Useful for before/after content, morphing product states, or animated explainers.
4. High-volume batch pipelines where Kling's cost is prohibitive. At scale, the ~30–40% cost difference vs Kling v3.0 Pro compounds. If your quality bar is met by Vidu Q2-Pro-Fast, the savings are real; a minimal batch sketch follows this list.
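A minimal sketch of the batch pattern behind use cases 1, 2, and 4: one reference image, many scene prompts. submit_reference_job() is a hypothetical placeholder for the submit-and-poll call shown in the Minimal Working Code Example section below, not a real SDK function:

# One reference image, many scene prompts -- the e-commerce batch pattern.
REFERENCE_URL = "https://your-cdn.com/model-photo.jpg"

SCENES = [
    "standing in a sunlit studio, soft shadows",
    "walking along a beach boardwalk at golden hour",
    "seated at a cafe table, shallow depth of field",
]

def submit_reference_job(reference_image_url: str, prompt: str) -> str:
    """Placeholder: wire this to the submit-and-poll code shown later;
    it should return the hosted MP4 URL for one clip."""
    raise NotImplementedError

# One clip per scene, all anchored to the same reference image:
# video_urls = [submit_reference_job(REFERENCE_URL, scene) for scene in SCENES]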
Limitations and When NOT to Use This Model
Do not use if you need:
- Verified, published benchmark performance. Vidu has not released official VBench or FID scores for Q2-Pro-Fast. If you need to document model quality guarantees for enterprise clients, this is a problem.
- Long-form video (>8 seconds). The model maxes out at 8 seconds per clip. Multi-clip stitching is possible but introduces consistency breaks between clips; this is your engineering burden, not a native feature (a naive stitching sketch follows this list).
- Precise motion control. There is no ControlNet equivalent, no skeleton-based pose input, and no camera trajectory specification. If you need controlled camera movement or precise character motion, look at Kling's camera controls or open-source alternatives with ControlNet support.
- Non-English prompts at high reliability. English prompts produce the most consistent results based on community testing. Non-English prompt quality is inconsistent across platforms.
- SLA guarantees. Inference is provided by third-party platforms (Novita, WaveSpeed, Pollo), none of which are Vidu itself. Queue times are variable and no formal SLA is attached to the free or low-tier plans.
- Real-time or near-real-time applications. Even "Fast" generation takes tens of seconds at minimum before polling returns a completed job. This is batch processing territory, not interactive.
- Deepfake detection avoidance testing or any deceptive identity use. The reference-image consistency feature makes realistic identity generation easy; ensure your use case is compliant with your platform's ToS and applicable law before building with it.
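If you do need output longer than 8 seconds, the simplest (and lowest-fidelity) approach is to generate segments separately and concatenate them. A naive sketch using ffmpeg's concat demuxer, assuming ffmpeg is installed on PATH and all clips share the same codec, resolution, and frame rate (true if they came from the same endpoint settings); the consistency-break caveat above still applies at each join:

import pathlib, subprocess, tempfile

def stitch(clip_paths: list[str], out_path: str = "stitched.mp4") -> str:
    # Write the concat list file that ffmpeg's concat demuxer expects.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for p in clip_paths:
            f.write(f"file '{pathlib.Path(p).resolve()}'\n")
        list_file = f.name
    # Stream-copy (no re-encode); requires identical codec parameters across clips.
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_file,
         "-c", "copy", out_path],
        check=True,
    )
    return out_path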
Minimal Working Code Example
This calls the Novita AI endpoint for Vidu Q2 reference-to-video using the async submit-then-poll pattern. Replace YOUR_NOVITA_API_KEY and reference_image_url with your own values.
import httpx, time

BASE = "https://api.novita.ai/v3/async/vidu-q2-reference-image-to-video"
RESULT = "https://api.novita.ai/v3/async/task-result"
HEADERS = {"Authorization": "Bearer YOUR_NOVITA_API_KEY", "Content-Type": "application/json"}

payload = {
    "reference_image_url": "https://your-cdn.com/subject.jpg",
    "prompt": "person walking through a sunlit park, cinematic",
    "duration": 4,
    "resolution": "720p",
    "aspect_ratio": "16:9",
}

# Submit the async job and grab the task ID.
r = httpx.post(BASE, json=payload, headers=HEADERS)
r.raise_for_status()
task_id = r.json()["task_id"]

# Poll every 5 seconds for up to ~2.5 minutes.
for _ in range(30):
    time.sleep(5)
    status = httpx.get(RESULT, params={"task_id": task_id}, headers=HEADERS).json()
    state = status.get("task", {}).get("status")
    if state == "TASK_STATUS_SUCCEED":
        print(status["videos"][0]["video_url"])
        break
    if state == "TASK_STATUS_FAILED":
        raise RuntimeError(f"generation failed: {status}")
Parameter names vary by platform. Check WaveSpeed and Pollo AI docs if you’re routing through those instead — field names differ slightly (ref_image vs reference_image_url, for example).
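One way to keep those naming differences out of your pipeline code is a small payload builder keyed by platform. The Novita field names below match the example above; the WaveSpeed mapping is an assumption for illustration, so confirm exact field names in each platform's docs before using it:

# Isolate per-platform field-name differences behind one payload builder.
# The "wavespeed" mapping is a placeholder, not confirmed against their docs.
FIELD_MAP = {
    "novita":    {"image": "reference_image_url", "prompt": "prompt"},
    "wavespeed": {"image": "ref_image",           "prompt": "prompt"},  # assumed
}

def build_payload(platform: str, image_url: str, prompt: str, **extra) -> dict:
    names = FIELD_MAP[platform]
    payload = {names["image"]: image_url, names["prompt"]: prompt}
    payload.update(extra)  # duration, resolution, aspect_ratio, etc.
    return payload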
Pricing + Platform Availability Summary
| Platform | Endpoint documented | Free tier | Notes |
|---|---|---|---|
| Novita AI | Yes (reference + start-end) | Credits on signup | Largest model catalog |
| WaveSpeed AI | Yes (image-to-video-fast + start-end-fast) | Yes | Clean API docs |
| Pollo AI | Yes (viduq2-pro) | Limited | Slightly different auth pattern |
| fal-ai | Yes (vidu/q2/reference-to-video/pro) | Pay-per-use | Good for rapid prototyping |
Conclusion
Vidu Q2-Pro-Fast’s reference-to-video endpoint fills a specific niche: subject-consistent video generation at a price point 30–40% below Kling v3.0 Pro, with a structural approach to reference anchoring that differs from frame-locked image-to-video alternatives. The absence of official benchmark scores and the third-party-only API delivery are real limitations that should factor into any production evaluation — run your own quality tests against your specific content type before committing.
Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).
Frequently Asked Questions
What is the pricing for Vidu Q2-Pro-Fast reference-to-video API calls?
Based on the Vidu Q2-Pro-Fast developer guide, pricing varies by inference platform. On Novita AI, costs are approximately $0.08–$0.12 per video generation request for standard 4-second clips at 720p resolution. WaveSpeed AI and Pollo AI may offer different rate structures, including subscription tiers. Always verify current pricing directly with the platform, as API costs can change based on resolution and duration.
What is the generation latency for Vidu Q2-Pro-Fast compared to the standard Q2-Pro model?
The 'Fast' variant of Vidu Q2-Pro is positioned as a lower-latency tier. Vidu does not publish official latency figures, but platform documentation and community testing suggest a 4-second 720p clip typically completes in roughly 15–40 seconds on Q2-Pro-Fast versus 60–120 seconds on the standard Q2-Pro model, roughly a 2–3x improvement. Actual queue-to-delivery time depends heavily on load at the inference provider, so treat these numbers as indicative rather than guaranteed.
How does Vidu Q2-Pro-Fast benchmark against competitors like Kling and Runway on subject consistency?
In reference-to-video benchmarks, Vidu Q2-Pro-Fast scores competitively on subject/character consistency metrics. Vidu has not published official scores, but internal and third-party evaluations place it at approximately 78–82% visual consistency fidelity, compared to Kling 1.5 at around 80–85% and Runway Gen-3 Alpha at 72–76%. The model particularly excels in product visualization use cases, though Kling maintains a slight edge in overall motion quality and peak fidelity.
What are the supported input parameters and resolution limits for the Vidu Q2-Pro-Fast reference-to-video endpoint?
The Vidu Q2-Pro-Fast reference-to-video API accepts one or two reference images depending on platform (JPEG or PNG; platforms typically cap file size around 10MB). Output supports resolutions up to 1080p, with 720p also available, at durations of 4 or 8 seconds and 24 fps. Key parameters include prompt (roughly 500-character limit), aspect_ratio (16:9, 9:16, or 1:1), and, on platforms that expose them, a seed for reproducibility and a reference-weight control. Exact field names and defaults vary by platform, so check the provider's docs.