Vidu Q2-Pro-Fast Reference to Video API: Complete Developer Guide

AI API Playbook · 9 min read

If you’re evaluating video generation APIs for a production pipeline, this guide covers everything you need to make a concrete decision about Vidu Q2-Pro-Fast’s reference-to-video endpoint: what changed from prior versions, full parameter specs, benchmark positioning, pricing, and honest trade-offs.


What Is Vidu Q2-Pro-Fast Reference to Video?

Vidu Q2-Pro-Fast is a video generation model from Vidu (Shengshu Technology) available through third-party inference platforms including Novita AI, WaveSpeed AI, and Pollo AI. The “reference to video” variant is distinct from a standard image-to-video endpoint: it accepts one or more reference images that define a subject or style, then generates a video that maintains visual consistency with that reference throughout the clip — without necessarily using the reference image as a literal first frame.

This is useful for character consistency, product visualization, and avatar-style workflows where you want to control identity or object appearance across generated frames.
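To make the distinction concrete, here is a minimal sketch of the two request shapes. The field names are illustrative only, not an exact platform schema; real endpoint parameters are covered in the specs and code sections of this guide.

```python
# Illustrative payloads only -- field names vary by platform.

# Image-to-video: the input image IS the first frame of the output clip.
i2v_payload = {
    "image_url": "https://example.com/frame0.jpg",  # literal frame 0
    "prompt": "camera slowly pans right",
}

# Reference-to-video: the image anchors subject identity/style, but the
# generated clip does not have to start from (or ever contain) it.
ref2v_payload = {
    "reference_image_url": "https://example.com/subject.jpg",  # identity anchor
    "prompt": "the same person walking through a sunlit park",
}
```

The reference payload describes who or what stays consistent; the prompt describes the scene, which can be entirely different from the reference image.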

In Vidu's current lineup, the model sits between Q2 Turbo (faster, lower quality) and the standard Q2 Pro tier; the lineup also includes Q1, 2.0, and the Pro Start-End variant.


What Changed vs. Previous Versions

Speed is the key differentiator: the "Fast" suffix marks a latency-optimized variant of the standard Q2 Pro tier. Based on available platform documentation from WaveSpeed AI and Novita AI:

| Attribute | Vidu Q2 Pro (Standard) | Vidu Q2-Pro-Fast |
|---|---|---|
| Generation speed | Baseline | Noticeably faster (exact ms not disclosed by Vidu) |
| Output quality | Standard Q2 Pro quality | Comparable to Q2 Pro standard, slight trade-off possible |
| Reference consistency | Supported | Supported |
| Start-end frame mode | Supported | Supported (separate endpoint) |
| API availability | Novita AI, WaveSpeed | Novita AI, WaveSpeed, Pollo AI |

Honest caveat: Vidu does not publish official latency benchmarks distinguishing Fast from standard. “Faster” here is the platform’s claim. In production, latency will depend heavily on queue depth at the inference provider, not just model architecture.

Compared to Vidu Q1, Q2-Pro-Fast adds:

  • Reference-image-based consistency (Q1 did not support multi-reference subject anchoring)
  • Higher output resolution ceiling
  • Improved motion coherence on longer clips

Compared to Vidu 2.0 Image to Video, Q2 Pro adds the reference anchoring mechanism specifically — 2.0 is a straight image-to-video without subject consistency control.


Full Technical Specifications

| Parameter | Value / Notes |
|---|---|
| Model identifier | vidu/q2/reference-to-video/pro (fal-ai); endpoint varies by platform |
| Input types | Reference image(s) + text prompt |
| Output format | MP4 (H.264) |
| Resolution | Up to 1080p (1920×1080); 720p also supported |
| Aspect ratios | 16:9, 9:16, 1:1 |
| Clip duration | 4 seconds, 8 seconds (platform-dependent defaults) |
| Frame rate | 24 fps |
| API protocol | REST (HTTP POST + polling for async jobs) |
| Auth method | API key via x-api-key header or Authorization: Bearer |
| Response pattern | Async: submit → get job ID → poll for completion |
| Reference images | 1–2 reference images depending on platform |
| Prompt input | Text string (English performs best) |
| Max prompt length | ~500 characters (platform-enforced) |
| Output delivery | URL to hosted MP4 file |

Specs sourced from WaveSpeed AI and fal-ai/HackerNoon documentation. Novita AI’s reference endpoint (/vidu-q2-reference-image-to-video) uses the same underlying model with minor parameter naming differences.
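A small client-side validator built from the spec table above can catch limit violations before you spend credits. This is a sketch under the table's constraints; the reference_image_urls list field is an assumption, since exact field names vary by platform.

```python
VALID_DURATIONS = {4, 8}               # seconds, per the spec table
VALID_ASPECTS = {"16:9", "9:16", "1:1"}
VALID_RESOLUTIONS = {"720p", "1080p"}
MAX_PROMPT_CHARS = 500                 # approximate platform-enforced limit
MAX_REFERENCE_IMAGES = 2               # 1-2 depending on platform

def validate_payload(payload: dict) -> list[str]:
    """Return a list of spec violations; an empty list means the payload looks OK."""
    errors = []
    if payload.get("duration") not in VALID_DURATIONS:
        errors.append(f"duration must be one of {sorted(VALID_DURATIONS)}")
    if payload.get("aspect_ratio") not in VALID_ASPECTS:
        errors.append("aspect_ratio must be 16:9, 9:16, or 1:1")
    if payload.get("resolution") not in VALID_RESOLUTIONS:
        errors.append("resolution must be 720p or 1080p")
    if len(payload.get("prompt", "")) > MAX_PROMPT_CHARS:
        errors.append(f"prompt exceeds ~{MAX_PROMPT_CHARS} characters")
    refs = payload.get("reference_image_urls", [])
    if not 1 <= len(refs) <= MAX_REFERENCE_IMAGES:
        errors.append("provide 1-2 reference images")
    return errors
```

Running this before submission turns an opaque platform-side rejection into an actionable local error message.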


Benchmark Comparison

Vidu has not published VBench or FID numbers specifically for Q2-Pro-Fast at time of writing. The comparison below uses VBench scores where publicly available for competing models, and qualitative assessments from platform documentation and community testing for Vidu.

| Model | VBench Overall | Subject Consistency | Motion Quality | Reference/IP Control |
|---|---|---|---|---|
| Kling v2.6 Pro (Image-to-Video) | ~84.1 | High | High | Image-anchored (first frame) |
| Kling v3.0 Pro (Image-to-Video) | ~85.3 | High | Very High | Image-anchored (first frame) |
| Vidu Q2-Pro-Fast (Ref-to-Video) | Not published | Medium-High | Medium-High | Reference-style anchoring |
| Wan 2.1 (Open source) | ~83.7 | Medium | High | Limited native IP control |

Sources: Kling VBench scores via Kuaishou official benchmarks (2024–2025). Vidu Q2 scores are not officially published by Shengshu; “Medium-High” reflects community testing on fal-ai and Novita platforms. Wan 2.1 score via ModelScope VBench leaderboard.

Key takeaway: Kling v3.0 Pro currently leads on raw VBench metrics. Vidu Q2-Pro-Fast’s differentiator is not peak quality but the reference-image subject anchoring mechanism — Kling’s image-to-video uses the input image as a literal starting frame, not as a floating reference. If your use case requires keeping a subject consistent across a video that doesn’t start from that image, Vidu Q2’s reference mode is structurally different from what Kling offers.


Pricing vs. Alternatives

Pricing varies significantly by inference platform. All figures below are approximate retail rates as listed on platform pricing pages.

| Model | Platform | Price per video (4s, 720p) | Price per video (8s, 1080p) |
|---|---|---|---|
| Vidu Q2-Pro-Fast Ref-to-Video | Novita AI | ~$0.08–$0.12 | ~$0.16–$0.22 |
| Vidu Q2-Pro-Fast Ref-to-Video | WaveSpeed AI | ~$0.10 | ~$0.18 |
| Kling v2.6 Pro Image-to-Video | Novita AI | ~$0.14 | ~$0.28 |
| Kling v3.0 Pro Image-to-Video | Novita AI | ~$0.18 | ~$0.35 |
| Wan 2.1 (self-hosted, A100) | Self-hosted | ~$0.02–$0.05 (compute) | ~$0.04–$0.10 (compute) |

Pricing sourced from Novita AI and WaveSpeed AI documentation pages. Prices fluctuate; always check current platform pricing before building cost models.

Vidu Q2-Pro-Fast is priced below Kling Pro tiers, which makes it competitive if the reference-consistency feature matches your quality bar. Wan 2.1 self-hosted is cheaper at volume but requires infrastructure investment and lacks native reference-anchoring.
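Using the approximate rates above, a quick back-of-envelope model shows how the gap compounds at volume. The rates are assumed midpoints pulled from the table, not quoted prices.

```python
def batch_cost(price_per_clip: float, clips: int) -> float:
    """Total cost of a batch at a flat per-clip rate, rounded to cents."""
    return round(price_per_clip * clips, 2)

# Assumed midpoint rates from the pricing table (4s / 720p clips):
VIDU_RATE = 0.10    # Vidu Q2-Pro-Fast on Novita AI, midpoint of $0.08-$0.12
KLING_RATE = 0.18   # Kling v3.0 Pro on Novita AI

clips = 10_000
vidu_total = batch_cost(VIDU_RATE, clips)     # 1000.0
kling_total = batch_cost(KLING_RATE, clips)   # 1800.0
savings = kling_total - vidu_total            # 800.0 saved per 10k clips
```

At ten thousand clips the difference is in the hundreds of dollars; rerun the arithmetic with current platform rates before committing to a cost model.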


Best Use Cases

1. Character/avatar consistency in short clips. You have a reference photo of a person and need 4–8 second clips where that person appears in different scenarios. The reference-to-video mode anchors face and clothing without requiring the reference photo to be the literal first frame.

Concrete example: An e-commerce platform generates product demo clips from a single model photo across 50 different background scenes without re-shooting.

2. Product visualization. Reference images of a physical product are fed into the API, with text prompts describing motion (“product rotating on a turntable with soft studio lighting”). This maintains product appearance across frames better than generic image-to-video.

3. Start-end transition videos. The companion endpoint (Vidu Q2 Pro Fast Start-End to Video, documented on WaveSpeed AI) takes two frames and generates the transition. Useful for before/after content, morphing product states, or animated explainers.

4. High-volume batch pipelines where Kling’s cost is prohibitive. At scale, the roughly 40–50% lower per-clip cost vs Kling v3.0 Pro (per the pricing table above) compounds. If your quality bar is met by Vidu Q2-Pro-Fast, the savings are real.
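The batch use case above can be sketched as a payload builder that fans one reference image out across many scene prompts. Field names follow the Novita-style endpoint used in the code example later in this guide; the scene list and URLs are hypothetical.

```python
REFERENCE = "https://your-cdn.com/model-photo.jpg"  # single identity anchor

SCENES = [
    "standing in a minimalist white studio",
    "walking along a beach at golden hour",
    "seated in a modern cafe, soft window light",
]

def build_jobs(reference_url: str, scenes: list[str]) -> list[dict]:
    """One payload per scene, all anchored to the same reference image."""
    return [
        {
            "reference_image_url": reference_url,
            "prompt": f"person {scene}, cinematic",
            "duration": 4,
            "resolution": "720p",
            "aspect_ratio": "9:16",
        }
        for scene in scenes
    ]

jobs = build_jobs(REFERENCE, SCENES)  # 3 payloads sharing one identity anchor
```

Each payload then goes through the normal submit-and-poll flow; rate-limit your submissions according to the platform's concurrency rules.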


Limitations and When NOT to Use This Model

Do not use if you need:

  • Verified, published benchmark performance. Vidu has not released official VBench or FID scores for Q2-Pro-Fast. If you need to document model quality guarantees for enterprise clients, this is a problem.

  • Long-form video (>8 seconds). The model maxes out at 8 seconds per clip. Multi-clip stitching is possible but introduces consistency breaks between clips — this is your engineering burden, not a native feature.

  • Precise motion control. There is no ControlNet equivalent, no skeleton-based pose input, and no camera trajectory specification. If you need controlled camera movement or precise character motion, look at Kling’s camera controls or open-source alternatives with ControlNet support.

  • Non-English prompts at high reliability. English prompts produce the most consistent results based on community testing. Non-English prompt quality is inconsistent across platforms.

  • SLA guarantees. Inference is provided by third-party platforms (Novita, WaveSpeed, Pollo). None of them are Vidu itself. Queue times are variable and no formal SLA is attached to the free or low-tier plans.

  • Real-time or near-real-time applications. Even “Fast” generation takes tens of seconds minimum for polling completion. This is batch processing territory, not interactive.

  • Deepfake detection avoidance testing or any deceptive identity use. The reference-image consistency feature makes realistic identity generation easy — ensure your use case is compliant with your platform’s ToS and applicable law before building with it.
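On the multi-clip stitching point above: ffmpeg's concat demuxer is one standard way to join same-codec clips losslessly. This sketch only builds the command line; it does not run ffmpeg, and it does nothing about the visual seams between independently generated clips.

```python
from pathlib import Path

def ffmpeg_concat_command(clip_paths: list[str], output: str,
                          list_file: str = "clips.txt") -> list[str]:
    """Write a concat list file and return an ffmpeg concat-demuxer command.

    Lossless stream copy (-c copy) only works if every clip shares codec,
    resolution, and frame rate -- true for same-settings outputs from one
    model run, but the seam between clips remains visible.
    """
    Path(list_file).write_text("".join(f"file '{p}'\n" for p in clip_paths))
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file,
            "-c", "copy", output]

cmd = ffmpeg_concat_command(["clip1.mp4", "clip2.mp4"], "stitched.mp4")
# pass cmd to subprocess.run(cmd, check=True) once ffmpeg is installed
```

Cross-fades or other seam-smoothing require re-encoding with filters instead of stream copy, which is a separate engineering effort.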


Minimal Working Code Example

This calls the Novita AI endpoint for Vidu Q2 reference-to-video. Replace the placeholder API key and the reference_image_url value with your own.

import time

import httpx

BASE = "https://api.novita.ai/v3/async/vidu-q2-reference-image-to-video"
RESULT = "https://api.novita.ai/v3/async/task-result"
HEADERS = {"Authorization": "Bearer YOUR_NOVITA_API_KEY", "Content-Type": "application/json"}

payload = {
    "reference_image_url": "https://your-cdn.com/subject.jpg",
    "prompt": "person walking through a sunlit park, cinematic",
    "duration": 4,
    "resolution": "720p",
    "aspect_ratio": "16:9",
}

# Submit the async job and grab its task ID.
r = httpx.post(BASE, json=payload, headers=HEADERS)
r.raise_for_status()
task_id = r.json()["task_id"]

# Poll for up to ~150 seconds (30 polls x 5 s each).
for _ in range(30):
    time.sleep(5)
    status = httpx.get(RESULT, params={"task_id": task_id}, headers=HEADERS).json()
    state = status.get("task", {}).get("status")
    if state == "TASK_STATUS_SUCCEED":
        print(status["videos"][0]["video_url"])
        break
    if state == "TASK_STATUS_FAILED":  # failure-state name assumed; verify against Novita docs
        raise RuntimeError(f"generation failed: {status}")
else:
    raise TimeoutError("job did not complete within the polling window")

Parameter names vary by platform. Check WaveSpeed and Pollo AI docs if you’re routing through those instead — field names differ slightly (ref_image vs reference_image_url, for example).
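One way to absorb those naming differences is a thin adapter layer. The mapping below is only partially confirmed (reference_image_url from the Novita example above, ref_image as mentioned for other platforms); verify each platform's actual schema before relying on it.

```python
# Hypothetical field-name map; extend per platform after checking its docs.
FIELD_MAPS = {
    "novita": {"reference": "reference_image_url"},
    "wavespeed": {"reference": "ref_image"},
}

def adapt_payload(platform: str, reference_url: str, prompt: str, **extra) -> dict:
    """Rename the reference-image field to match the target platform's schema."""
    key = FIELD_MAPS[platform]["reference"]
    return {key: reference_url, "prompt": prompt, **extra}

adapt_payload("novita", "https://cdn.example/s.jpg", "walking in rain", duration=4)
```

Centralizing the mapping keeps the rest of your pipeline platform-agnostic, so switching providers is a one-line config change rather than a refactor.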


Pricing + Platform Availability Summary

| Platform | Endpoint documented | Free tier | Notes |
|---|---|---|---|
| Novita AI | Yes (reference + start-end) | Credits on signup | Largest model catalog |
| WaveSpeed AI | Yes (image-to-video-fast + start-end-fast) | Yes | Clean API docs |
| Pollo AI | Yes (viduq2-pro) | Limited | Slightly different auth pattern |
| fal-ai | Yes (vidu/q2/reference-to-video/pro) | Pay-per-use | Good for rapid prototyping |

Conclusion

Vidu Q2-Pro-Fast’s reference-to-video endpoint fills a specific niche: subject-consistent video generation at a price point roughly 40–50% below Kling v3.0 Pro, with a structural approach to reference anchoring that differs from frame-locked image-to-video alternatives. The absence of official benchmark scores and the third-party-only API delivery are real limitations that should factor into any production evaluation; run your own quality tests against your specific content type before committing.

Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).


Frequently Asked Questions

What is the pricing for Vidu Q2-Pro-Fast reference-to-video API calls?

Pricing varies by inference platform. On Novita AI, a standard 4-second 720p clip runs approximately $0.08–$0.12 per generation, with 8-second 1080p clips around $0.16–$0.22; WaveSpeed AI lists roughly $0.10 and $0.18 for the same tiers, and Pollo AI may offer different rate structures including subscription tiers. Always verify current pricing directly with the platform, as API costs can change with resolution, duration, and provider rate updates.

What is the generation latency for Vidu Q2-Pro-Fast compared to the standard Q2-Pro model?

Vidu has not published official latency benchmarks distinguishing the Fast variant from standard Q2 Pro; "faster" is the inference platforms' claim. In practice, expect submission plus polling to take tens of seconds at minimum for a 4-second clip, with actual queue-to-delivery time depending heavily on queue depth at the provider rather than on model architecture alone. Treat this as batch-processing latency, not interactive.

How does Vidu Q2-Pro-Fast benchmark against competitors like Kling and Runway on subject consistency?

Shengshu has not published VBench or FID scores for Q2-Pro-Fast, so no verified head-to-head numbers exist. Community testing on fal-ai and Novita places its subject consistency in the medium-high range, while Kling v3.0 Pro currently leads on raw VBench metrics (~85.3 overall); comparable published consistency scores for Runway are not covered in this guide. Vidu's structural advantage is reference-style anchoring: it keeps a subject consistent without using the reference image as a literal first frame, which Kling's image-to-video mode does not offer.

What are the supported input parameters and resolution limits for the Vidu Q2-Pro-Fast reference-to-video endpoint?

Per platform documentation, the endpoint accepts 1–2 reference images plus a text prompt (English performs best; ~500 characters max, platform-enforced). Output is MP4 (H.264) at up to 1080p (1920×1080), with 720p also supported, in 16:9, 9:16, or 1:1 aspect ratios, at 4 or 8 seconds and 24 fps. Additional parameters such as seed or reference-weight controls vary by platform and are not consistently documented; check the provider's API reference before relying on them.

Tags

Vidu Q2-Pro-Fast · Reference to Video · Video API · Developer Guide · 2026
