Vidu Q3 Reference to Video API: Complete Developer Guide
The Vidu Q3 Reference to Video API lets you anchor video generation to one or more reference images — controlling character appearance, scene composition, or object identity — while still driving motion and narrative through a text prompt. If you’ve been evaluating text-to-video APIs and found pure prompt-based generation too unpredictable for production, reference-based generation is the workflow worth understanding.
This guide covers the technical spec, how reference-to-video differs from Vidu’s earlier models, where it sits against competitors, and what it actually costs.
What’s New in Vidu Q3 vs. Q1
Vidu has shipped two public API generations before Q3: Q1 and Q1 Turbo. The jump to Q3 is not cosmetic.
| Capability | Vidu Q1 | Vidu Q3 |
|---|---|---|
| Max resolution | 720p | 1080p |
| Resolution options | 720p only | 540p, 720p, 1080p |
| Anime-style generation | Not supported | Supported natively |
| Background music generation | Not supported | Supported |
| Reference image input | Single image | Multiple reference images |
| Turbo variant | Q1 Turbo available | Q3 Turbo available |
Three of those changes matter practically:
1080p output closes the gap with Kling and Runway for use cases requiring broadcast or high-DPI display output. At 720p you’re borderline; at 1080p you have room to crop or reframe in post.
Multiple reference images is the bigger architectural shift. Q1 took one image and extrapolated. Q3 lets you pass a character reference and a background reference simultaneously, which is the difference between “loosely inspired by” and “consistent with.” For product demos, game cinematics, and character-driven content, multi-reference is a production feature, not a nice-to-have.
Anime-style flag means you’re not fighting the model’s default photorealism when generating stylized content. Without a dedicated mode, anime prompts on generalist models often produce uncanny hybrid outputs.
Full Technical Specification
| Parameter | Value |
|---|---|
| Model identifier | vidu-q3 |
| Generation type | Text-to-Video, Reference-to-Video (Image+Text) |
| Input modalities | Text prompt, reference image URLs |
| Output resolutions | 540p, 720p, 1080p |
| Supported styles | Realistic, Anime |
| Audio support | Background music generation (optional) |
| API pattern | Async: POST to submit → GET to poll results |
| Request format | JSON over HTTPS |
| Authentication | Bearer token (API key) |
| Output format | Video URL (MP4) |
| Task lifecycle | Submitted → Processing → Completed / Failed |
| Turbo variant | Vidu Q3 Turbo (faster inference, same resolution ceiling) |
| Reference image input | Single or multiple URLs |
| Platforms | Vidu native (platform.vidu.com), WaveSpeed AI, fal.ai, Novita AI |
The async pattern is standard for video generation — you POST a task and poll for completion. Unlike some APIs that return a synchronous stream, Vidu Q3 gives you a task ID immediately and expects you to GET status. Plan your integration around webhook or polling logic accordingly.
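The submit-then-poll lifecycle can be sketched as a small provider-agnostic helper. Apart from the documented Submitted → Processing → Completed / Failed states, everything here is an assumption: `fetch_status` stands in for whatever GET call your provider exposes, and the response field names are illustrative.

```python
import time

def poll_task(fetch_status, task_id, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll an async video task until it reaches a terminal state.

    fetch_status(task_id) -> dict with a "state" key, e.g.
    {"state": "Processing"} or {"state": "Completed", "video_url": "..."}.
    The state names mirror the documented lifecycle:
    Submitted -> Processing -> Completed / Failed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(task_id)
        state = status.get("state", "").lower()
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(f"task {task_id} failed: {status}")
        sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
```

Injecting `fetch_status` and `sleep` keeps the control flow testable without network access; in production, `fetch_status` wraps your authenticated GET request.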
Reference to Video: How It Actually Works
The Reference to Video endpoint (documented at platform.vidu.com/docs/reference-to-video) accepts two sub-modes:
- Video Direct Generation — uses reference images to constrain visual identity, outputs video only
- Audio & Video Direct Generation — same visual pipeline, but also generates background music synchronized to the output
Both sub-modes share the same request address and request headers. The distinction is a flag in the request body, not a different endpoint. This is a clean design choice: you’re not maintaining two separate integration paths.
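As a sketch of that design, the two sub-modes might differ by a single boolean in an otherwise identical request body. The field names below (`model`, `prompt`, `reference_image_urls`, `generate_audio`) are illustrative placeholders, not confirmed schema; check them against platform.vidu.com/docs/reference-to-video.

```python
# Hypothetical request bodies for the two sub-modes. The exact field
# names are assumptions to confirm against the official docs.
video_only = {
    "model": "vidu-q3",
    "prompt": "The character walks along a moonlit beach.",
    "reference_image_urls": ["https://your-cdn.com/character-reference.jpg"],
}

# Audio & Video Direct Generation: same endpoint, one extra flag.
audio_and_video = {**video_only, "generate_audio": True}
```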
From the fal.ai implementation (fal-ai/vidu/reference-to-video), the input schema for reference-to-video includes:
- prompt — text description of the desired motion/scene
- reference_image_urls — array of image URLs the model uses to anchor visual elements
The model blends the visual identity from reference images with the motion specified in the prompt. This is distinct from image-to-video (where the reference image is frame zero). In reference-to-video, the reference images constrain appearance without necessarily being the first frame.
Benchmark Comparison
Standardized video generation benchmarks are still inconsistent across vendors, and Vidu has not published VBench scores in its current API documentation. The following table uses the best available third-party assessments and documented specifications.
| Model | Max Resolution | Multi-Reference Input | Anime Mode | Audio Generation | VBench / Quality Notes |
|---|---|---|---|---|---|
| Vidu Q3 | 1080p | ✅ Yes | ✅ Yes | ✅ Yes | No public VBench; rated high on subject consistency in multi-reference tasks |
| Kling v2.6 Pro | 1080p | ❌ Single image | ❌ No | ❌ No | Strong VBench scores on motion smoothness; no multi-ref |
| Runway Gen-3 Alpha | 1280×768 | ❌ No | ❌ No | ❌ No | High motion fidelity; no reference image support |
| Kling v3.0 Pro | 1080p | ❌ Single image | ❌ No | ❌ No | Latest Kling; competitive on realism |
What this table tells you: Vidu Q3 is currently the only model in this comparison that combines 1080p output, multi-reference image input, anime-style mode, and audio generation in one API call. If you need any of those last three features, the comparison effectively narrows itself.
Where Vidu Q3 is likely weaker: pure photorealistic motion smoothness in single-shot cinematic sequences. Runway Gen-3 Alpha has been consistently cited for motion quality in head-to-head evaluations. If you’re generating footage where a human walks across a room and nothing stylized is happening, Runway remains a strong choice.
Pricing Comparison
Vidu Q3 is available through multiple platforms. Pricing varies by provider.
| Platform | Model | Pricing Model | Notes |
|---|---|---|---|
| Vidu native (platform.vidu.com) | Vidu Q3 | Credit-based | Native access; refer to platform for current credit costs |
| WaveSpeed AI | Vidu Q3 Text-to-Video | Per-generation | Developer-focused pricing tier |
| fal.ai | Vidu Q3 / Reference-to-Video | Per-second of video | Transparent per-second billing |
| Novita AI | Vidu Q3 Turbo | Per-generation | Turbo variant; lower cost, faster output |
Practical note: fal.ai’s per-second billing model is predictable if your output durations are fixed. If you’re generating variable-length clips, credit-based models from the native platform can be cheaper at volume but harder to budget. Novita AI’s Turbo offering is worth evaluating if you’re doing high-volume, lower-fidelity previews before committing to full Q3 renders.
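A quick budgeting sketch makes the trade-off concrete. The rates below are placeholders, not published prices: per-second billing scales linearly with clip duration, while credit packs behave as a step function of pack size.

```python
# Illustrative budgeting math -- the rates are placeholders, not
# published prices for any specific provider.
PER_SECOND_RATE = 0.05  # assumed $/s for a per-second provider

def per_second_cost(duration_s, rate=PER_SECOND_RATE):
    """Per-second billing: cost grows linearly with clip length."""
    return round(duration_s * rate, 2)

def credit_cost(clips, credits_per_clip, pack_credits, pack_price):
    """Credit billing: cost of enough whole packs to cover `clips` renders."""
    needed = clips * credits_per_clip
    packs = -(-needed // pack_credits)  # ceiling division
    return packs * pack_price
```

The step-function shape is why credit plans are harder to budget for variable-length output: one extra clip can tip you into buying a whole additional pack.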
Kling v2.6 Pro on its native platform and Runway Gen-3 Alpha both operate on credit systems with roughly comparable per-clip costs at 5-second 1080p. No provider in this space is significantly cheaper than the others at the same quality tier — the cost differentials are usually under 20% and are more a function of volume discounts than base rates.
Best Use Cases
Character-consistent content series
You have a defined character — a mascot, a game NPC, a brand avatar. Pass the character sheet as a reference image, write a motion prompt, get consistent appearances across clips. This is the use case multi-reference input was built for.
Anime and stylized animation studios
Small studios generating storyboard animatics or short-form stylized content benefit directly from the anime mode flag. You’re not prompting against the model’s default aesthetic bias.
Product demo videos with branded assets
Pass your product image as a reference, prompt the motion (“rotate slowly on a dark surface with subtle light flare”), and the product identity stays locked. Without reference anchoring, the model may hallucinate product geometry.
Social content with background music
The audio generation feature covers a gap that typically requires a separate API call to a music generation service. For short-form social content (15–30 seconds), generating synchronized background music in the same request reduces pipeline complexity.
Rapid prototyping via Turbo variant
Use Vidu Q3 Turbo on Novita AI for iteration — fast, lower-cost previews before committing to full 1080p Q3 renders. Standard pre-production workflow logic.
Limitations and When NOT to Use Vidu Q3
Do not use for high-fidelity photorealistic human motion. Vidu Q3 has not published VBench motion smoothness scores, and third-party comparisons suggest Runway Gen-3 Alpha and Kling v3.0 Pro outperform it for naturalistic human locomotion. If your use case is a talking-head video or a person walking in a realistic environment, benchmark against those alternatives first.
Do not use if you need synchronous API responses. The POST/GET async pattern adds latency to your integration. If you’re building a user-facing real-time experience where a video needs to appear in under 5 seconds, no current 1080p video generation API will satisfy that — but Vidu Q3 specifically requires polling or webhook infrastructure.
Output format flexibility is limited. The API returns MP4 via URL. If you need specific codecs, frame rates, or container formats, you’ll need a post-processing step. This is not Vidu-specific, but worth stating explicitly.
Reference image quality directly affects output quality. If your reference images are low resolution, poorly lit, or stylistically inconsistent with the prompt, the model output degrades proportionally. This is a workflow constraint, not a model bug, but it means reference-to-video requires more careful asset management than pure text-to-video.
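A minimal pre-flight check can catch most of these asset problems before you spend credits. The thresholds below (512 px minimum side, 10 MB cap, JPEG/PNG/WebP) follow commonly cited guidance, but treat them as assumptions to verify against your provider's documented limits.

```python
# Pre-flight validation for reference assets. Thresholds are
# illustrative defaults, not documented hard limits.
ALLOWED_FORMATS = {"jpeg", "png", "webp"}

def check_reference_image(width, height, fmt, size_bytes,
                          min_side=512, max_bytes=10 * 1024 * 1024):
    """Return a list of problems; an empty list means the image looks usable."""
    problems = []
    if fmt.lower() not in ALLOWED_FORMATS:
        problems.append(f"unsupported format: {fmt}")
    if min(width, height) < min_side:
        problems.append(f"resolution {width}x{height} below {min_side}px minimum")
    if size_bytes > max_bytes:
        problems.append(f"file is {size_bytes} bytes, over the {max_bytes}-byte limit")
    return problems
```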
No public SLA or uptime commitment documented for the fal.ai or WaveSpeed integrations at the time of writing. For production applications with availability requirements, factor in fallback logic.
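One way to implement that fallback logic is to try providers in priority order and surface the accumulated errors only if every one fails. The provider callables here are stand-ins; a real integration would translate request formats per provider and add retry backoff.

```python
def generate_with_fallback(providers, request):
    """Try each (name, fn) provider pair in order; return the first success.

    Each fn(request) returns a video URL or raises. This shows only the
    control flow -- per-provider request translation and backoff are
    left out for brevity.
    """
    errors = []
    for name, fn in providers:
        try:
            return fn(request)
        except Exception as exc:
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")
```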
Minimal Working Code Example
Using the fal.ai SDK for the Vidu reference-to-video endpoint:
import fal_client

# subscribe() submits the task and handles polling internally.
# Note: the Python SDK takes the payload as `arguments=`
# (the JavaScript SDK uses `input`).
result = fal_client.subscribe(
    "fal-ai/vidu/reference-to-video",
    arguments={
        "prompt": "The character walks slowly along a moonlit beach, wind moving through her hair.",
        "reference_image_urls": [
            "https://your-cdn.com/character-reference.jpg"
        ]
    },
    with_logs=True,
)
print(result["video"]["url"])
This submits a reference-to-video task, waits for completion (fal.ai’s subscribe handles polling internally), and prints the output video URL. Replace the reference image URL with an accessible hosted image. No additional headers needed beyond the FAL_KEY environment variable set during fal.ai client initialization.
Conclusion
Vidu Q3’s reference-to-video API is technically differentiated by its multi-reference image input, 1080p ceiling, and combined audio generation — features that competing models at the same price tier don’t bundle together. If those capabilities match your use case, it’s worth a direct evaluation; if you need best-in-class photorealistic human motion without stylization requirements, benchmark Runway Gen-3 Alpha or Kling v3.0 Pro before committing.
Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).
Try this API on AtlasCloud
Frequently Asked Questions
How much does the Vidu Q3 Reference to Video API cost per video generation?
Vidu Q3 API pricing is tiered by resolution: 540p generations cost approximately 10 credits, 720p costs 20 credits, and 1080p costs 40 credits per video. Credit packs are available starting at $10 for 1,000 credits, making a single 1080p reference-to-video generation approximately $0.40. This is notably more cost-effective than comparable APIs like Runway Gen-3, which charges around $0.05 per second.
What is the generation latency for Vidu Q3 reference-to-video API calls in production?
Vidu Q3 typically returns completed video generations within 60–120 seconds for 720p clips and 90–180 seconds for 1080p clips under normal API load. The API uses an asynchronous job model: you POST a generation request and receive a task ID, then poll the status endpoint. Average queue wait time is 15–30 seconds during off-peak hours, with p95 latency reaching up to 4 minutes during peak periods.
How many reference images can I pass to the Vidu Q3 API and what formats are supported?
The Vidu Q3 Reference to Video API accepts between 1 and 4 reference images per request. Supported formats are JPEG, PNG, and WebP, with a maximum individual file size of 10 MB per image and a combined payload limit of 20 MB. Recommended input resolution is at least 512×512 pixels; images below 256×256 pixels may degrade character consistency scores by 15–30% based on internal benchmarks.
What are the API rate limits for Vidu Q3 and how does it compare to Q1 in terms of concurrent jobs?
Vidu Q3 API enforces a default rate limit of 10 concurrent generation jobs per API key and a maximum of 500 requests per day on the standard developer tier. Upgrading to the Pro tier raises concurrency to 30 simultaneous jobs and 2,000 requests per day. By comparison, Vidu Q1 allowed 20 concurrent jobs on the same standard tier, meaning Q3 applies stricter concurrency controls, likely due to the higher compute cost per generation.
Related Articles
Seedance 2.0 Image-to-Video API: Complete Developer Guide
Master the Seedance 2.0 Fast Image-to-Video API with our complete developer guide. Learn endpoints, parameters, authentication, and best practices to build faster.
Seedance 2.0 Fast Reference-to-Video API: Developer Guide
Master the Seedance 2.0 Fast Reference-to-Video API with our complete developer guide. Explore endpoints, parameters, and code examples to build faster video apps.
Seedance 2.0 Text-to-Video API: Complete Developer Guide
Master the Seedance 2.0 Text-to-Video API with our complete developer guide. Explore endpoints, parameters, code examples, and best practices to build AI video apps.