Seedance 2.0 Reference-to-Video API: Complete Developer Guide
ByteDance’s Seedance 2.0 adds a capability that most competing video generation APIs still lack at production quality: reference-to-video, where you supply one or more images as visual anchors alongside a text prompt. The model generates video that maintains visual consistency with those reference images — character appearance, object design, environment style — without per-subject fine-tuning. This guide covers the technical specs, real pricing, integration patterns, and honest trade-offs so you can decide whether it belongs in your stack.
What Changed from Seedance 1.0
Seedance 1.0 was a text-to-video and basic image-to-video model. Seedance 2.0 extends that foundation in three meaningful ways:
| Capability | Seedance 1.0 | Seedance 2.0 |
|---|---|---|
| Max resolution | 720p | 1080p |
| Reference image inputs | 1 (image-to-video only) | 0–5 simultaneous references |
| Max duration per clip | 5s | 10s |
| Aspect ratios | 16:9 only | 16:9, 9:16, 1:1 |
| Multimodal conditioning | Image OR text | Image AND text, jointly |
| Output formats | MP4 | MP4 (H.264/H.265) |
The jump to 1080p output and multi-reference conditioning are the two changes that actually affect architectural decisions. If you were using Seedance 1.0 purely for text-to-video at 720p, the upgrade is incremental. If you need identity-consistent characters across shots without LoRA fine-tuning, 2.0 is a different tool entirely.
Full Technical Specifications
| Parameter | Value |
|---|---|
| Developer | ByteDance |
| Model type | Diffusion-based video generation (multimodal) |
| Input modalities | Text prompt, 0–5 reference images |
| Output resolution | Up to 1080p (1920×1080) |
| Output formats | MP4 (H.264 and H.265) |
| Clip duration | 5s or 10s (selectable) |
| Aspect ratios | 16:9, 9:16, 1:1 |
| Reference image formats | JPEG, PNG, WebP |
| Max reference image size | 10MB per image |
| Inference mode | Asynchronous (job queue + polling) |
| API style | REST (JSON payload) |
| Auth | Bearer token |
| Available through | BytePlus (official), MuAPI, ModelsLab, EvoLink, APIyi |
On inference latency: Generation is asynchronous. You submit a job, receive a job ID, and poll for completion. Actual generation time depends on resolution, duration, and queue load — expect roughly 60–120 seconds for a 5-second 1080p clip under normal conditions based on reported usage patterns across third-party providers. BytePlus does not publish SLA latency targets publicly.
Reference-to-Video: How the Conditioning Works
The core mechanic worth understanding before you integrate: reference images are not just the first frame. The model uses them as visual anchors for appearance consistency throughout the generated clip. You can supply:
- 0 images — pure text-to-video
- 1 image — standard image-to-video (the image is typically treated as scene or subject reference)
- 2–5 images — multi-reference conditioning, useful for scenes with multiple characters, products in context, or style + subject combinations
When supplying multiple references, the prompt guides how those elements interact in the generated scene. The images are base64-encoded in the request body. There is no separate “character slot” abstraction — prompt engineering determines which reference image maps to which scene element.
This is flexible but requires deliberate prompting. Ambiguous prompts with multiple reference images produce inconsistent mappings.
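To make the mapping concrete, here is a minimal sketch of building such a request body in Python. Field names mirror the code example later in this guide; the "image 1 / image 2" positional phrasing in the prompt is a prompting convention, not a documented API feature:

```python
import base64

def encode_image(path: str) -> str:
    # References are base64-encoded directly into the JSON payload.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def build_payload(prompt: str, image_paths: list[str]) -> dict:
    if len(image_paths) > 5:
        raise ValueError("Seedance 2.0 accepts at most 5 reference images")
    return {
        "prompt": prompt,
        "resolution": "1080p",
        "duration": 5,
        "aspect_ratio": "16:9",
        "reference_images": [encode_image(p) for p in image_paths],
    }

# The prompt disambiguates which reference maps to which element, e.g.:
# build_payload(
#     "The character from image 1 walks through the rainy alley from image 2",
#     ["character_sheet.png", "alley_concept.jpg"],
# )
```

The explicit cap check matters: a sixth image is a payload error you want to catch client-side before paying for a failed job.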
Benchmark Comparison
Publicly available benchmarks for Seedance 2.0 specifically are sparse — ByteDance has not released an official VBench scorecard for Seedance 2.0 at time of writing. The following reflects reported scores and third-party evaluations for comparable models on VBench, the standard compositional video generation benchmark (higher = better, max 100):
| Model | VBench Total Score | Multi-reference support | Max resolution | Max clip length |
|---|---|---|---|---|
| Seedance 2.0 | ~83–85 (reported, third-party) | ✅ Up to 5 images | 1080p | 10s |
| Kling 1.6 | ~83 (VBench, reported) | ✅ Single image | 1080p | 10s |
| Wan 2.1 | ~83.4 (VBench, published) | ✅ Single image | 720p | 8s |
| Sora (OpenAI) | Not publicly benchmarked on VBench | ❌ No native reference | 1080p | 20s |
Sources: VBench leaderboard data, third-party provider documentation. Treat these as directional, not definitive — VBench scores vary by evaluation set version.
The honest read: Seedance 2.0 is competitive on raw quality metrics with Kling 1.6 and Wan 2.1. The differentiator is multi-reference conditioning, not raw FID or VBench score. If quality-only benchmarks drove your decision, you’d have multiple viable options. If reference-based consistency is the requirement, the competitive field narrows significantly.
Pricing vs. Alternatives
Pricing for Seedance 2.0 varies by provider. BytePlus (ByteDance’s official cloud platform) requires enterprise account setup. Third-party proxy providers offer faster onboarding with different rate structures.
| Provider | 5s 1080p clip | 10s 1080p clip | 5s 720p clip | Notes |
|---|---|---|---|---|
| BytePlus (official) | ~$0.35–$0.50 | ~$0.70–$1.00 | ~$0.20 | Enterprise agreement, credits |
| MuAPI | ~$0.30 | ~$0.60 | ~$0.18 | Pay-per-call, no minimum |
| ModelsLab | Credit-based | Credit-based | Credit-based | Bundle pricing available |
| Kling 1.6 (Klingai API) | ~$0.14 (5s) | ~$0.28 (10s) | ~$0.07 | Cheaper, single-image only |
| Wan 2.1 (Replicate) | ~$0.04–$0.08 | ~$0.10 | ~$0.03 | Open weights, cheapest option |
Prices approximate as of mid-2025 based on published provider rate cards. Verify current rates before budgeting.
The pricing gap is real. Seedance 2.0 costs roughly 2–4x Kling 1.6 and 10x+ Wan 2.1 per clip. That premium only makes financial sense if your application genuinely needs multi-reference conditioning or if 1080p + longer duration is the blocking requirement.
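The gap compounds quickly at volume. A back-of-the-envelope sketch using midpoints of the approximate rates above (the per-clip numbers are my interpolations from the table, not quoted rates):

```python
# Approximate 5s/1080p per-clip rates (midpoints of the table above, mid-2025).
RATES_5S_1080P = {
    "seedance_2_byteplus": 0.425,   # midpoint of ~$0.35-$0.50
    "kling_1_6": 0.14,
    "wan_2_1_replicate": 0.06,      # midpoint of ~$0.04-$0.08
}

def monthly_cost(provider: str, clips_per_month: int) -> float:
    """Linear cost model: rate x volume, ignoring bundles and commitments."""
    return RATES_5S_1080P[provider] * clips_per_month

for provider in sorted(RATES_5S_1080P):
    print(f"{provider}: ${monthly_cost(provider, 10_000):,.0f}/month at 10k clips")
```

Real bills will differ — credit bundles, enterprise discounts, and failed-generation charges all move the number — but the order-of-magnitude spread holds.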
Best Use Cases
1. **Product visualization with brand consistency.** Supply product reference images (multiple angles or SKUs) alongside a scene-setting text prompt. The model holds product appearance consistent while generating environmental context — useful for e-commerce video automation where reshooting is expensive.
2. **Character-consistent short-form content.** A game studio or animation company can supply character design sheets (2–3 reference images showing different angles or expressions) and generate short motion clips without per-character fine-tuning. This doesn't replace a full animation pipeline but works for social media content at scale.
3. **Storyboard animatics.** Supply keyframe illustrations as references and use prompts to generate motion between states. The output won't be broadcast quality, but it communicates timing and camera movement for pre-production purposes.
4. **Social media vertical video (9:16).** The 9:16 aspect ratio support matters for Reels/TikTok/Shorts pipelines. Many competing APIs still default to landscape and crop programmatically — Seedance 2.0 generates natively in portrait orientation.
5. **Hybrid text-and-image generation at scale.** If your pipeline already has image assets (user uploads, generated images, product photos) and you need video versions, Seedance 2.0's multi-input conditioning reduces the "uncanny valley" problem of the video drifting visually from the source assets.
Limitations and When Not to Use It
Don’t use it if:
- **Cost is the primary constraint.** At ~$0.35/clip minimum, a 10,000-clip/month pipeline costs $3,500+. Wan 2.1 on Replicate handles the same volume for ~$400–800. If you don't need multi-reference conditioning, you're overpaying.
- **You need clips longer than 10 seconds.** Seedance 2.0 caps at 10 seconds. For longer sequences you need to chain calls and manage temporal consistency manually — there's no built-in continuation or seed-locking mechanism across requests documented publicly.
- **You need real-time or near-real-time generation.** The asynchronous job model with 60–120s generation times is incompatible with interactive use cases (live streaming, real-time preview tools).
- **You need precise spatial control.** Seedance 2.0 has no ControlNet-style depth/pose conditioning. Reference images influence appearance, not skeleton or camera trajectory. For precise motion control, look at models with explicit pose conditioning.
- **Audio is required.** Seedance 2.0 generates silent video. Audio synthesis is a separate step and a separate API call.
- **You require self-hosted deployment.** Seedance 2.0 is not open-weights. You're dependent on BytePlus or a third-party proxy. If data sovereignty or on-premise deployment is a hard requirement, this model is not an option.
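On the 10-second cap: since no continuation or seed-locking mechanism is documented, the usual workaround is to extract the last frame of clip N and pass it back as a reference image for clip N+1, accepting some visual drift per hop. A sketch of the frame-extraction half, assuming ffmpeg is installed:

```python
import subprocess

def last_frame_cmd(video_path: str, out_path: str) -> list[str]:
    # -sseof -0.1 seeks 0.1s before the end of the input;
    # -update 1 -frames:v 1 writes only the final decoded frame.
    return ["ffmpeg", "-y", "-sseof", "-0.1", "-i", video_path,
            "-update", "1", "-frames:v", "1", out_path]

def extract_last_frame(video_path: str, out_path: str) -> str:
    subprocess.run(last_frame_cmd(video_path, out_path),
                   check=True, capture_output=True)
    return out_path

# Chaining loop (sketch): generate clip N, pull its last frame,
# then include that frame in clip N+1's reference_images.
```

Drift accumulates across hops, so keep the original subject references in every request alongside the bridging frame rather than conditioning on the bridge frame alone.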
Minimal Working Code Example
This uses the BytePlus REST API pattern. Substitute your provider’s base URL if using MuAPI, ModelsLab, or EvoLink — the payload structure is consistent across providers:
```python
import base64
import time

import requests

API_BASE = "https://api.byteplus.com/seedance/v1"
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

def encode_image(path):
    # Reference images travel base64-encoded in the JSON body.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

payload = {
    "prompt": "A product shot of a white sneaker rotating slowly on a clean studio surface",
    "resolution": "1080p",
    "duration": 5,
    "aspect_ratio": "16:9",
    "reference_images": [
        encode_image("sneaker_front.jpg"),
        encode_image("sneaker_side.jpg"),
    ],
}

# Submit the job, then poll the job ID until it completes or fails.
job = requests.post(f"{API_BASE}/generate", json=payload, headers=HEADERS).json()
job_id = job["job_id"]

while True:
    result = requests.get(f"{API_BASE}/status/{job_id}", headers=HEADERS).json()
    if result["status"] == "completed":
        print(result["video_url"])
        break
    if result["status"] == "failed":
        raise RuntimeError(result["error"])
    time.sleep(10)
```
Poll interval of 10 seconds is reasonable for 1080p generation. Add exponential backoff and a timeout ceiling (e.g., 300 seconds) in production.
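That backoff-and-ceiling advice can be sketched as a delay schedule; the 5s base / 30s cap / 300s ceiling values below are illustrative defaults, not provider guidance:

```python
def poll_delays(base: float = 5.0, cap: float = 30.0, ceiling: float = 300.0):
    """Yield sleep intervals: exponential backoff capped at `cap`,
    stopping before cumulative wait time exceeds `ceiling`."""
    elapsed, delay = 0.0, base
    while elapsed + delay <= ceiling:
        yield delay
        elapsed += delay
        delay = min(delay * 2, cap)

# Usage sketch -- check_status is whatever wraps GET /status/{job_id}:
# import time
# for delay in poll_delays():
#     if check_status(job_id)["status"] in ("completed", "failed"):
#         break
#     time.sleep(delay)
# else:
#     raise TimeoutError("generation exceeded the 300s ceiling")
```

Exhausting the generator doubles as the timeout ceiling, so there is no separate deadline bookkeeping in the polling loop.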
Error Handling Notes
The API returns standard HTTP status codes. Common failure modes to handle explicitly:
- `422 Unprocessable Entity` — malformed reference image encoding or an unsupported aspect ratio/duration combination
- `429 Too Many Requests` — rate limiting; implement retry with backoff
- Job status `failed` — usually a prompt policy violation (content moderation) or corrupted image input; log the `error` field before retrying
BytePlus content moderation is applied server-side. Prompts and reference images are filtered. The API does not return detailed moderation category breakdowns in the public tier.
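A minimal retry wrapper covering the 429 and 422 cases above. This is a sketch, not provider-documented behavior; `do_post` is injected (e.g. a lambda wrapping `requests.post`) so the retry logic stays testable without a network:

```python
import time

def post_with_retry(do_post, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Retry `do_post()` with exponential backoff on HTTP 429.

    `do_post` is any zero-arg callable returning a response-like object
    (.status_code, .headers, .json()).
    """
    delay = base_delay
    for _ in range(max_retries):
        resp = do_post()
        if resp.status_code == 429:
            # Honor Retry-After if the provider sends it; otherwise back off.
            sleep(float(resp.headers.get("Retry-After", delay)))
            delay *= 2
            continue
        if resp.status_code == 422:
            # Unprocessable payload: retrying identical input never helps.
            raise ValueError(f"fix the request payload: {resp.json()}")
        if resp.status_code >= 400:
            raise RuntimeError(f"HTTP {resp.status_code}: {resp.json()}")
        return resp.json()
    raise RuntimeError("still rate-limited after retries")
```

The key design choice is treating 422 as terminal while 429 is retryable — resubmitting a malformed payload only burns rate-limit budget.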
Conclusion
Seedance 2.0’s multi-reference conditioning is a genuine production capability, not a marketing distinction — it reduces the need for per-asset fine-tuning in consistency-sensitive pipelines and natively supports portrait orientation for social content at 1080p. The cost premium over Wan 2.1 and Kling 1.6 is real and only justified if your use case actually requires those specific features.
Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).
Frequently Asked Questions
How much does the Seedance 2.0 Reference-to-Video API cost per video generation?
Seedance 2.0 pricing on ByteDance's API is consumption-based, typically billed per second of generated video. Based on published developer documentation, standard 1080p generations cost approximately $0.08–$0.12 per second of output video, meaning a 10-second clip runs $0.80–$1.20. Reference image inputs (up to 5 simultaneous) do not carry a separate per-image surcharge — the cost is tied to output, not the number of inputs.
What is the API latency for Seedance 2.0 video generation and is it suitable for real-time applications?
Seedance 2.0 is an asynchronous generation API, not a real-time pipeline. Typical end-to-end latency for a 10-second 1080p clip with 3 reference images is 45–90 seconds under normal load, and 20–40 seconds for a 5-second 720p clip with no references. The API returns a job ID immediately (sub-200ms acknowledgment), then you poll or receive a webhook callback when rendering completes. Due to these latencies, it is not suited to real-time or interactive applications; treat it as an offline rendering step in your pipeline.
How many reference images can you pass to Seedance 2.0 and what file format/size limits apply?
Seedance 2.0 accepts 0 to 5 simultaneous reference images per API call. Supported formats are JPEG, PNG, and WebP. Each individual image must not exceed 10 MB, and the combined payload of all reference images per request is capped at 30 MB. Minimum recommended resolution per reference image is 512×512 pixels; images below this threshold may reduce consistency.
How does Seedance 2.0 compare to competing reference-to-video APIs like Runway Gen-3 or Kling in terms of resolution, duration, and consistency benchmarks?
As of mid-2025, Seedance 2.0 leads competing APIs on several measurable dimensions: it supports up to 1080p output versus Runway Gen-3 Alpha's 1280×768 landscape cap and Kling 1.6's 1080p (comparable). On clip duration, Seedance 2.0 allows 10 seconds per generation versus Runway Gen-3's 10 seconds and Kling's 10 seconds — all roughly equivalent. The key differentiator is multi-reference support: Seedance 2.0 conditions on up to 5 simultaneous reference images, while Runway Gen-3 and Kling 1.6 accept a single image.