---
title: "Google Veo 3.1 Lite Start-End Frame to Video API: Complete Developer Guide"
description: "Technical deep-dive into the Google Veo 3.1 Lite start-end frame to video API — specs, benchmarks, pricing, code, and honest limitations."
date: 2025-07-10
tags: [google, veo, video-generation, api, generative-ai]
---
# Google Veo 3.1 Lite Start-End Frame to Video API: Complete Developer Guide
The Google Veo 3.1 Lite API adds a specific capability that many production video pipelines have been waiting for: give it a start frame and an end frame, and it generates a coherent video that bridges the two. No prompt-only guesswork, no manual keyframing. This guide covers the exact API surface, what changed from Veo 3, honest benchmark comparisons, pricing, and where this model fails — so you can make a real adoption decision.
## What Is the Start-End Frame Feature?
Standard image-to-video APIs accept one reference image and animate outward from it. Veo 3.1 Lite extends this by accepting both a first frame and a last frame, then generating the interpolated motion in between. The model infers camera movement, object trajectory, and lighting transitions to produce a seamless clip (WaveSpeed AI docs; YouTube demo).
This is meaningful for:
- Product shots where you control the exact start and end composition
- Cinematic transitions between two storyboard panels
- Controlled B-roll generation where the editor already knows the cut-in and cut-out frame
The model also generates audio natively alongside the video — not a post-processing step, but a co-generated output.
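In request terms, the difference between the two modes is a second image field. The sketch below is illustrative only: the field names follow WaveSpeed's published schema (`first_frame_image`, `last_frame_image`), and other providers may name them differently.

```python
# Sketch of the two input modes. Field names follow WaveSpeed's schema
# and may differ on other providers -- treat them as illustrative.

single_image_payload = {
    "prompt": "Slow dolly-in on a ceramic vase, soft morning light",
    # One reference frame; the model animates outward from it.
    "image": "https://example.com/vase.jpg",
}

start_end_payload = {
    "prompt": "Slow dolly-in on a ceramic vase, soft morning light",
    # Exact opening and closing compositions; the model only
    # invents the motion between them.
    "first_frame_image": "https://example.com/vase_wide.jpg",
    "last_frame_image": "https://example.com/vase_close.jpg",
}
```

The second payload pins both ends of the clip, which is what makes the output usable in pipelines where the cut-in and cut-out frames are fixed in advance.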
## What’s New vs. Veo 3
Veo 3.1 and Veo 3.1 Lite were released together in paid preview via the Gemini API (Google Developers Blog). Google’s announcement lists several documented improvements over Veo 3:
| Dimension | Veo 3 | Veo 3.1 / Veo 3.1 Lite | Notes |
|---|---|---|---|
| Max native resolution | 720p | 1080p | Full HD now available natively (302.AI docs) |
| Start + end frame input | Not supported | Supported | New interpolation capability |
| Image-to-video quality | Baseline | Improved outputs | Stated in release notes; no specific delta score published yet |
| Audio generation | Yes | Yes | Carried over from Veo 3 |
| API availability | Paid preview | Paid preview (continued) | No change in access tier |
Google’s blog explicitly states “improved outputs when generating video from images” as a primary improvement area for 3.1 over 3. Specific quantitative deltas (e.g., FID improvement or motion score) have not been published by Google at time of writing. Where scores are not available from primary sources, this guide does not invent them.
## Full Technical Specifications
| Parameter | Value |
|---|---|
| Model identifier | google/veo-3.1-lite (varies by provider) |
| Output resolution | 720p or 1080p |
| Audio generation | Native (co-generated, not post-processed) |
| Input modes | Text prompt, single image, start + end frame pair |
| Interpolation support | Yes — start frame, end frame, or both |
| Output format | MP4 (H.264) |
| API access | Gemini API (paid preview); also available via AIML API, WaveSpeed AI, 302.AI |
| Typical clip length | Short-form (exact max duration not officially published) |
| Request parameters | Prompt string, image URIs or base64, resolution flag, aspect ratio |
| Authentication | API key (Gemini API) or provider-specific auth |
| Rate limits | Not publicly specified; subject to paid preview quotas |
The AIML API documentation confirms the parameter surface: “developers can generate videos by specifying a prompt, supplying reference images or first and last frames, and defining output parameters” (AIMLAPI.com).
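Since the spec table lists both URIs and base64 as accepted image inputs, a small helper like the following can prepare local frames for inline submission. This is a generic sketch, not an official client; whether a given provider expects a raw base64 string or a `data:` URI must be checked against its schema.

```python
import base64
from pathlib import Path

def image_to_base64(path: str) -> str:
    """Read a local image file and return a base64 string for inline
    submission. Whether the API wants a raw base64 string or a data:
    URI wrapper varies by provider -- check the schema before use."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")
```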
## Benchmark Comparison vs. Competitors
Published, apples-to-apples VBench or EvalCrafter numbers for Veo 3.1 Lite specifically are not available in third-party literature as of this writing. What follows uses the best available public data for Veo 3 (the base comparison point) and its primary competitors, then notes where Veo 3.1 Lite improvements apply.
### VBench Scores (Public Data, as of Q2 2025)
| Model | VBench Overall | Motion Smoothness | Semantic Consistency | Native Resolution | Audio |
|---|---|---|---|---|---|
| Veo 3.1 Lite | Not yet published | Not yet published | Not yet published | 720p / 1080p | ✅ Native |
| Veo 3 | Not yet published (Google proprietary) | — | — | 720p | ✅ Native |
| Sora (OpenAI) | Not yet published | — | — | Up to 1080p | ❌ No native audio |
| Kling 1.6 (Kuaishou) | ~83.6 (VBench, reported by Kuaishou) | High | High | 1080p | ❌ No native audio |
| Runway Gen-4 | Not published | — | — | 1080p | ❌ No native audio |
Bottom line on benchmarks: Google has not released VBench scores for Veo 3 or 3.1 Lite. Third-party independent evaluations of Veo 3 (via ELO-style rankings on platforms like Artificial Analysis) placed Veo 3 at or near the top for overall video quality and audio integration. Veo 3.1 Lite’s key differentiator against Kling 1.6 and Runway Gen-4 is native audio — both competitors require a separate audio generation step. The start-end frame interpolation feature has no direct equivalent in Sora or Runway’s current public API surfaces.
## Pricing vs. Alternatives
Veo 3.1 Lite is in paid preview. Official per-second or per-minute pricing from Google has not been posted publicly at time of writing. Third-party API providers have listed rates:
| Provider / Model | Pricing Model | Approximate Cost | Notes |
|---|---|---|---|
| Google Gemini API — Veo 3.1 Lite | Paid preview | Not publicly listed | Requires Gemini API access |
| AIML API — Veo 3.1 | Per generation | Listed on aimlapi.com | Check site for current rates |
| WaveSpeed AI — Veo 3.1 Lite | Per second of video | Listed on wavespeed.ai | Varies by resolution |
| 302.AI — Veo 3.1 | Credit-based | Listed on 302.ai | Credit pricing varies |
| Runway Gen-4 | Per second | ~$0.05/s (Standard) | No native audio |
| Kling 1.6 | Credit-based | ~$0.14–$0.28 per clip | No native audio |
| Sora (OpenAI) | Subscription + usage | Varies by plan | No native audio |
Important: If native audio generation eliminates a downstream TTS or music sync step from your pipeline, the effective cost comparison against competitors shifts materially in Veo 3.1 Lite’s favor — even if its raw per-second rate is higher.
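That effective-cost argument is easy to quantify. The sketch below uses entirely hypothetical rates to show the shape of the calculation; substitute your providers' real pricing.

```python
# Back-of-envelope pipeline cost comparison. Every rate below is a
# hypothetical placeholder -- substitute your providers' real pricing.

def clip_cost(video_rate_per_s: float, seconds: float,
              audio_step_cost: float = 0.0) -> float:
    """Total cost of one finished clip: video generation plus any
    separate audio generation/sync step the pipeline requires."""
    return video_rate_per_s * seconds + audio_step_cost

# Model with native audio at an assumed higher per-second rate:
native_audio = clip_cost(video_rate_per_s=0.40, seconds=8)  # 3.20
# Cheaper model that needs an assumed $3.50 audio pass per clip:
separate_audio = clip_cost(video_rate_per_s=0.05, seconds=8,
                           audio_step_cost=3.50)            # 3.90
```

With these (assumed) numbers the nominally cheaper model ends up more expensive per finished clip; the comparison flips or widens depending on what your audio step actually costs.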
## Best Use Cases
### 1. Controlled Product Cinematics
You photograph a product on a turntable at 0° and at 45°. Supply both as start and end frames. The API generates the intermediate rotation. This gives you production-quality video without a full video shoot, with deterministic start and end compositions.
### 2. Storyboard-to-Animatic Pipelines
Studios and agencies working with storyboard panels can use sequential panel pairs as start-end frames, generating rough animatic sequences that are structurally faithful to the original boards. Far more controllable than text-only generation.
### 3. Social Media B-Roll with Native Audio
Short-form content creators who need clips with ambient sound (e.g., ocean waves, city noise, crowd audio) can generate both video and audio in a single API call rather than layering tracks in post.
### 4. Prototyping Interactive Narrative Branches
Game studios prototyping branching cutscenes can define key narrative moments as frames and generate the transitions, testing multiple motion paths before committing to full production.
### 5. E-Commerce Lifestyle Video at Scale
Generating lifestyle context video for product catalogs — start frame: product in box, end frame: product in use — at 1080p, natively, at API scale.
## Limitations and Cases Where You Should NOT Use This Model
Be clear-eyed about what this API cannot do:
❌ Exact clip duration control is not publicly exposed. If your pipeline requires a precise 3.5-second clip, you cannot guarantee that output length at the API level as currently documented.
❌ No public SLA for generation latency. Paid preview status means Google has not published p50/p95 latency numbers. Do not build latency-sensitive user-facing features on top of this until stable.
❌ The model is not suitable for long-form content. Veo 3.1 Lite generates short clips. Stringing together multiple interpolated segments introduces visible seam artifacts unless carefully planned.
❌ Audio generation is co-generated, not controllable. You cannot specify an audio prompt, key, or tempo. If you need music or voiceover-synchronized audio, this is not the right tool.
❌ Paid preview access is restricted. If your organization has not been granted Gemini API paid preview access, this is not immediately available to you. Third-party providers (AIML API, WaveSpeed, 302.AI) offer routes around this, but they add a latency and cost layer.
❌ No fine-tuning or LoRA support. Style consistency across a large batch of clips requires careful prompt engineering, not model customization.
❌ Not suitable for faces requiring identity consistency. The model has no built-in identity-preservation mechanism. Generating a video where a specific person’s face must remain consistent across start and end frames is unreliable.
## Minimal Working Code Example
This example uses the WaveSpeed AI REST endpoint, which provides a documented interface for Veo 3.1 Lite image-to-video including start-end frame input. Replace the API key and image URLs with your own values.
```python
import time

import httpx

API_URL = "https://api.wavespeed.ai/api/v3/google/veo3.1-lite-image-to-video"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"}

payload = {
    "prompt": "A product slowly rotating on a white surface, smooth camera motion",
    "first_frame_image": "https://your-cdn.com/product_start.jpg",
    "last_frame_image": "https://your-cdn.com/product_end.jpg",
    "resolution": "1080p",
    "aspect_ratio": "16:9",
}

# Submit the generation job; the API responds immediately with a job ID.
r = httpx.post(API_URL, json=payload, headers=HEADERS, timeout=30)
r.raise_for_status()
job_id = r.json()["data"]["id"]

# Poll until the job finishes (up to ~5 minutes at 10-second intervals).
for _ in range(30):
    time.sleep(10)
    poll = httpx.get(f"{API_URL}/{job_id}", headers=HEADERS).json()
    status = poll["data"]["status"]
    if status == "completed":
        print(poll["data"]["outputs"][0])  # URL of the generated MP4
        break
    if status == "failed":
        raise RuntimeError(poll["data"].get("error", "generation failed"))
else:
    raise TimeoutError("job did not complete within the polling window")
```
Check the WaveSpeed API docs and 302.AI endpoint reference for the exact parameter schema before production use, as field names may differ slightly by provider.
## Conclusion
The Google Veo 3.1 Lite start-end frame to video API gives developers genuine structural control over generated video clips — a meaningful step beyond prompt-only generation — with native 1080p output and co-generated audio included. The main blockers for production adoption right now are the paid preview access gate, the absence of published latency SLAs, and the lack of third-party benchmark data to independently validate the quality claims; watch for those gaps to close as the model moves toward general availability.
Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).
## Frequently Asked Questions
### What is the pricing for Google Veo 3.1 Lite start-end frame to video API calls?

Google has not published official per-second pricing for the paid preview at time of writing. Veo models on Vertex AI are typically billed per second of generated video, and Lite is positioned as the cost-optimized tier below Veo 3.1 standard. Third-party providers (AIML API, WaveSpeed AI, 302.AI) list their own rates; verify current pricing on their sites and on Google's official pricing pages before committing to volume usage.
### What is the average latency for Veo 3.1 Lite start-end frame video generation API requests?

Veo 3.1 Lite video generation is an asynchronous long-running operation, not a synchronous response: the API returns a job or operation ID immediately, and you poll a status endpoint until the job completes. Google has not published p50/p95 latency figures for the paid preview. Anecdotally, a short clip typically takes on the order of one to a few minutes depending on resolution (720p vs. 1080p) and server load, so design your pipeline around polling with a generous timeout rather than a blocking request.
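Given the asynchronous model, a polling loop with exponential backoff is a reasonable default client pattern. This is a generic sketch, not tied to any particular provider's status endpoint; `check` stands in for whatever status query your provider exposes.

```python
import time

def poll_with_backoff(check, initial=5.0, factor=1.5,
                      max_wait=30.0, timeout=300.0):
    """Call `check()` until it returns a non-None result, sleeping
    with exponential backoff between attempts. `check` is any
    callable that queries the provider's job-status endpoint and
    returns the output when the job is done, or None while it is
    still running."""
    deadline = time.monotonic() + timeout
    wait = initial
    while time.monotonic() < deadline:
        result = check()
        if result is not None:
            return result
        time.sleep(wait)
        wait = min(wait * factor, max_wait)
    raise TimeoutError("generation did not finish within the timeout")
```

Backoff keeps early polls responsive for fast jobs while avoiding a hammering loop on slow ones.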
### How does Veo 3.1 Lite compare to Veo 3 in benchmark scores for start-end frame interpolation quality?

No published quantitative comparison exists at time of writing. Google's release notes state "improved outputs when generating video from images" but give no benchmark deltas, and no third-party VBench or EvalCrafter scores for Veo 3.1 Lite are available. Treat any specific score quoted for this comparison with skepticism until independent evaluations are published.
### What are the image format requirements and resolution limits for start and end frame inputs in the Veo 3.1 Lite API?

Requirements vary by provider. The Gemini API generally accepts JPEG or PNG reference images supplied as base64 data or hosted URIs, while third-party providers such as WaveSpeed AI accept public image URLs. As a practical rule, supply both frames at the same dimensions and the same aspect ratio as the requested output; mismatched frame dimensions are a common cause of validation errors. Check your specific provider's schema for exact size limits before production use.
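A small pre-flight check along these lines can catch dimension problems before you pay for a failed generation. The 1920x1080 ceiling used here is an assumption for illustration; substitute your provider's documented maximum.

```python
def fit_within(width: int, height: int,
               max_w: int = 1920, max_h: int = 1080) -> tuple[int, int]:
    """Scale (width, height) down to fit inside max_w x max_h while
    preserving aspect ratio; dimensions already within bounds pass
    through unchanged. The 1920x1080 default is an assumed limit --
    check your provider's documentation."""
    scale = min(max_w / width, max_h / height, 1.0)
    return (round(width * scale), round(height * scale))

def frames_match(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """Start and end frames should share identical dimensions."""
    return a == b

fit_within(3840, 2160)  # -> (1920, 1080)
```

Run both frames through the same `fit_within` call and assert `frames_match` before submitting, so a rejected request never reaches the billing meter.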