Veo 3.1 Lite Start-End Frame to Video API: Complete Developer Guide
Google’s Veo 3.1 Lite introduces a specific capability that solves a persistent pain point in programmatic video generation: you can now define both the opening and closing frame of a clip, and the model fills in the transition. This guide covers the technical specs, API mechanics, benchmarks, pricing, and honest trade-offs — everything you need to decide whether this belongs in your production stack.
What Is Start-End Frame Generation?
Standard image-to-video APIs accept a single starting frame and let the model decide where to go. That works for open-ended scenes but breaks down the moment you need a specific outcome — a product landing in a hand, a door closing, a character reaching a destination.
Veo 3.1 Lite’s FIRST&LAST mode (also referred to as start-end frame generation) accepts two images: one for frame 0, one for the final frame. The model generates all intermediate frames to create a coherent transition between them. This is directly useful for storyboarding pipelines, looping content, and any workflow where the endpoint matters as much as the start.
The capability is accessible via the Gemini API (veo-3.1-generate-preview model endpoint) and through third-party platforms like WaveSpeed AI and EvoLink that wrap the same underlying model.
What’s New vs. Previous Versions
Google positions its video generation lineup as three tiers: Lite, Fast, and Pro (previously Veo 2 and Veo 3.0 occupied this space). Here’s what changed specifically in 3.1 Lite relative to Veo 3.0 and the original Veo Lite:
| Change | Veo 3.0 | Veo 3.1 Lite |
|---|---|---|
| Start-end frame control | Not available in Lite tier | Added (FIRST&LAST mode) |
| Tier positioning | Two-tier (Lite/Pro) | Three-tier (Lite/Fast/Pro) |
| API access method | Vertex AI only for production | Gemini API (generateVideos) |
| Prompt-only generation | Yes | Yes |
| Image-to-video (single frame) | Yes | Yes |
| Two-frame interpolation | No | Yes |
The introduction of FIRST&LAST in the Lite tier is the headline change. Previously, deterministic endpoint control required Pro-tier access. Google has not published specific quantitative latency or quality-score improvements between 3.0 and 3.1 Lite as of this writing — treat version-over-version benchmark claims from third parties with skepticism until Google releases official evals.
Full Technical Specifications
| Parameter | Specification |
|---|---|
| Model ID | veo-3.1-generate-preview |
| API surface | Gemini API (client.models.generateVideos) |
| Input modes | Text prompt / Single image (start frame) / Two images (start + end frame) |
| Output format | MP4 |
| Default resolution | 1280×720 (720p) |
| Aspect ratio | 16:9 (landscape default) |
| Default clip duration | ~8 seconds |
| Audio generation | Not available in Lite (Pro tier only) |
| Watermarking | SynthID watermark embedded |
| Availability | Preview (not GA as of mid-2025) |
| Access | Gemini API key; Vertex AI for enterprise |
| Geographic restrictions | Limited regions during preview |
| Image input format | JPEG, PNG (base64 or URI) |
| Person/face generation | Restricted (policy-gated) |
Notes on specs:
- The 8-second default is not user-configurable in the current preview API — you get a fixed clip length per generation request.
- SynthID watermarking is non-removable and applies to all outputs, including Lite tier.
- Resolution is fixed at 720p for Lite; Pro tier supports higher resolutions.
Minimal Working Code Example
```python
from google import genai
from google.genai import types
import time
import urllib.request

client = genai.Client(api_key="YOUR_API_KEY")

# Load the two control frames.
with open("start_frame.jpg", "rb") as f:
    start_image = types.Image(image_bytes=f.read(), mime_type="image/jpeg")
with open("end_frame.jpg", "rb") as f:
    end_image = types.Image(image_bytes=f.read(), mime_type="image/jpeg")

# FIRST&LAST generation: `image` is frame 0, `last_image` is the final frame.
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt="A smooth transition between the two scenes",
    image=start_image,
    last_image=end_image,
)

# generate_videos returns a long-running operation; poll until it completes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Download the finished MP4 (the returned URI is time-limited).
urllib.request.urlretrieve(operation.response.generated_videos[0].video.uri, "output.mp4")
```
This is a blocking poll loop — production code should use async handling and exponential backoff. Generation typically takes 2–4 minutes per clip during preview.
Benchmark Comparison
Google has not published VBench or FID scores specifically for Veo 3.1 Lite. The table below uses available public data and platform-reported figures; cells marked with — indicate no verified public data.
| Model | VBench Overall | FID (lower = better) | Two-frame interpolation | Native audio | Approx. latency |
|---|---|---|---|---|---|
| Veo 3.1 Lite | — | — | ✅ | ❌ | ~2–4 min |
| Veo 3.0 / Veo 2 | 84.27 (Veo 2, published) | — | ❌ | ❌ | ~2–3 min |
| Runway Gen-3 Alpha | 82.11 | — | ❌ (single frame) | ❌ | ~1–2 min |
| Kling 1.6 | 83.44 | — | ✅ (Pro tier) | ❌ | ~1–3 min |
| Sora (OpenAI) | — | — | ❌ | ❌ | ~2–5 min |
Context on these numbers:
- Veo 2’s VBench score of 84.27 was reported by Google in their technical release materials.
- Runway Gen-3 and Kling scores are from third-party VBench evaluations circulating in the research community — treat them as directional, not definitive.
- No independent VBench evaluation of Veo 3.1 Lite has been published as of this writing. If you need production-level quality comparisons, run your own eval set before committing.
What the data does tell you: Kling is the main competitor offering dual-frame interpolation at a comparable tier. Runway does not offer this natively at Gen-3. If two-frame control is your requirement, the realistic shortlist is Veo 3.1 Lite vs. Kling Pro.
Pricing vs. Alternatives
Veo 3.1 Lite is positioned as the cost-effective tier in Google’s lineup. Google has not published a public per-second or per-generation price for the preview period — pricing during preview is typically either free-with-rate-limits or negotiated enterprise terms.
| Model | Pricing model | Approx. cost per 8-sec clip | Two-frame mode |
|---|---|---|---|
| Veo 3.1 Lite | Per generation (preview pricing) | Not publicly listed | ✅ |
| Veo 3.1 Pro | Per generation | Higher than Lite | ✅ |
| Kling 1.6 Standard | Credits (~$0.14/credit, ~5 credits/clip) | ~$0.70 | ✅ Pro only |
| Runway Gen-3 Alpha | Per second (~$0.05/sec) | ~$0.40 | ❌ |
| Pika 2.1 | Subscription + credits | ~$0.50–1.00 | ❌ |
WaveSpeed AI and similar platforms that wrap Veo 3.1 Lite charge their own markup — check their current rate cards directly. The “Lite = cheaper than Pro” framing is consistent across all sources, but without a published $/clip number from Google, exact cost modeling isn’t possible until the API exits preview.
Practical implication: If you’re on the Gemini API during preview and staying within free-tier rate limits, cost is near-zero for prototyping. For production volume forecasting, contact Google Cloud sales or wait for GA pricing.
Best Use Cases
1. Storyboard-to-video pipelines: You have keyframe images from a storyboard or design tool. Feed frame 1 and frame N to get a rendered transition without manually animating between them. Ad agencies and game cinematic teams are the primary fit here.
2. Looping content: Set the start and end frame to be identical (or nearly identical). The model generates a motion loop that begins and ends at the same visual state. Useful for background video, ambient loops on web pages, or social content that autoplays.
3. E-commerce product transitions: Product A in frame 0, Product B in frame N. The model generates a smooth visual transition between them — applicable for product carousels, comparison videos, or lookbook content.
4. Controlled narrative sequences: You’re generating a multi-clip video programmatically and need clip N to end on the same frame that clip N+1 begins with. FIRST&LAST mode gives you that temporal continuity without manual compositing.
5. Prototyping interactive UI animations: Designers can define a before and after state as images and generate a plausible animated transition to evaluate timing and motion before committing to a hand-crafted animation.
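For the looping use case above, passing the same image as both `image` and `last_image` is all that's required. A minimal sketch — the helper name is illustrative; the keyword arguments match the SDK example earlier in this guide:

```python
def loop_request_kwargs(frame, prompt):
    """Build generate_videos kwargs for a seamless loop:
    the clip starts and ends on the exact same frame."""
    return {
        "model": "veo-3.1-generate-preview",
        "prompt": prompt,
        "image": frame,       # frame 0
        "last_image": frame,  # final frame is identical, so the clip loops cleanly
    }

# Usage with the Gemini SDK (requires a configured client and a types.Image):
# operation = client.models.generate_videos(**loop_request_kwargs(frame, "Gentle ambient motion"))
```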
Limitations and Cases Where You Should NOT Use This Model
Don’t use Veo 3.1 Lite if:
- You need audio. Audio generation is a Pro-tier feature. If your clip requires synchronized sound or voiceover, you’re on the wrong tier — use Veo 3.1 Pro or combine with a separate TTS/audio pipeline.
- You need resolution above 720p. Lite is capped at 1280×720. Production broadcast, 4K delivery, or large-format display requires the Pro tier.
- You need sub-60-second turnaround. Generation runs 2–4 minutes per clip in the current preview. Real-time or near-real-time applications are not viable.
- Your workflow requires clip duration control. The API does not expose a duration parameter in the current preview — you get ~8 seconds. If you need 4-second or 15-second clips, this doesn’t work.
- You need to generate people or faces at scale. Google applies policy restrictions on person/face generation. Automated pipelines that heavily feature human subjects will hit content policy friction.
- You’re outside supported regions. The preview has geographic restrictions. Check the Gemini API availability page before building — regional blocks mid-pipeline are painful to debug.
- You need guaranteed SLA. This is a preview, not GA. Preview endpoints can have undocumented rate limits, unexpected downtime, and breaking changes without the same notice as a GA product.
- You’re comparing on quality benchmarks. Without published VBench or independent eval scores for 3.1 Lite specifically, you cannot make a defensible quality argument. Run your own evals before a production commitment.
Integration Architecture Notes
When building a production pipeline around start-end frame generation, a few patterns matter:
Async polling vs. webhooks: The current generateVideos method returns a long-running operation object. You must poll it. There is no webhook support in the current preview API. Build your polling logic with exponential backoff starting at 30 seconds — polling every 5 seconds for a 3-minute operation burns quota and adds no value.
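The backoff schedule described above can be sketched as a small helper. The `client.operations.get` call matches the SDK usage in the earlier example; the timeout handling and delay cap are illustrative choices:

```python
import time

def backoff_delays(base=30.0, factor=2.0, cap=300.0):
    """Yield an exponential backoff schedule: 30s, 60s, 120s, ... capped at 5 minutes."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)

def wait_for_video(client, operation, timeout_s=1800):
    """Poll a long-running generate_videos operation with exponential backoff."""
    deadline = time.monotonic() + timeout_s
    for delay in backoff_delays():
        if operation.done:
            return operation
        if time.monotonic() + delay > deadline:
            raise TimeoutError("video generation exceeded the timeout")
        time.sleep(delay)
        operation = client.operations.get(operation)
```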
Image preprocessing: Both input images must match the target aspect ratio. Submit off-ratio images and the model will crop or distort them. Pre-process to 1280×720 before sending — don’t rely on server-side handling.
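The crop math for normalizing an arbitrary input to the 16:9 target can be sketched as a pure helper. The function name and the center-crop policy are illustrative — letterbox padding is an equally valid choice:

```python
def center_crop_box(width, height, ar_w=16, ar_h=9):
    """Return the (left, top, right, bottom) box that center-crops an image
    of the given pixel size down to a 16:9 aspect ratio."""
    if width * ar_h > height * ar_w:
        # Too wide: trim the sides.
        new_w = height * ar_w // ar_h
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Too tall (or already 16:9): trim top and bottom.
    new_h = width * ar_h // ar_w
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)
```

Feed the returned box to Pillow's `Image.crop`, then resize the result to 1280×720 before encoding and submitting.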
Output storage: The URI returned for the generated video has a time-limited validity window. Download and store the MP4 to your own storage (GCS, S3) immediately after generation. Don’t store the URI as a persistent asset reference.
Error handling: The preview API does not always return clean error codes for content policy violations — some rejections surface as generic operation failures. Log full operation response objects during development, not just the video URI.
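One way to capture the full outcome for debugging, assuming the operation object exposes `done`, `error`, and `response` attributes as in the SDK example above (the helper itself is illustrative):

```python
def operation_log_record(operation):
    """Flatten a long-running operation into a dict for structured logging.
    Policy rejections can surface as generic failures, so keep the raw error."""
    record = {"done": getattr(operation, "done", None)}
    error = getattr(operation, "error", None)
    if error:
        record["error"] = error
    response = getattr(operation, "response", None)
    if response is not None:
        record["response"] = repr(response)
    return record
```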
Conclusion
Veo 3.1 Lite’s start-end frame mode is a genuinely useful addition for developers who need deterministic clip endpoints without paying Pro-tier prices — the FIRST&LAST capability closes a real gap that previously forced projects to either accept unconstrained generation or upgrade to a more expensive tier. Until Google publishes GA pricing, independent VBench scores, and a configurable duration parameter, keep it in your prototyping stack while monitoring the roadmap before betting production volume on it.
Sources: Google AI Developer Blog — Veo 3.1 Lite, Gemini API Video Generation Docs, WaveSpeed AI — Veo 3.1 Lite Launch, EvoLink Veo 3.1 API Reference, APIyi Veo 3.1 Lite Guide
Frequently Asked Questions
How much does Veo 3.1 Lite FIRST&LAST mode cost per video generation via API?
Google has not published per-second or per-generation pricing for Veo 3.1 Lite during the preview period. Preview access via the Gemini API is typically free within rate limits, and enterprise access through Vertex AI is negotiated. Lite is consistently positioned as cheaper than Veo 3.1 Pro, but exact cost-per-clip modeling is not possible until GA pricing is published.
What is the API latency for Veo 3.1 Lite start-end frame video generation?
During the preview, generation typically takes 2–4 minutes per 8-second clip. That rules out real-time or near-real-time use, but it is workable for asynchronous batch pipelines that poll the long-running operation. Google has not published official latency figures or an SLA for the preview endpoint.
What input image resolution and format does the Veo 3.1 Lite API require for start and end frames?
The API accepts JPEG and PNG images (base64 or URI) for both the start and end frame parameters. Lite output is fixed at 1280×720 with a 16:9 aspect ratio, so pre-process both input images to 1280×720 before sending — off-ratio images may be cropped or distorted server-side, which hurts motion coherence between the two frames.
How does Veo 3.1 Lite FIRST&LAST mode compare to standard image-to-video in motion coherence benchmarks?
Google has not published benchmark scores (VBench, FVD, or otherwise) for Veo 3.1 Lite’s FIRST&LAST mode. Intuitively, a deterministic target endpoint constrains the interpolation problem relative to open-ended image-to-video, but any quantitative quality comparison should come from your own eval set on representative transition tasks.