
AI API Playbook · 10 min read

Veo 3.1 Lite Start-End Frame to Video API: Complete Developer Guide

Google’s Veo 3.1 Lite introduces a specific capability that solves a persistent pain point in programmatic video generation: you can now define both the opening and closing frame of a clip, and the model fills in the transition. This guide covers the technical specs, API mechanics, benchmarks, pricing, and honest trade-offs — everything you need to decide whether this belongs in your production stack.


What Is Start-End Frame Generation?

Standard image-to-video APIs accept a single starting frame and let the model decide where to go. That works for open-ended scenes but breaks down the moment you need a specific outcome — a product landing in a hand, a door closing, a character reaching a destination.

Veo 3.1 Lite’s FIRST&LAST mode (also referred to as start-end frame generation) accepts two images: one for frame 0, one for the final frame. The model generates all intermediate frames to create a coherent transition between them. This is directly useful for storyboarding pipelines, looping content, and any workflow where the endpoint matters as much as the start.

The capability is accessible via the Gemini API (veo-3.1-generate-preview model endpoint) and through third-party platforms like WaveSpeed AI and EvoLink that wrap the same underlying model.


What’s New vs. Previous Versions

Google positions its video generation lineup as three tiers: Lite, Fast, and Pro (previously Veo 2 and Veo 3.0 occupied this space). Here’s what changed specifically in 3.1 Lite relative to Veo 3.0 and the original Veo Lite:

| Change | Veo 3.0 | Veo 3.1 Lite |
| --- | --- | --- |
| Start-end frame control | Not available in Lite tier | Added (FIRST&LAST mode) |
| Tier positioning | Two-tier (Lite/Pro) | Three-tier (Lite/Fast/Pro) |
| API access method | Vertex AI only for production | Gemini API (generateVideos) |
| Prompt-only generation | Yes | Yes |
| Image-to-video (single frame) | Yes | Yes |
| Two-frame interpolation | No | Yes |

The introduction of FIRST&LAST in the Lite tier is the headline change. Previously, deterministic endpoint control required Pro-tier access. Google has not published specific quantitative latency or quality-score improvements between 3.0 and 3.1 Lite as of this writing — treat version-over-version benchmark claims from third parties with skepticism until Google releases official evals.


Full Technical Specifications

| Parameter | Specification |
| --- | --- |
| Model ID | veo-3.1-generate-preview |
| API surface | Gemini API (client.models.generateVideos) |
| Input modes | Text prompt / Single image (start frame) / Two images (start + end frame) |
| Output format | MP4 |
| Default resolution | 1280×720 (720p) |
| Aspect ratio | 16:9 (landscape default) |
| Default clip duration | ~8 seconds |
| Audio generation | Not available in Lite (Pro tier only) |
| Watermarking | SynthID watermark embedded |
| Availability | Preview (not GA as of mid-2025) |
| Access | Gemini API key; Vertex AI for enterprise |
| Geographic restrictions | Limited regions during preview |
| Image input format | JPEG, PNG (base64 or URI) |
| Person/face generation | Restricted (policy-gated) |

Notes on specs:

  • The 8-second default is not user-configurable in the current preview API — you get a fixed clip length per generation request.
  • SynthID watermarking is non-removable and applies to all outputs, including Lite tier.
  • Resolution is fixed at 720p for Lite; Pro tier supports higher resolutions.

Minimal Working Code Example

```python
from google import genai
from google.genai import types
import time

client = genai.Client(api_key="YOUR_API_KEY")

# Load the two keyframes: frame 0 and the final frame.
with open("start_frame.jpg", "rb") as f:
    start_image = types.Image(image_bytes=f.read(), mime_type="image/jpeg")
with open("end_frame.jpg", "rb") as f:
    end_image = types.Image(image_bytes=f.read(), mime_type="image/jpeg")

# The end frame goes in the generation config; the exact field name
# (last_frame) may differ across google-genai SDK versions, so check
# your installed version's GenerateVideosConfig.
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt="A smooth transition between the two scenes",
    image=start_image,
    config=types.GenerateVideosConfig(last_frame=end_image),
)

# generate_videos returns a long-running operation; poll until done.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Download via the SDK rather than fetching the raw URI, which is
# time-limited and requires authentication.
video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("output.mp4")
```

This is a blocking poll loop — production code should use async handling and exponential backoff. Generation typically takes 2–4 minutes per clip during preview.
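The blocking loop can be replaced with a bounded poller that backs off between checks. This is a minimal sketch, not SDK code: the names `backoff_delays` and `poll_until_done` are illustrative, and the `refresh`/`is_done` callables are injected so the logic works with any client (wire `refresh` to `client.operations.get(...)` in practice).

```python
import time

def backoff_delays(initial=30.0, factor=2.0, cap=120.0):
    """Yield an exponentially growing, capped sequence of sleep intervals."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay *= factor

def poll_until_done(refresh, is_done, timeout=600.0, sleep=time.sleep, delays=None):
    """Poll a long-running operation until is_done(op) or timeout.

    refresh: callable returning the latest operation object
    is_done: callable taking the operation, returning bool
    """
    waited = 0.0
    op = refresh()
    for delay in (delays or backoff_delays()):
        if is_done(op):
            return op
        if waited >= timeout:
            raise TimeoutError(f"operation not done after {waited:.0f}s")
        sleep(delay)
        waited += delay
        op = refresh()
```

Starting at 30 seconds and capping at two minutes keeps quota usage low for a 2-4 minute generation while still catching completion reasonably quickly.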


Benchmark Comparison

Google has not published VBench or FID scores specifically for Veo 3.1 Lite. The table below uses available public data and platform-reported figures; cells marked "—" indicate no verified public data.

| Model | VBench Overall | FID (lower = better) | Two-frame interpolation | Native audio | Approx. latency |
| --- | --- | --- | --- | --- | --- |
| Veo 3.1 Lite | — | — | ✅ | ❌ | ~2–4 min |
| Veo 3.0 / Veo 2 | 84.27 (Veo 2, published) | — | ❌ | — | ~2–3 min |
| Runway Gen-3 Alpha | 82.11 | — | ❌ (single frame) | — | ~1–2 min |
| Kling 1.6 | 83.44 | — | ✅ (Pro tier) | — | ~1–3 min |
| Sora (OpenAI) | — | — | — | — | ~2–5 min |

Context on these numbers:

  • Veo 2’s VBench score of 84.27 was reported by Google in their technical release materials.
  • Runway Gen-3 and Kling scores are from third-party VBench evaluations circulating in the research community — treat them as directional, not definitive.
  • No independent VBench evaluation of Veo 3.1 Lite has been published as of this writing. If you need production-level quality comparisons, run your own eval set before committing.

What the data does tell you: Kling is the main competitor offering dual-frame interpolation at a comparable tier. Runway does not offer this natively at Gen-3. If two-frame control is your requirement, the realistic shortlist is Veo 3.1 Lite vs. Kling Pro.


Pricing vs. Alternatives

Veo 3.1 Lite is positioned as the cost-effective tier in Google’s lineup. Google has not published a public per-second or per-generation price for the preview period — pricing during preview is typically either free-with-rate-limits or negotiated enterprise terms.

| Model | Pricing model | Approx. cost per 8-sec clip | Two-frame mode |
| --- | --- | --- | --- |
| Veo 3.1 Lite | Per generation (preview pricing) | Not publicly listed | ✅ |
| Veo 3.1 Pro | Per generation | Higher than Lite | ✅ |
| Kling 1.6 Standard | Credits (~$0.14/credit, ~5 credits/clip) | ~$0.70 | ✅ (Pro only) |
| Runway Gen-3 Alpha | Per second (~$0.05/sec) | ~$0.40 | ❌ |
| Pika 2.1 | Subscription + credits | ~$0.50–1.00 | — |
WaveSpeed AI and similar platforms that wrap Veo 3.1 Lite charge their own markup — check their current rate cards directly. The “Lite = cheaper than Pro” framing is consistent across all sources, but without a published $/clip number from Google, exact cost modeling isn’t possible until the API exits preview.

Practical implication: If you’re on the Gemini API during preview and staying within free-tier rate limits, cost is near-zero for prototyping. For production volume forecasting, contact Google Cloud sales or wait for GA pricing.


Best Use Cases

1. Storyboard-to-video pipelines: You have keyframe images from a storyboard or design tool. Feed frame 1 and frame N to get a rendered transition without manually animating between them. Ad agencies and game cinematic teams are the primary fit here.

2. Looping content: Set the start and end frame to be identical (or nearly identical). The model generates a motion loop that begins and ends at the same visual state. Useful for background video, ambient loops on web pages, or social content that autoplays.

3. E-commerce product transitions: Product A in frame 0, Product B in frame N. The model generates a smooth visual transition between them — applicable for product carousels, comparison videos, or lookbook content.

4. Controlled narrative sequences: You’re generating a multi-clip video programmatically and need clip N to end on the same frame that clip N+1 begins with. FIRST&LAST mode gives you that temporal continuity without manual compositing.

5. Prototyping interactive UI animations: Designers can define a before and after state as images and generate a plausible animated transition to evaluate timing and motion before committing to a hand-crafted animation.
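The multi-clip continuity pattern in use case 4 reduces to pairing consecutive keyframes. A small helper (hypothetical, not part of any SDK) makes the invariant explicit:

```python
def clip_pairs(keyframes):
    """Pair consecutive keyframes so clip N's end frame is exactly
    clip N+1's start frame, guaranteeing frame-level continuity."""
    if len(keyframes) < 2:
        raise ValueError("need at least two keyframes")
    return list(zip(keyframes, keyframes[1:]))

# Each (start, end) pair becomes one FIRST&LAST generation request.
pairs = clip_pairs(["shot_01.jpg", "shot_02.jpg", "shot_03.jpg"])
# -> [("shot_01.jpg", "shot_02.jpg"), ("shot_02.jpg", "shot_03.jpg")]
```

Because every interior keyframe is reused as both an end frame and the next start frame, the stitched sequence has no visual seam at clip boundaries.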


Limitations and Cases Where You Should NOT Use This Model

Don’t use Veo 3.1 Lite if:

  • You need audio. Audio generation is a Pro-tier feature. If your clip requires synchronized sound or voiceover, you’re on the wrong tier — use Veo 3.1 Pro or combine with a separate TTS/audio pipeline.

  • You need resolution above 720p. Lite is capped at 1280×720. Production broadcast, 4K delivery, or large-format display requires the Pro tier.

  • You need sub-60-second turnaround. Generation runs 2–4 minutes per clip in the current preview. Real-time or near-real-time applications are not viable.

  • Your workflow requires clip duration control. The API does not expose a duration parameter in the current preview — you get ~8 seconds. If you need 4-second or 15-second clips, this doesn’t work.

  • You need to generate people or faces at scale. Google applies policy restrictions on person/face generation. Automated pipelines that heavily feature human subjects will hit content policy friction.

  • You’re outside supported regions. The preview has geographic restrictions. Check the Gemini API availability page before building — regional blocks mid-pipeline are painful to debug.

  • You need guaranteed SLA. This is a preview, not GA. Preview endpoints can have undocumented rate limits, unexpected downtime, and breaking changes without the same notice as a GA product.

  • You’re comparing on quality benchmarks. Without published VBench or independent eval scores for 3.1 Lite specifically, you cannot make a defensible quality argument. Run your own evals before a production commitment.


Integration Architecture Notes

When building a production pipeline around start-end frame generation, a few patterns matter:

Async polling vs. webhooks: The current generateVideos method returns a long-running operation object. You must poll it. There is no webhook support in the current preview API. Build your polling logic with exponential backoff starting at 30 seconds — polling every 5 seconds for a 3-minute operation burns quota and adds no value.

Image preprocessing: Both input images must match the target aspect ratio. Submit off-ratio images and the model will crop or distort them. Pre-process to 1280×720 before sending — don’t rely on server-side handling.
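One way to enforce the ratio client-side, sketched here with Pillow. This pads (letterboxes) rather than crops; which you prefer depends on whether your keyframes can tolerate black bars or lost edges.

```python
from PIL import Image

TARGET = (1280, 720)  # Veo 3.1 Lite's fixed 16:9 output resolution

def letterbox(img, size=TARGET, fill=(0, 0, 0)):
    """Scale an image to fit inside `size` without distortion,
    then pad with `fill` so the result is exactly `size`."""
    scale = min(size[0] / img.width, size[1] / img.height)
    new_w, new_h = round(img.width * scale), round(img.height * scale)
    resized = img.resize((new_w, new_h), Image.LANCZOS)
    canvas = Image.new("RGB", size, fill)
    canvas.paste(resized, ((size[0] - new_w) // 2, (size[1] - new_h) // 2))
    return canvas
```

Run both the start and end frame through the same function so the pair shares identical dimensions and framing.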

Output storage: The URI returned for the generated video has a time-limited validity window. Download and store the MP4 to your own storage (GCS, S3) immediately after generation. Don’t store the URI as a persistent asset reference.
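A minimal download-immediately helper, assuming a plain fetchable URI (the function name and local-disk destination are illustrative; swap the local write for a GCS or S3 upload in production):

```python
import pathlib
import shutil
import urllib.request

def persist_video(uri, dest_dir="videos", name="clip.mp4"):
    """Download a generated video as soon as the operation completes.

    The returned URI expires, so the bytes, not the URI, are the
    persistent asset."""
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    path = dest / name
    with urllib.request.urlopen(uri) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return path
```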

Error handling: The preview API does not always return clean error codes for content policy violations — some rejections surface as generic operation failures. Log full operation response objects during development, not just the video URI.
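A sketch of that logging habit. The field names (`done`, `error`, `response`) follow the long-running-operation convention but may differ by SDK version, so treat them as assumptions and adjust to whatever your client actually returns:

```python
import json
import logging

log = logging.getLogger("veo")

def log_operation(op):
    """Capture the whole operation for post-mortem debugging, not just
    the video URI; policy rejections may only be visible in `error`."""
    record = {
        "done": getattr(op, "done", None),
        "error": getattr(op, "error", None),
        "response": repr(getattr(op, "response", None)),
    }
    log.warning("veo operation: %s", json.dumps(record, default=str))
    return record
```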


Conclusion

Veo 3.1 Lite’s start-end frame mode is a genuinely useful addition for developers who need deterministic clip endpoints without paying Pro-tier prices — the FIRST&LAST capability closes a real gap that previously forced projects to either accept unconstrained generation or upgrade to a more expensive tier. Until Google publishes GA pricing, independent VBench scores, and a configurable duration parameter, keep it in your prototyping stack while monitoring the roadmap before betting production volume on it.


Sources: Google AI Developer Blog — Veo 3.1 Lite, Gemini API Video Generation Docs, WaveSpeed AI — Veo 3.1 Lite Launch, EvoLink Veo 3.1 API Reference, APIyi Veo 3.1 Lite Guide


Frequently Asked Questions

How much does Veo 3.1 Lite FIRST&LAST mode cost per video generation via API?

Google has not published a public per-second or per-generation price for Veo 3.1 Lite during the preview period. Preview access is typically either free within rate limits or covered by negotiated enterprise terms. Third-party platforms that wrap the model (WaveSpeed AI, EvoLink) apply their own markup, so check their rate cards directly. For production volume forecasting, contact Google Cloud sales or wait for GA pricing.

What is the API latency for Veo 3.1 Lite start-end frame video generation?

Generation typically takes 2–4 minutes per 8-second, 720p clip in the current preview, and there is no webhook support, so plan for asynchronous polling with exponential backoff. At this latency, real-time and near-real-time applications are not viable; batch and queue-based pipelines are the realistic fit.

What input image resolution and format does the Veo 3.1 Lite API require for start and end frames?

The API accepts JPEG and PNG images (base64 or URI) for both the start and end frame. Pre-process both images to the 1280×720 (16:9) output resolution before sending: off-ratio inputs are cropped or distorted server-side, which degrades motion coherence. Both frames should share the same aspect ratio and framing.

How does Veo 3.1 Lite FIRST&LAST mode compare to standard image-to-video in motion coherence benchmarks?

Google has not published benchmark scores (VBench, FID, or FVD) for Veo 3.1 Lite, and no independent evaluation comparing FIRST&LAST mode against single-frame image-to-video exists as of this writing. The intuition is that a deterministic target endpoint constrains the interpolation problem and should improve transition coherence, but treat any specific score claims from third parties as unverified and run your own eval set before committing.
