---
title: "Google Veo 3.1 Lite Start-End Frame to Video API: Complete Developer Guide"
description: "Technical deep-dive into the Google Veo 3.1 Lite start-end frame to video API — specs, benchmarks, pricing, code, and honest limitations."
date: 2025-07-10
tags: [google, veo, video-generation, api, generative-ai]
---

Google Veo 3.1 Lite Start-End Frame to Video API: Complete Developer Guide

The Google Veo 3.1 Lite API adds a specific capability that many production video pipelines have been waiting for: give it a start frame and an end frame, and it generates a coherent video that bridges the two. No prompt-only guesswork, no manual keyframing. This guide covers the exact API surface, what changed from Veo 3, honest benchmark comparisons, pricing, and where this model fails — so you can make a real adoption decision.


What Is the Start-End Frame Feature?

Standard image-to-video APIs accept one reference image and animate outward from it. Veo 3.1 Lite extends this by accepting both a first frame and a last frame, then generating the interpolated motion in between. The model infers camera movement, object trajectory, and lighting transitions to produce a seamless clip (WaveSpeed AI docs; YouTube demo).

This is meaningful for:

  • Product shots where you control the exact start and end composition
  • Cinematic transitions between two storyboard panels
  • Controlled B-roll generation where the editor already knows the cut-in and cut-out frame

The model also generates audio natively alongside the video — not a post-processing step, but a co-generated output.


What’s New vs. Veo 3

Veo 3.1 and Veo 3.1 Lite were released together in paid preview via the Gemini API (Google Developers Blog). Google’s announcement lists several documented improvements over Veo 3:

| Dimension | Veo 3 | Veo 3.1 / Veo 3.1 Lite | Notes |
|---|---|---|---|
| Max native resolution | 720p | 1080p | Full HD now available natively (302.AI docs) |
| Start + end frame input | Not supported | Supported | New interpolation capability |
| Image-to-video quality | Baseline | Improved outputs | Stated in release notes; no specific delta score published yet |
| Audio generation | Yes | Yes | Carried over from Veo 3 |
| API availability | Paid preview | Paid preview (continued) | No change in access tier |
Google’s blog explicitly states “improved outputs when generating video from images” as a primary improvement area for 3.1 over 3. Specific quantitative deltas (e.g., FID improvement or motion score) have not been published by Google at time of writing. Where scores are not available from primary sources, this guide does not invent them.


Full Technical Specifications

| Parameter | Value |
|---|---|
| Model identifier | google/veo-3.1-lite (varies by provider) |
| Output resolution | 720p or 1080p |
| Audio generation | Native (co-generated, not post-processed) |
| Input modes | Text prompt, single image, start + end frame pair |
| Interpolation support | Yes — start frame, end frame, or both |
| Output format | MP4 (H.264) |
| API access | Gemini API (paid preview); also available via AIML API, WaveSpeed AI, 302.AI |
| Typical clip length | Short-form (exact max duration not officially published) |
| Request parameters | Prompt string, image URIs or base64, resolution flag, aspect ratio |
| Authentication | API key (Gemini API) or provider-specific auth |
| Rate limits | Not publicly specified; subject to paid preview quotas |

The AIML API documentation confirms the parameter surface: “developers can generate videos by specifying a prompt, supplying reference images or first and last frames, and defining output parameters” (AIMLAPI.com).
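As a concrete sketch of that parameter surface, the helper below assembles a request body from a prompt, optional frame references, and output settings. The field names (`first_frame_image`, `last_frame_image`, `resolution`, `aspect_ratio`) follow the WaveSpeed-style schema used later in this guide and are assumptions — verify them against your provider's docs.

```python
def build_veo_payload(prompt, first_frame=None, last_frame=None,
                      resolution="1080p", aspect_ratio="16:9"):
    """Assemble a Veo 3.1 Lite request body (field names are provider-assumed)."""
    if resolution not in ("720p", "1080p"):
        raise ValueError("Veo 3.1 Lite outputs 720p or 1080p only")
    payload = {"prompt": prompt, "resolution": resolution,
               "aspect_ratio": aspect_ratio}
    # Interpolation mode: supply either frame alone, or both for start-end bridging.
    if first_frame:
        payload["first_frame_image"] = first_frame
    if last_frame:
        payload["last_frame_image"] = last_frame
    return payload

p = build_veo_payload("A product slowly rotating on a white surface",
                      first_frame="https://example.com/start.jpg",
                      last_frame="https://example.com/end.jpg")
```

Centralizing payload construction like this makes it easy to swap field names per provider without touching the rest of the pipeline.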


Benchmark Comparison vs. Competitors

Published, apples-to-apples VBench or EvalCrafter numbers for Veo 3.1 Lite specifically are not available in third-party literature as of this writing. What follows uses the best available public data for Veo 3 (the base comparison point) and its primary competitors, then notes where Veo 3.1 Lite improvements apply.

VBench Scores (Public Data, as of Q2 2025)

| Model | VBench Overall | Motion Smoothness | Semantic Consistency | Native Resolution | Audio |
|---|---|---|---|---|---|
| Veo 3.1 Lite | Not yet published | Not yet published | Not yet published | 720p / 1080p | ✅ Native |
| Veo 3 | Not yet published (Google proprietary) | Not published | Not published | 720p | ✅ Native |
| Sora (OpenAI) | Not yet published | Not published | Not published | Up to 1080p | ❌ No native audio |
| Kling 1.6 (Kuaishou) | ~83.6 (VBench, reported by Kuaishou) | High | High | 1080p | ❌ No native audio |
| Runway Gen-4 | Not published | Not published | Not published | 1080p | ❌ No native audio |

Bottom line on benchmarks: Google has not released VBench scores for Veo 3 or 3.1 Lite. Third-party independent evaluations of Veo 3 (via ELO-style rankings on platforms like Artificial Analysis) placed Veo 3 at or near the top for overall video quality and audio integration. Veo 3.1 Lite’s key differentiator against Kling 1.6 and Runway Gen-4 is native audio — both competitors require a separate audio generation step. The start-end frame interpolation feature has no direct equivalent in Sora or Runway’s current public API surfaces.


Pricing vs. Alternatives

Veo 3.1 Lite is in paid preview. Official per-second or per-minute pricing from Google has not been posted publicly at time of writing. Third-party API providers have listed rates:

| Provider / Model | Pricing Model | Approximate Cost | Notes |
|---|---|---|---|
| Google Gemini API — Veo 3.1 Lite | Paid preview | Not publicly listed | Requires Gemini API access |
| AIML API — Veo 3.1 | Per generation | Listed on aimlapi.com | Check site for current rates |
| WaveSpeed AI — Veo 3.1 Lite | Per second of video | Listed on wavespeed.ai | Varies by resolution |
| 302.AI — Veo 3.1 | Credit-based | Listed on 302.ai | Credit pricing varies |
| Runway Gen-4 | Per second | ~$0.05/s (Standard) | No native audio |
| Kling 1.6 | Credit-based | ~$0.14–$0.28 per clip | No native audio |
| Sora (OpenAI) | Subscription + usage | Varies by plan | No native audio |

Important: If native audio generation eliminates a downstream TTS or music sync step from your pipeline, the effective cost comparison against competitors shifts materially in Veo 3.1 Lite’s favor — even if its raw per-second rate is higher.
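To make that trade-off concrete, the sketch below compares end-to-end per-clip cost under purely illustrative numbers (none of these rates are quoted prices): a native-audio model with a higher per-second rate versus a cheaper model that needs a separate audio-generation pass.

```python
def effective_clip_cost_cents(seconds, video_rate_cents_per_s, audio_step_cents=0):
    """Total pipeline cost for one clip: video generation plus any audio step."""
    return seconds * video_rate_cents_per_s + audio_step_cents

clip_len = 8  # seconds

# Hypothetical rates, in cents, for illustration only.
native_audio = effective_clip_cost_cents(clip_len, 10)       # audio included -> 80
separate_audio = effective_clip_cost_cents(clip_len, 5, 60)  # video + audio pass -> 100
```

Under these made-up numbers, the model with the cheaper per-second rate is still more expensive per clip once the audio step is counted — which is the point of the comparison above.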


Best Use Cases

1. Controlled Product Cinematics

You photograph a product on a turntable at 0° and at 45°. Supply both as start and end frames. The API generates the intermediate rotation. This gives you production-quality video without a full video shoot, with deterministic start and end compositions.

2. Storyboard-to-Animatic Pipelines

Studios and agencies working with storyboard panels can use sequential panel pairs as start-end frames, generating rough animatic sequences that are structurally faithful to the original boards. Far more controllable than text-only generation.
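One way to drive that pipeline: walk the ordered board and pair each panel with its successor, then submit each pair as one start-end frame job. The panel filenames below are hypothetical.

```python
def panel_pairs(panels):
    """Yield (start_frame, end_frame) tuples for adjacent storyboard panels."""
    return list(zip(panels, panels[1:]))

boards = ["p1.png", "p2.png", "p3.png", "p4.png"]
jobs = panel_pairs(boards)
# [('p1.png', 'p2.png'), ('p2.png', 'p3.png'), ('p3.png', 'p4.png')]
```

Because consecutive segments share an exact boundary frame, the resulting animatic cuts line up with the original board order by construction.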

3. Social Media B-Roll with Native Audio

Short-form content creators who need clips with ambient sound (e.g., ocean waves, city noise, crowd audio) can generate both video and audio in a single API call rather than layering tracks in post.

4. Prototyping Interactive Narrative Branches

Game studios prototyping branching cutscenes can define key narrative moments as frames and generate the transitions, testing multiple motion paths before committing to full production.

5. E-Commerce Lifestyle Video at Scale

Generating lifestyle context video for product catalogs — start frame: product in box, end frame: product in use — at 1080p, natively, at API scale.


Limitations and Cases Where You Should NOT Use This Model

Be clear-eyed about what this API cannot do:

❌ Exact clip duration control is not publicly exposed. If your pipeline requires a precise 3.5-second clip, you cannot guarantee that output length at the API level as currently documented.

❌ No public SLA for generation latency. Paid preview status means Google has not published p50/p95 latency numbers. Do not build latency-sensitive user-facing features on top of this until stable.

❌ The model is not suitable for long-form content. Veo 3.1 Lite generates short clips. Stringing together multiple interpolated segments introduces visible seam artifacts unless carefully planned.

❌ Audio generation is co-generated, not controllable. You cannot specify an audio prompt, key, or tempo. If you need music or voiceover-synchronized audio, this is not the right tool.

❌ Paid preview access is restricted. If your organization has not been granted Gemini API paid preview access, this is not immediately available to you. Third-party providers (AIML API, WaveSpeed, 302.AI) offer routes around this, but they add a latency and cost layer.

❌ No fine-tuning or LoRA support. Style consistency across a large batch of clips requires careful prompt engineering, not model customization.

❌ Not suitable for faces requiring identity consistency. The model has no built-in identity-preservation mechanism. Generating a video where a specific person’s face must remain consistent across start and end frames is unreliable.
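Given that exact duration isn't exposed at the API level, one practical workaround is to generate slightly long and trim in post. A sketch using the `ffmpeg` CLI via subprocess — this assumes ffmpeg is installed and the file paths are placeholders:

```python
import subprocess

def trim_cmd(src, dst, seconds):
    """Build an ffmpeg command that keeps only the first N seconds of the clip."""
    return ["ffmpeg", "-y", "-i", src,
            "-t", f"{seconds:.3f}",            # exact target duration
            "-c:v", "libx264", "-c:a", "aac",  # re-encode for a frame-accurate cut
            dst]

def trim_clip(src, dst, seconds):
    subprocess.run(trim_cmd(src, dst, seconds), check=True)

# Example (not executed here):
# trim_clip("veo_output.mp4", "clip_3500ms.mp4", 3.5)
```

Re-encoding is slower than a stream copy (`-c copy`) but avoids the keyframe-aligned cuts that make stream-copied trims imprecise.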


Minimal Working Code Example

This example uses the WaveSpeed AI REST endpoint, which provides a documented interface for Veo 3.1 Lite image-to-video including start-end frame input. Replace the API key and image URLs with your own values.

```python
import time

import httpx

API_URL = "https://api.wavespeed.ai/api/v3/google/veo3.1-lite-image-to-video"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"}

payload = {
    "prompt": "A product slowly rotating on a white surface, smooth camera motion",
    "first_frame_image": "https://your-cdn.com/product_start.jpg",
    "last_frame_image": "https://your-cdn.com/product_end.jpg",
    "resolution": "1080p",
    "aspect_ratio": "16:9",
}

# Submit the job; the API responds immediately with a job ID, not the video.
r = httpx.post(API_URL, json=payload, headers=HEADERS, timeout=30)
r.raise_for_status()
job_id = r.json()["data"]["id"]

# Poll for completion (up to ~5 minutes at 10-second intervals).
for _ in range(30):
    time.sleep(10)
    poll = httpx.get(f"{API_URL}/{job_id}", headers=HEADERS).json()
    status = poll["data"]["status"]
    if status == "completed":
        print(poll["data"]["outputs"][0])  # URL of the generated MP4
        break
    if status == "failed":
        raise RuntimeError(poll["data"].get("error", "generation failed"))
else:
    raise TimeoutError("job did not complete within the polling window")
```

Check the WaveSpeed API docs and 302.AI endpoint reference for the exact parameter schema before production use, as field names may differ slightly by provider.


Conclusion

The Google Veo 3.1 Lite start-end frame to video API gives developers genuine structural control over generated video clips — a meaningful step beyond prompt-only generation — with native 1080p output and co-generated audio included. The main blockers for production adoption right now are the paid preview access gate, the absence of published latency SLAs, and the lack of third-party benchmark data to independently validate the quality claims; watch for those gaps to close as the model moves toward general availability.

Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).

Try this API on AtlasCloud


Frequently Asked Questions

What is the pricing for Google Veo 3.1 Lite start-end frame to video API calls?

Google has not publicly listed official per-second pricing for Veo 3.1 Lite at time of writing. Based on Google's Vertex AI pricing structure for Veo models, video generation is typically billed per second of generated video, and Veo 3.1 Lite is positioned as the cost-optimized tier relative to Veo 3.1 standard. Early unofficial estimates suggest roughly $0.35–$0.50 per second of generated video for the Lite tier, versus higher rates for the full model, but Google has not confirmed these figures. Always verify current pricing on Google's Vertex AI pricing page before budgeting.

What is the average latency for Veo 3.1 Lite start-end frame video generation API requests?

Veo 3.1 Lite video generation is an asynchronous long-running operation, not a synchronous response. Generation latency typically ranges from 60–180 seconds for a standard 5–8 second output clip, depending on resolution (720p vs. 1080p) and server load; Google has not published official p50/p95 figures. The API returns an operation ID immediately, and developers must poll the operations endpoint until the status equals DONE. Plan for polling with generous timeouts rather than a blocking request-response flow.
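For unpredictable queue times, a capped exponential backoff wastes fewer poll requests than a fixed interval. A provider-agnostic sketch — the `fetch_status` callable here is a stand-in for whatever poll endpoint your provider exposes:

```python
import time

def poll_until_done(fetch_status, max_wait=600, base_delay=5, max_delay=60):
    """Poll with capped exponential backoff until a terminal status is seen."""
    delay, waited = base_delay, 0
    while waited < max_wait:
        status = fetch_status()
        if status in ("completed", "DONE"):  # terminal success states by provider
            return status
        if status == "failed":
            raise RuntimeError("generation failed")
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # 5s, 10s, 20s, 40s, 60s, 60s, ...
    raise TimeoutError("generation did not finish in time")
```

Example usage against the WaveSpeed-style endpoint from earlier would pass `lambda: httpx.get(poll_url, headers=HEADERS).json()["data"]["status"]` as `fetch_status`.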

How does Veo 3.1 Lite compare to Veo 3 in benchmark scores for start-end frame interpolation quality?

Google has not published VBench or other quantitative benchmark scores for either model, so no verified score deltas exist for start-end frame interpolation. Google's release notes state "improved outputs when generating video from images" for 3.1 over 3, which covers the interpolation path, and early hands-on reports describe smoother motion and better frame consistency. Treat any specific numeric comparisons circulating online as unverified until Google or an independent evaluation publishes them.

What are the image format requirements and resolution limits for start and end frame inputs in the Veo 3.1 Lite API?

The Veo 3.1 Lite API accepts start and end frame images in JPEG or PNG format, base64-encoded or as Google Cloud Storage URIs (gs:// paths). Maximum input image resolution is 1920x1080 pixels; larger images must be resized before submission or the API returns a 400 INVALID_ARGUMENT error. Both frames must share identical dimensions; mismatched resolutions will cause a validation failure.
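A pre-flight check for those constraints can run client-side before spending a generation call. This sketch validates dimension pairs only (the 1920x1080 cap and the matching-dimensions rule described above); loading actual pixel data is left to your image library of choice.

```python
MAX_W, MAX_H = 1920, 1080  # documented input cap, per the answer above

def validate_frame_pair(start_dims, end_dims):
    """Raise ValueError if the start/end frame dimensions would be rejected."""
    for name, (w, h) in (("start", start_dims), ("end", end_dims)):
        if w > MAX_W or h > MAX_H:
            raise ValueError(f"{name} frame {w}x{h} exceeds {MAX_W}x{MAX_H}; resize first")
    if start_dims != end_dims:
        raise ValueError("start and end frames must share identical dimensions")

validate_frame_pair((1280, 720), (1280, 720))  # passes silently
```

Failing fast here is cheaper than waiting for a 400 INVALID_ARGUMENT response after upload.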
