Model Releases

Wan-2.7 Image-to-Video API: Complete Developer Guide

AI API Playbook · 9 min read

Wan 2.7 is Alibaba’s latest video generation model, and if you’re currently using Wan 2.5 or 2.6 in production, there are specific reasons to evaluate an upgrade. This guide covers what changed, full technical specs, how it benchmarks against competitors, and exactly where it falls short — so you can make an informed decision before touching your pipeline.


What Changed From Wan 2.6

The headline additions in Wan 2.7 aren’t just incremental quality bumps. Three capabilities are genuinely new at the API level:

First and last frame control — You can now pin both the opening and closing frames of a generated video. Previous versions allowed only first-frame anchoring. This matters for production workflows where you need a clip to start at state A and arrive at state B with consistent motion in between — think product animations, transitions, or scene cuts where editorial control is non-negotiable.

9-Grid (3×3) Multi-Image Input — Wan 2.7 accepts a 3×3 grid of images as a single composite input, giving the model spatial reference across nine visual anchors simultaneously. This is meaningfully different from single-image-to-video; it enables workflows where you want to control how the model interpolates between multiple reference views or scenes.
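If your provider expects the nine references as a single pre-composed image, the grid can be assembled with Pillow before upload. This is an illustrative sketch: the cell size and row-major ordering are assumptions, not documented requirements, so check your provider's input spec.

```python
from PIL import Image

def make_grid(images, cell_size=(512, 512)):
    """Stitch nine PIL images into one 3x3 composite (row-major order)."""
    if len(images) != 9:
        raise ValueError("exactly nine reference images are required")
    w, h = cell_size
    grid = Image.new("RGB", (w * 3, h * 3))
    for i, img in enumerate(images):
        cell = img.resize(cell_size)  # normalize every cell to one size
        grid.paste(cell, ((i % 3) * w, (i // 3) * h))
    return grid

# Usage: make_grid([Image.open(p) for p in paths]).save("grid.jpg")
```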

Instruction-based and reference-based video editing — Via Together AI’s hosted endpoint, Wan 2.7 introduces temporal feature transfer: you can apply the style or motion pattern from a reference video to new content. This was not available in 2.6.

Specific improvements (where data is available): Kie.ai’s documentation notes two model variants within the 2.7 image-to-video family — a faster inference variant and a higher-fidelity variant — though quantified latency deltas between 2.6 and 2.7 have not been officially published at the time of writing. VBench scores for Wan 2.7 have not been independently released yet; the benchmarks below use the most recent available data for Wan 2.1 (which Alibaba published officially) and competitor scores from their respective papers.


Technical Specs

| Parameter | Wan 2.7 |
| --- | --- |
| Model type | Image-to-video, text-to-video, video editing |
| Input modes | Single image, first+last frame pair, 3×3 image grid, text prompt |
| Output resolution | Up to 1080p (1920×1080) |
| Output formats | MP4 |
| Frame rate | Up to 24 fps |
| Video duration | Up to ~6 seconds (varies by provider) |
| Context / temporal window | Multi-frame with temporal feature transfer |
| Inference mode | Asynchronous (job queue) |
| API authentication | Bearer token (provider-dependent) |
| Primary providers | Together AI, Kie.ai, WaveSpeed AI, Pixazo |
| Model variants | Standard + Pro (higher fidelity) |
| Editing capabilities | Instruction-based, reference-based |

Resolution ceiling at 1080p puts Wan 2.7 ahead of several competitors still capped at 720p, but it’s worth noting that higher resolutions increase job queue times — this is an asynchronous API, not a synchronous one. Plan your architecture accordingly.


Benchmark Comparison

No independent VBench evaluation for Wan 2.7 specifically has been published yet. The table below uses Wan 2.1’s official VBench results (Alibaba’s published paper) alongside competitor scores from their respective technical reports. Treat Wan 2.7 as likely equal to or above Wan 2.1 on these metrics based on the architectural improvements, but don’t take that as a guarantee until third-party evaluations land.

| Model | VBench Total Score | Subject Consistency | Motion Smoothness | Notes |
| --- | --- | --- | --- | --- |
| Wan 2.1 (14B) | 83.2 | 96.8 | 98.8 | Alibaba official |
| Sora (OpenAI) | ~85.x | Not publicly broken out | Not publicly broken out | OpenAI internal evals only |
| Kling 1.6 | ~82.4 | High (exact figure unpublished) | High | Kuaishou technical report |
| Runway Gen-3 Alpha | ~79.1 | 93.2 | 97.6 | Runway published |

Key takeaway: Wan 2.1 already competed at or above Runway Gen-3 on motion smoothness. Wan 2.7’s architecture changes (multi-image control, temporal feature transfer) target consistency across longer or editorially constrained clips — which VBench’s standard subject consistency metric partially captures but doesn’t fully stress.


Pricing vs. Alternatives

Pricing is per-second of generated video output unless otherwise noted. Figures sourced from provider pricing pages as of mid-2025; GPU spot pricing means these fluctuate.

| Provider / Model | Price per video second | Notes |
| --- | --- | --- |
| Kie.ai — Wan 2.7 Standard | ~$0.08–$0.12 | Tiered; lower cost for batch |
| Together AI — Wan 2.7 | Usage-based (GPU-time) | Quoted per compute-minute, not per output second |
| WaveSpeed AI — Wan 2.7 | Contact for pricing | Enterprise focus; volume discounts |
| Runway Gen-3 Alpha | ~$0.05/second (standard) | Cheaper at low volume; quality tradeoffs |
| Kling 1.6 API | ~$0.10–$0.14/second | Comparable quality tier |
| Pika 2.0 | ~$0.04–$0.07/second | Lower cost, lower resolution ceiling |

Practical cost example: A 5-second 1080p clip at $0.10/second = $0.50 per generation. At production volume (10,000 clips/month), that’s $5,000/month. Compare against Runway at $0.05/second ($2,500/month) — but Runway doesn’t offer first+last frame control or 3×3 multi-image input, so you’re not comparing identical capability sets.
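The arithmetic above generalizes to a one-line budgeting helper, useful for comparing providers before committing:

```python
def monthly_cost(seconds_per_clip: float, price_per_second: float,
                 clips_per_month: int) -> float:
    """Projected monthly spend for per-second-billed video generation."""
    return seconds_per_clip * price_per_second * clips_per_month

# Wan 2.7 (Kie.ai, top of the quoted range) vs. Runway Gen-3, 10k clips/month:
wan = monthly_cost(5, 0.10, 10_000)     # 5000.0
runway = monthly_cost(5, 0.05, 10_000)  # ~2500.0
```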


Best Use Cases

1. Product visualization with defined start and end states

E-commerce teams animating a product from a flat-lay photo to a dynamic angle benefit directly from first+last frame control. You control where the clip begins (the packshot) and ends (a lifestyle angle), and Wan 2.7 fills the motion between them. Without this control, you’d be regenerating until you got a usable end frame.

2. Multi-angle scene assembly

Feeding a 3×3 grid of reference images lets the model understand spatial relationships between angles of the same object or environment. Useful for architectural visualization or game asset animation where consistency across views matters more than creative freedom.

3. Instruction-based video editing in existing pipelines

Via Together AI’s endpoint, you can pass an existing video clip plus a text instruction (“make the lighting warmer,” “slow the motion”) and get an edited output. This cuts out the full re-generation cycle for minor adjustments — relevant if you’re building an iterative creative tool.

4. Short-form social content with editorial pinpoints

A content automation pipeline generating hundreds of short clips daily needs predictability. First+last frame anchoring reduces rejection rates on outputs that don’t hit the required start/end states, which directly affects pipeline throughput.


Limitations and When Not to Use Wan 2.7

Don’t use it if you need synchronous responses. Wan 2.7 is asynchronous. You submit a job, get a job ID, and poll for completion. If your product requires sub-5-second video delivery in a user-facing flow, this model (and most video generation models at this quality tier) will create a blocking UX problem. Build a job queue and result notification system before integrating.
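A minimal sketch of that decoupling, assuming the client object exposes the `video.generate`/`video.status` interface used in the code example later in this guide. In production this loop would run in a background worker, with `on_done` posting a webhook or queue message rather than blocking a request:

```python
import time
from typing import Callable

def submit_and_notify(client, params: dict, on_done: Callable[[str], None],
                      poll_interval: float = 5.0, timeout: float = 600.0) -> None:
    """Submit one async video job and call on_done(download_url) on completion.

    `client` is assumed to expose video.generate / video.status; adapt the
    attribute names to whatever SDK or REST wrapper you actually use.
    """
    job = client.video.generate(**params)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = client.video.status(job.id)
        if status.status == "completed":
            on_done(status.download_url)
            return
        if status.status == "failed":
            raise RuntimeError(f"job {job.id} failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job.id} did not finish within {timeout}s")
```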

Don’t use it for clips longer than ~6 seconds. Current provider implementations cap output duration around 5–6 seconds. If you need 15–30 second clips, you’ll be stitching multiple generations together — which introduces motion discontinuities at seams unless you use the last frame of clip N as the first frame of clip N+1, which Wan 2.7 does support but requires orchestration.
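A sketch of that orchestration, with the Wan 2.7 call and the frame-extraction step (e.g. via ffmpeg) passed in as caller-supplied hooks, since both are provider-specific. Seam quality still depends on prompt continuity between segments:

```python
def generate_long_video(first_image_url, prompts, generate_clip, extract_last_frame):
    """Chain ~6s segments: the last frame of clip N seeds clip N+1.

    generate_clip(image_url=..., prompt=...) -> clip URL   (your Wan 2.7 call)
    extract_last_frame(clip_url) -> image URL              (your ffmpeg step)
    Both hooks are placeholders you must supply.
    """
    clips, anchor = [], first_image_url
    for prompt in prompts:
        clip_url = generate_clip(image_url=anchor, prompt=prompt)
        clips.append(clip_url)
        anchor = extract_last_frame(clip_url)  # becomes the next first frame
    return clips
```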

3×3 grid quality degrades with low-resolution inputs. The multi-image control feature requires consistent, reasonably high-resolution source images. Mixing portrait and landscape orientations in the grid, or feeding low-contrast images, produces inconsistent motion between cells. The WaveSpeed AI documentation notes specific input prep requirements; read them before assuming any 9-image set will work.
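A cheap pre-flight check along those lines can catch the obvious failure modes before you pay for a generation. The 512px threshold here is an illustrative assumption, not a documented provider requirement:

```python
def validate_grid_inputs(sizes: list[tuple[int, int]],
                         min_side: int = 512) -> list[str]:
    """Flag common 3x3-grid input problems; sizes is a list of (w, h) pairs."""
    problems = []
    if len(sizes) != 9:
        problems.append(f"expected 9 images, got {len(sizes)}")
    orientations = {"landscape" if w >= h else "portrait" for w, h in sizes}
    if len(orientations) > 1:
        problems.append("mixed portrait and landscape orientations")
    for i, (w, h) in enumerate(sizes):
        if min(w, h) < min_side:
            problems.append(f"image {i} below {min_side}px on its short side")
    return problems
```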

No real-time or streaming output. The MP4 is delivered as a complete file on job completion. No progressive streaming, no frame-by-frame webhook delivery.

Third-party benchmark data for 2.7 specifically is sparse. At this writing, most published VBench scores are for Wan 2.1. You’re making an evaluation partly on architectural improvements and provider claims, not fully on independently verified scores. Factor that uncertainty into your go/no-go timeline.

Cost at high volume is non-trivial. At $0.10/second, 1080p generation is more expensive than Runway or Pika per output second. If resolution ceiling and multi-frame control aren’t critical to your use case, cheaper alternatives exist.


Minimal Working Code Example

This example targets the Together AI endpoint, which offers a straightforward REST API for Wan 2.7. The method names below follow the pattern of Together’s Python SDK; verify them against the current SDK reference, as the video interface may differ between SDK versions.

```python
import time

import together

client = together.Together(api_key="YOUR_API_KEY")

# Submit an asynchronous image-to-video job
response = client.video.generate(
    model="wan-ai/wan2.7-i2v-720p",
    prompt="A product rotating slowly on a white background",
    image_url="https://your-cdn.com/product-shot.jpg",
    n_seconds=5,
    fps=24,
)

# Poll the job queue until the clip is ready (or the job fails)
job_id = response.id
while True:
    status = client.video.status(job_id)
    if status.status == "completed":
        print(status.download_url)
        break
    if status.status == "failed":
        raise RuntimeError(f"Video job {job_id} failed")
    time.sleep(5)
```

Replace image_url with your hosted image. For first+last frame control, add last_image_url as a parameter — availability depends on the specific endpoint version Together AI has deployed.


Specs at a Glance (Quick Reference)

| Capability | Supported |
| --- | --- |
| Image-to-video | ✅ |
| Text-to-video | ✅ |
| First frame control | ✅ |
| Last frame control | ✅ |
| 3×3 multi-image input | ✅ |
| Instruction-based editing | ✅ (Together AI endpoint) |
| Reference-based editing | ✅ (Together AI endpoint) |
| Synchronous response | ❌ |
| Streaming output | ❌ |
| Clips > 6 seconds | ❌ (native; stitch manually) |
| 4K output | ❌ (1080p ceiling) |

Conclusion

Wan 2.7’s first+last frame control and 3×3 multi-image input are the two features that actually differentiate it at the API level — if your workflow needs either, the upgrade case from 2.6 is straightforward. If you’re after general image-to-video quality alone and don’t need those controls, wait for independent VBench scores on 2.7 before committing pipeline resources to a migration.

Note: If you’re integrating multiple AI models into one pipeline, AtlasCloud provides unified API access to 300+ models including Kling, Flux, Seedance, Claude, and GPT — one API key, no per-provider setup. New users get a 25% credit bonus on first top-up (up to $100).


Frequently Asked Questions

How much does the Wan-2.7 image-to-video API cost per video generation?

Based on available pricing data for Wan-2.7 via third-party API providers (e.g., Replicate, fal.ai), costs typically range from $0.05 to $0.09 per video clip at standard 480p/720p resolution and 5-second duration. Higher resolutions or longer clips (up to 10s) can push costs to $0.12–$0.18 per generation. Compare this to RunwayML Gen-3 Alpha at ~$0.25/clip and Kling 1.6 at ~$0.10–$0.14/clip.

What is the average generation latency for Wan-2.7 image-to-video API calls?

Wan-2.7 average end-to-end generation latency is approximately 45–90 seconds for a 5-second 720p clip on standard API tiers, depending on queue load and provider infrastructure. On dedicated or priority GPU tiers (A100/H100 clusters), latency can drop to 25–40 seconds. This is slower than Stable Video Diffusion (SVD) at ~15–20s but faster than some Wan-2.5 deployments that averaged 90–120s.

How does Wan-2.7 benchmark against Kling and RunwayML on image-to-video quality metrics?

No independent VBench evaluation of Wan-2.7 has been published yet. The most recent official figures are for Wan 2.1 (14B): 83.2 overall, with 96.8 subject consistency and 98.8 motion smoothness (Alibaba's published results). By comparison, Kuaishou's technical report puts Kling 1.6 at ~82.4 overall, and RunwayML Gen-3 Alpha scores ~79.1 overall with 97.6 motion smoothness. Wan 2.7's architectural changes target consistency in editorially constrained clips, but treat any 2.7-specific quality claims as unverified until third-party scores land.

What are the exact API parameters for first-and-last-frame control in Wan-2.7?

Wan-2.7 exposes first-and-last-frame anchoring via two dedicated input fields: `first_frame` (base64-encoded image or URL) and `last_frame` (base64-encoded image or URL), both accepting JPEG/PNG at recommended resolution 832×480 or 1280×720. Both parameters are optional — omitting `last_frame` falls back to Wan-2.6 behavior (first-frame-only anchoring). Note that exact parameter names vary by provider; Together AI's endpoint, for example, uses `image_url`/`last_image_url` instead.

Tags

Wan-2.7 · Image-to-Video · Video API · Developer Guide · 2026
