Text / Image to Lottie: A Landscape Overview of AI Animation Generation Tools

Jun 9, 2026 1 min
TL;DR From the CLI tool kin3o to the CVPR 2026 paper OmniLottie — a survey of open-source approaches for converting text and images into Lottie animations, with performance benchmarks and selection guidance.

Lottie is a JSON-based vector animation format that grew out of the Bodymovin After Effects exporter (2015) and was popularized by Airbnb's open-source players in 2017. It is now the de facto standard for lightweight UI animations on iOS, Android, and the web. Yet for years there were no good open-source tools to turn a plain text description or a single image into a ready-to-run Lottie animation.

Early 2026 brought a qualitative leap: two CVPR papers simultaneously proposed end-to-end vector animation generation frameworks, and two developer-focused tools make it possible to get started today. This post surveys the technical approaches, benchmarks, and how to choose between them.

Why Is It So Hard to Have an LLM Generate Lottie Directly?

Understanding the core difficulty starts with the format itself — Lottie is not LLM-friendly.

A typical Lottie JSON averages around 18,202 raw tokens (per benchmark data from the AnimTOON paper), and the format has several traps:

  • Implicit conventions: colors are 0–1 floats (not 0–255), time is expressed in frame numbers (not seconds), and easing uses Bézier control points — all things LLMs routinely get wrong
  • Strict nesting: multi-level JSON nesting of layers → shapes → keyframes means a single malformed node breaks everything
  • Verbose boilerplate: large amounts of fixed metadata eat into the token budget, leaving little room for the actual animation logic
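To make these conventions concrete, here is a minimal sketch that builds a small pulsing-circle animation by hand. The helper names (`hex_to_lottie_rgba`, `seconds_to_frame`, `keyframe`, `static`) are illustrative and not part of any library, and the output has not been validated against a specific player, so treat it as an approximation of the format rather than a reference file.

```python
import json

FPS = 30  # frames per second, stored as "fr" in Lottie


def hex_to_lottie_rgba(hex_color: str) -> list[float]:
    """Convert '#RRGGBB' into Lottie's 0-1 float RGBA list."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) / 255 for i in (0, 2, 4))
    return [round(r, 4), round(g, 4), round(b, 4), 1.0]


def seconds_to_frame(seconds: float, fps: int = FPS) -> int:
    """Lottie keyframe times are frame numbers, not seconds."""
    return round(seconds * fps)


def static(value):
    """A non-animated property: a=0 means 'k' holds a plain value."""
    return {"a": 0, "k": value}


def keyframe(seconds: float, value: list[float]) -> dict:
    """One keyframe with ease-in-out Bezier control points."""
    return {
        "t": seconds_to_frame(seconds),
        "s": value,
        "o": {"x": [0.42], "y": [0.0]},  # out-tangent control point
        "i": {"x": [0.58], "y": [1.0]},  # in-tangent control point
    }


animation = {
    "v": "5.7.0", "fr": FPS, "ip": 0, "op": seconds_to_frame(2.0),
    "w": 200, "h": 200,
    "layers": [{
        "ty": 4, "nm": "circle", "ip": 0, "op": seconds_to_frame(2.0), "st": 0,
        "ks": {  # layer transform; only scale is animated (a=1)
            "p": static([100, 100]), "a": static([0, 0]),
            "r": static(0), "o": static(100),
            "s": {"a": 1, "k": [
                keyframe(0.0, [100, 100]),
                keyframe(1.0, [120, 120]),
                keyframe(2.0, [100, 100]),
            ]},
        },
        "shapes": [{"ty": "gr", "it": [
            {"ty": "el", "p": static([0, 0]), "s": static([80, 80])},
            {"ty": "fl", "c": static(hex_to_lottie_rgba("#FF8000")), "o": static(100)},
            {"ty": "tr", "p": static([0, 0]), "a": static([0, 0]),
             "s": static([100, 100]), "r": static(0), "o": static(100)},
        ]}],
    }],
}

lottie_json = json.dumps(animation)
```

Even this single-shape animation needs three nested levels and a dozen fixed fields, which is the boilerplate cost described in the list above.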

The OmniLottie (CVPR 2026) paper directly quantifies this: when GPT-5 generates Lottie JSON directly, the success rate is only 9.2%; for Gemini 3.1 Pro it is 0%. This isn’t a model capability problem — it’s a format problem.

Four Technical Approaches

Approach A: LLM Prompt Engineering
  Text → Claude / GPT generates JSON → Validate + auto-fix → .lottie
  Pros: No GPU needed, usable today
  Cons: Quality bounded by LLM capability, unstable for complex animations

Approach B: End-to-End VLM Fine-tuning
  Text / Image / Video → Custom tokenizer compression → Qwen2.5-VL → .lottie
  Pros: Highest quality, rigorous academic benchmarks
  Cons: High hardware requirement (~15 GB VRAM), training code not open-sourced

Approach C: Intermediate Format + Smaller Model
  SVG + Text → AnimTOON format (166 tokens) → 3B LoRA → Converter → .lottie
  Pros: Runs on consumer GPU (~5 GB), highest token efficiency
  Cons: SVG input required — text alone is not sufficient

Approach D: Static SVG Conversion
  SVG → Corresponding Lottie layer structure → .json (no animation)
  Pros: Deterministic, 100% reliable, mature tooling
  Cons: Format conversion only — no animation is added

kin3o (Approach A)

kin3o is the most immediately usable text-to-Lottie tool available today. Rather than training a model, it wraps LLM calls with careful engineering around them:

npx @afromero/kin3o generate "pulsing circle that breathes"
npx @afromero/kin3o generate "toggle switch with on/off states" --interactive

The core design principle is “let the LLM generate, handle everything else ourselves”: a carefully crafted system prompt with few-shot examples steers Claude or Codex away from common mistakes, followed by JSON extraction, structural validation, and auto-fix (RGB 0–255 → 0–1, missing fields, malformed keyframes) before writing to disk.
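The post-processing stage can be sketched roughly as follows. This is not kin3o's actual code: the function names, the default values in `REQUIRED_DEFAULTS`, and the two validation rules are assumptions chosen only to illustrate the extract, fix, validate sequence.

```python
import json

# Hypothetical defaults for top-level fields an LLM might omit.
REQUIRED_DEFAULTS = {"v": "5.7.0", "fr": 30, "ip": 0, "w": 512, "h": 512}


def extract_json(llm_reply: str) -> dict:
    """Parse the span from the first '{' to the last '}' in the reply."""
    start, end = llm_reply.find("{"), llm_reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(llm_reply[start:end + 1])


def fix_colors(node) -> None:
    """Recursively rescale static fill/stroke colors from 0-255 to 0-1.

    Animated colors (keyframe dicts inside 'k') are left untouched.
    """
    if isinstance(node, dict):
        color = node.get("c") if node.get("ty") in ("fl", "st") else None
        if isinstance(color, dict):
            k = color.get("k")
            is_static = isinstance(k, list) and all(
                isinstance(v, (int, float)) for v in k
            )
            if is_static and any(v > 1 for v in k[:3]):
                alpha = k[3] if len(k) > 3 else 1.0
                color["k"] = [round(v / 255, 4) for v in k[:3]] + [
                    alpha / 255 if alpha > 1 else alpha
                ]
        for value in node.values():
            fix_colors(value)
    elif isinstance(node, list):
        for item in node:
            fix_colors(item)


def fill_missing_fields(anim: dict) -> list[str]:
    """Add defaults for missing top-level fields; return the names added."""
    added = [key for key in REQUIRED_DEFAULTS if key not in anim]
    for key in added:
        anim[key] = REQUIRED_DEFAULTS[key]
    return added


def validate(anim: dict) -> list[str]:
    """Return a list of structural problems (empty means it passed)."""
    errors = []
    if not isinstance(anim.get("layers"), list) or not anim["layers"]:
        errors.append("layers must be a non-empty list")
    if "op" in anim and anim["op"] <= anim.get("ip", 0):
        errors.append("op must be greater than ip")
    return errors


reply = (
    'Here is the animation: {"op": 60, "layers": [{"ty": 4, "shapes": '
    '[{"ty": "fl", "c": {"a": 0, "k": [255, 128, 0]}}]}]} Enjoy!'
)
anim = extract_json(reply)
fix_colors(anim)
added = fill_missing_fields(anim)
errors = validate(anim)
```

Keeping the fixes deterministic and separate from the LLM call means a reply that is almost right can be repaired without spending another generation.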

It supports three AI providers: Claude Code CLI (auto-detects an existing logged-in session), Codex CLI, or a direct Anthropic API key. The --interactive flag outputs a dotLottie state machine with hover/click states, and it also integrates with LottieFiles marketplace search and publishing.

When to choose kin3o: you have no GPU, need something working immediately, and want output that can be git-diffed.
Not suitable for: complex multi-layer animations, image input, or strict quality requirements.


OmniLottie (Approach B)

OmniLottie (arXiv:2603.02138, CVPR 2026) is the most academically rigorous solution available, and the first end-to-end Lottie generation framework to support text, image, and video as inputs.

The core innovation is the Lottie Tokenizer: it decomposes the nested structure of raw Lottie JSON into three categories of command tokens (shape, effect, and animation), replacing the roughly 18,202-token raw format with a much shorter command sequence drawn from a custom vocabulary of about 40k tokens. Qwen2.5-VL is then fine-tuned on this compressed representation, allowing the model to focus on learning animation semantics rather than format details.

The training dataset MMLottie-2M contains 2 million Lottie animations with text, keyframe, and video annotations, paired with the evaluation benchmark MMLottieBench (900 samples, Real and Synthetic subsets).

Results on MMLottieBench (from arXiv:2603.02138 Table 1):

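| Method | Success Rate | Avg. Generation Time | Object Alignment | Motion Alignment |
| --- | --- | --- | --- | --- |
| OmniLottie | 88.3% | ~29s | 4.44 | 5.94 |
| GPT-5 | 9.2% | ~46s | 13.34 [sic] | |
| Gemini 3.1 Pro | 0% | ~39s | | |
| Qwen2.5-VL (3B) | 0% | ~21s | | |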

On the text-image-to-Lottie task, OmniLottie reaches a 93.3% success rate and generates in 88s versus 1212s for AniClipart, roughly 14× faster by those figures.

Limitations: inference requires ~15 GB VRAM (⚠️ this figure comes from AnimTOON’s comparison table, not officially confirmed), output frame rate is only 8 fps, and training code is not yet open-sourced.


LottieGPT (Approach B — not yet available)

LottieGPT (arXiv:2604.11792, CVPR 2026) follows a similar path to OmniLottie but with a different tokenizer design: it uses keyframe-based temporal compression, encoding only keyframes and interpolation functions rather than every frame, which gives a lossless roundtrip (the decoded animation renders identically to the original).
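The idea behind keyframe-based temporal compression can be illustrated with a toy example. This is not LottieGPT's tokenizer: it handles only one scalar property with linear interpolation, and the `compress` and `decode` functions are made up for this sketch. The roundtrip is exact here only because the input is exactly piecewise linear.

```python
def compress(samples: list[float]) -> list[tuple[int, float]]:
    """Keep the first and last frames plus every frame where the slope changes."""
    keys = [(0, samples[0])]
    for f in range(1, len(samples) - 1):
        if samples[f] - samples[f - 1] != samples[f + 1] - samples[f]:
            keys.append((f, samples[f]))
    keys.append((len(samples) - 1, samples[-1]))
    return keys


def decode(keys: list[tuple[int, float]], n_frames: int) -> list[float]:
    """Rebuild every frame by linear interpolation between neighboring keyframes."""
    out = []
    seg = 0
    for f in range(n_frames):
        # Advance to the segment whose end keyframe is at or after frame f.
        while seg < len(keys) - 2 and f > keys[seg + 1][0]:
            seg += 1
        (f0, v0), (f1, v1) = keys[seg], keys[seg + 1]
        out.append(v0 + (v1 - v0) * (f - f0) / (f1 - f0))
    return out


# A rotation that ramps up for four frames, then back down.
per_frame = [0, 10, 20, 30, 40, 30, 20, 10, 0]
keyframes = compress(per_frame)
restored = decode(keyframes, len(per_frame))
```

Nine per-frame samples collapse to three keyframes, and the saving grows with animation length because only changes in motion need to be stored.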

The training dataset is larger: LottieImage-15M (15 million static images) + LottieAnimation-660K (660k animations), using a two-stage training strategy — learn statics first, then learn motion.

⚠️ Important: as of 2026-06-09, inference code and model weights have not been released (the open-source plan checkbox on GitHub remains unchecked). This one is paper-only for now.


AnimTOON (Approach C)

AnimTOON takes a completely different angle: rather than having the model generate shapes, it separates shape from animation.

The idea is that the model only needs to generate “animation keyframe descriptions”; shapes are extracted from the input SVG and combined deterministically by a Converter. This compresses model output to 166–597 tokens, versus 616–4,095 for OmniLottie in the comparison below:

# AnimTOON format example (a complete animation in just 166 tokens)
anim fr=30 dur=120

layer Logo shape
  fill #000000
  path sh x2
  pos [0.5,0.5]
  rot 0.0->-67 0.04->46 0.14->-31 0.28->0 ease=bounce
  scale 0.0->[0,0] 0.14->[90,90] 0.28->[100,100] ease=smooth
  opacity 0.0->0 0.14->100 ease=fade
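A converter for this format has to turn each animated line into keyframes. The sketch below parses only what is visible in the example above. The grammar is inferred from that single sample, and it assumes that the number before each `->` is a fraction of `dur` (in frames); neither point is confirmed by the AnimTOON project, and `parse_tracks` is a hypothetical name.

```python
import json

SAMPLE = """\
anim fr=30 dur=120

layer Logo shape
  fill #000000
  path sh x2
  pos [0.5,0.5]
  rot 0.0->-67 0.04->46 0.14->-31 0.28->0 ease=bounce
  scale 0.0->[0,0] 0.14->[90,90] 0.28->[100,100] ease=smooth
  opacity 0.0->0 0.14->100 ease=fade
"""


def parse_tracks(text: str) -> tuple[dict, dict]:
    """Return (header, tracks); tracks maps property name to ease + keyframes."""
    header: dict = {}
    tracks: dict = {}
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue
        if parts[0] == "anim":
            # Header fields look like key=value with integer values.
            header = {k: int(v) for k, v in (p.split("=") for p in parts[1:])}
            continue
        steps = [p for p in parts[1:] if "->" in p]
        if not steps:
            continue  # static lines such as fill, path, pos, layer
        ease = next(
            (p.split("=", 1)[1] for p in parts[1:] if p.startswith("ease=")), None
        )
        keyframes = []
        for step in steps:
            t, value = step.split("->", 1)
            # Assumption: t is a fraction of the total duration in frames.
            frame = round(float(t) * header["dur"], 3)
            keyframes.append((frame, json.loads(value)))
        tracks[parts[0]] = {"ease": ease, "keyframes": keyframes}
    return header, tracks


header, tracks = parse_tracks(SAMPLE)
```

Because the model's output is this small and regular, a malformed line can be rejected or repaired by the parser before any Lottie JSON is written.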

Quantitative comparison with OmniLottie (from AnimTOON README benchmark):

| Metric | AnimTOON | OmniLottie |
| --- | --- | --- |
| Output tokens (simple) | 166 | 616 |
| Output tokens (complex) | 597 | 4,095 |
| Generation time | 13–38s | 55–120s+ |
| Frame rate | 30 fps | 8 fps |
| Inference VRAM | ~5 GB | ~15.2 GB |
| Custom tokenizer | No (plain text) | Yes (40k tokens) |
| Format success rate | 100% (converter guaranteed) | 88.3% |

Limitations: SVG input is required (cannot generate shapes from text alone), training is not yet complete (~60%), and the community is very small (Stars: 5). Best suited for scenarios where you already have SVG artwork and want animation added automatically.


Static SVG → Lottie Conversion Tools

If the requirement is format conversion rather than animation generation, these tools are more reliable:

  • python-lottie: the Swiss Army knife of format conversion, supporting bidirectional conversion between SVG, GIF, Synfig, Telegram TGS, WebP, dotLottie, and more — the most mature non-AI option
  • stepancar/svg-to-lottie: JavaScript implementation, CLI + browser API, supports basic shapes like rect, circle, and path
  • marciogranzotto/lottie-tools: web-based editor, allows SVG import followed by manual keyframe animation, then export to Lottie JSON

These tools produce static Lottie output (shapes are preserved; animation must be added manually) and are appropriate when deterministic results are required and AI output instability is unacceptable.
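The deterministic nature of this conversion is easy to see in a small sketch. The code below is not taken from any of the tools listed; `svg_to_lottie_shapes` is a hypothetical function that maps only `rect` and `circle` elements to Lottie shape items and ignores fills, transforms, paths, and CSS units.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def static(value):
    """A non-animated Lottie property."""
    return {"a": 0, "k": value}


def svg_to_lottie_shapes(svg_text: str) -> list[dict]:
    """Map SVG rect and circle elements to static Lottie shape items."""
    shapes = []
    for el in ET.fromstring(svg_text).iter():
        tag = el.tag.replace(SVG_NS, "")
        if tag == "rect":
            x, y, w, h = (float(el.get(k, 0)) for k in ("x", "y", "width", "height"))
            shapes.append({
                "ty": "rc",
                # SVG positions a rect by its top-left corner; Lottie uses the center.
                "p": static([x + w / 2, y + h / 2]),
                "s": static([w, h]),
                "r": static(float(el.get("rx", 0))),
            })
        elif tag == "circle":
            cx, cy, r = (float(el.get(k, 0)) for k in ("cx", "cy", "r"))
            # Lottie ellipses store a width/height size, not a radius.
            shapes.append({"ty": "el", "p": static([cx, cy]), "s": static([2 * r, 2 * r])})
    return shapes


svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<rect x="10" y="20" width="40" height="20"/>'
    '<circle cx="50" cy="50" r="5"/>'
    "</svg>"
)
shapes = svg_to_lottie_shapes(svg)
```

The same input always yields the same output, which is why these converters suit pipelines where AI output variance is unacceptable.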


Comparison with Commercial Solutions

| | LottieFiles Motion Copilot | Rive | Lottielab |
| --- | --- | --- | --- |
| Input | Text (within editor) | Visual editing | Visual editing |
| Skeletal animation | No | ✅ with mesh deformation | No |
| Runtime interactivity | Limited | ✅ (mouse, sliders) | Limited |
| Open source | No | Player is open source | No |
| LLM-generatable format | ✅ (JSON) | ❌ (binary .riv) | ✅ (JSON) |

Rive is far ahead on skeletal animation and runtime interactivity, but its .riv format is binary, so LLMs cannot generate it directly. This is where the Lottie open-source ecosystem has an inherent advantage: its JSON is plain text that an LLM can emit, even if, as discussed above, it needs tooling around it to get the details right.

Commercial AI features (Motion Copilot, etc.) integrate designer workflows (editor + asset library + collaboration), which open-source tools currently lack. But the quality gap in pure generation is closing quickly.


How to Choose

| Scenario | Recommended Tool |
| --- | --- |
| Need something working now, no GPU available | kin3o |
| Image or video input required | OmniLottie |
| Already have SVG, want animation added, limited GPU memory | AnimTOON (once training completes) |
| Research / want to fine-tune on MMLottie-2M | OmniLottie or LottieGPT (the latter, once code is released) |
| Format conversion only, no animation needed | python-lottie |
| Skeletal animation / runtime interactivity | Rive (outside the open-source Lottie ecosystem) |

The open-source tooling in this space only matured in the first half of 2026 — OmniLottie and LottieGPT were both accepted at CVPR this year, and AnimTOON is still in training. For UI animation work today, kin3o is the lowest-friction entry point. If you need higher-quality image-driven generation, OmniLottie is already usable.

References