OpenClaw Model Advanced: Failover, Prompt Caching, and Token Billing

Mar 28, 2026 1 min
TL;DR: OpenClaw has built-in two-stage fault tolerance (Auth Profile rotation, then Model Fallback), plus Prompt Caching for cost savings and Token tracking for cost visibility.

Once you have chosen your provider and model, the next step is making it stable and cost-effective. This post covers three advanced model features in OpenClaw: failover switching, Prompt Caching, and Token tracking with cost control.

Model Failover

OpenClaw’s failover works in two stages:

Stage 1: Round-robin Auth Profile rotation within the same provider
         ↓ All in cooldown
Stage 2: Switch to fallback model

Auth Profile Rotation

You can configure multiple API Keys for the same provider. Auth profiles are stored in ~/.openclaw/agents/<agentId>/agent/auth-profiles.json.

Selection priority:

  1. Explicit order specified in auth.order[provider]
  2. Profiles from the config file (filtered by provider)
  3. Stored profiles from auth-profiles.json

Default sorting: OAuth takes priority over API Key; within the same type, “oldest first” (based on usage statistics).
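
The default sorting rule can be sketched in a few lines of Python. This is only an illustration, not OpenClaw's implementation: the AuthProfile class, its field names, and the reading of "oldest first" as "least recently used first" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AuthProfile:
    # Hypothetical shape; the real auth-profiles.json schema may differ.
    profile_id: str
    kind: str         # "oauth" or "api_key"
    last_used: float  # assumed usage statistic: epoch seconds of last use


def default_order(profiles):
    """Sort OAuth before API Key, then oldest last-used timestamp first."""
    type_rank = {"oauth": 0, "api_key": 1}
    return sorted(profiles, key=lambda p: (type_rank[p.kind], p.last_used))


profiles = [
    AuthProfile("key-b", "api_key", 100.0),
    AuthProfile("oauth-new", "oauth", 300.0),
    AuthProfile("key-a", "api_key", 50.0),
    AuthProfile("oauth-old", "oauth", 200.0),
]
ordered_ids = [p.profile_id for p in default_order(profiles)]
print(ordered_ids)
```

An explicit auth.order[provider] list, when present, would take precedence over this default ordering.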

Session Stickiness

Once a profile is selected, it stays fixed for the entire session, which keeps the provider-side cache effective. OpenClaw switches profiles only on session reset, on compaction completion, or when the current profile enters cooldown.

Cooldown Escalation

| Error Type | Cooldown Duration |
| --- | --- |
| General failure | 1 min → 5 min → 25 min → 1 hr (cap) |
| Billing/quota failure | 5 hr → 10 hr → 20 hr → 24 hr (cap) |

Billing errors carry heavier penalties because retrying in the short term won’t help.
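
The two escalation schedules fit a simple pattern: general failures multiply by 5 up to a 1 hr cap, and billing failures double up to a 24 hr cap. The sketch below reproduces the listed values under that reading. The closed-form formulas and the cooldown function name are inferred from the schedule for illustration, not taken from OpenClaw's source.

```python
from datetime import timedelta


def cooldown(consecutive_failures, billing=False):
    """Return the cooldown for the Nth consecutive failure (N >= 1)."""
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be at least 1")
    step = consecutive_failures - 1
    if billing:
        # 5 hr, 10 hr, 20 hr, then capped at 24 hr
        return timedelta(hours=min(5 * 2 ** step, 24))
    # 1 min, 5 min, 25 min, then capped at 60 min
    return timedelta(minutes=min(5 ** step, 60))


general = [cooldown(n) for n in range(1, 5)]
billing = [cooldown(n, billing=True) for n in range(1, 5)]
print(general)
print(billing)
```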

Model Fallback

Once every profile for the current provider is in cooldown, OpenClaw switches to the next model in agents.defaults.model.fallbacks:

{
  agents: {
    defaults: {
      model: {
        primary: "anthropic/claude-opus-4-6",
        fallbacks: ["openai/gpt-5.4", "google/gemini-3.1-pro-preview"]
      }
    }
  }
}
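
To make the two stages concrete, here is a minimal Python sketch of the selection logic: try each profile of the current model's provider, and move to the next model only when all of them are cooling down. The function name pick_target and the data shapes (a dict of profile ids per provider, a dict of cooldown expiry timestamps) are hypothetical. Session stickiness and thinking degradation are deliberately left out.

```python
import time


def pick_target(models, profiles_by_provider, cooldown_until, now=None):
    """Return (model, profile_id) for the first usable pair, or None.

    models: primary model first, then fallbacks in order.
    profiles_by_provider: provider name -> ordered list of profile ids.
    cooldown_until: profile id -> epoch seconds when its cooldown ends.
    """
    now = time.time() if now is None else now
    for model in models:
        provider = model.split("/", 1)[0]
        # Stage 1: rotate through this provider's profiles.
        for profile_id in profiles_by_provider.get(provider, []):
            if cooldown_until.get(profile_id, 0) <= now:
                return model, profile_id
        # Stage 2: every profile is in cooldown, so try the next model.
    return None


models = ["anthropic/claude-opus-4-6", "openai/gpt-5.4"]
profiles = {"anthropic": ["a1", "a2"], "openai": ["o1"]}
print(pick_target(models, profiles, {"a1": 100, "a2": 100}, now=10))
```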

Thinking Degradation

When an extended thinking call fails, the request automatically degrades to normal (non-thinking) mode without interrupting the conversation. This is a provider-level fallback, independent of model fallback.

Prompt Caching

Model providers can reuse the unchanged prompt prefix in a conversation, saving both tokens and latency. The first request pays the cacheWrite cost, and subsequent matching requests benefit from a cacheRead discount.

Configuration

{
  agents: {
    defaults: {
      models: {
        "anthropic/claude-opus-4-6": {
          params: {
            cacheRetention: "short"  // none | short (5 min) | long (1 hr)
          }
        }
      }
    }
  }
}

Anthropic API Keys default to short. The legacy setting cacheControlTtl: "5m" automatically maps to short.

Provider Support

| Provider | Support | Notes |
| --- | --- | --- |
| Anthropic (direct API) | ✅ Full | Unconfigured Anthropic models default to short |
| Amazon Bedrock | | Only for Anthropic Claude models; others forced to none |
| OpenRouter (Anthropic) | | Automatically injects cache control for system/developer blocks |
| Other providers | | Settings have no effect |

Context Pruning + Cache TTL

Cache-TTL pruning prevents a large context from being re-cached after an idle period:

{
  contextPruning: {
    mode: "cache-ttl",
    ttl: "1h"
  }
}

Heartbeat Keep-Alive

Periodic heartbeat messages keep the cache window active, avoiding full re-cache after idle periods:

{
  heartbeat: {
    every: "55m"  // Set within the cache TTL, e.g., 55min heartbeat for 1hr TTL
  }
}
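
A quick sanity check for a heartbeat setting is to confirm that the interval is shorter than the cache TTL it is meant to keep alive. The Python sketch below does that for simple duration strings such as "55m" and "1h". The parser is an assumption about the string format written for this example, not OpenClaw's own duration handling.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Parse strings like '55m' or '1h' into seconds (assumed format)."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def heartbeat_keeps_cache_warm(every, ttl):
    """True when the heartbeat fires before the cache TTL runs out."""
    return parse_duration(every) < parse_duration(ttl)


print(heartbeat_keeps_cache_warm("55m", "1h"))
print(heartbeat_keeps_cache_warm("55m", "5m"))
```

The second call shows why a 55 min heartbeat pairs with long (1 hr) retention: with short (5 min) retention the cache would already have expired before the heartbeat fires.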

Mixed traffic: Use long + heartbeat for primary agents; use none for notification-type agents.

Cost-oriented: Use short as baseline, enable cache-TTL pruning, and add heartbeat only where needed.

Diagnostics

# Enable cache trace
OPENCLAW_CACHE_TRACE=1 openclaw gateway

This logs cache hits, writes, and token savings in JSONL format.

Token Consumption Tracking

What Consumes Tokens

Everything sent to the model counts toward context:

  • System prompt (dynamically assembled: tool descriptions, skills, bootstrap files, safety guidelines…)
  • Conversation history
  • Tool calls and results
  • Attachments (images, audio, files)
  • Compaction summaries
  • Provider wrappers and safety headers

Bootstrap files (AGENTS.md, SOUL.md, etc.) have a per-file limit of 20,000 characters and a total limit of 150,000 characters.
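
To see how close a set of bootstrap files is to these limits, a small checker like the following can help. It is a sketch under stated assumptions: the function name and return shape are made up for the example, sizes are counted with Python's len() on the raw text, and how OpenClaw handles a file that goes over a limit is not modeled.

```python
PER_FILE_LIMIT = 20_000   # characters per bootstrap file
TOTAL_LIMIT = 150_000     # characters across all bootstrap files


def check_bootstrap_budget(files):
    """files maps file name -> text. Returns (oversized, total, over_total)."""
    oversized = [name for name, text in files.items()
                 if len(text) > PER_FILE_LIMIT]
    total = sum(len(text) for text in files.values())
    return oversized, total, total > TOTAL_LIMIT


files = {"AGENTS.md": "a" * 25_000, "SOUL.md": "b" * 1_000}
oversized, total, over_total = check_bootstrap_budget(files)
print(oversized, total, over_total)
```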

Images are resized before API calls; imageMaxDimensionPx defaults to 1200.
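
The effect of imageMaxDimensionPx can be estimated with a small helper. This sketch assumes the limit applies to the longer side and that the aspect ratio is preserved; both are assumptions for illustration, and the actual resizing logic may differ.

```python
def fit_within(width, height, max_dimension_px=1200):
    """Scale (width, height) down so the longer side is at most the limit."""
    longest = max(width, height)
    if longest <= max_dimension_px:
        return width, height  # already small enough, leave unchanged
    scale = max_dimension_px / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(2400, 1200))
print(fit_within(800, 600))
```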

Monitoring Methods

In chat:

  • /status — Session model, context usage, token count, estimated cost (API key only)
  • /usage full — Cost footer appended to each reply
  • /usage tokens — Shows only token count (no amounts for OAuth)
  • /usage cost — Aggregated cost from session log

CLI:

  • openclaw status --usage — Usage breakdown per provider

Cost Calculation

Cost is calculated from the model pricing configuration (USD per 1M tokens), which has four rate types: input, output, cache read, and cache write. Sessions authenticated via OAuth do not display amounts.
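
The arithmetic behind a four-rate cost estimate is straightforward, as the Python sketch below shows. The Pricing class, its field names, and the rates used here are placeholders invented for the example. They are not real prices for any model and not OpenClaw's configuration schema.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    # All rates are USD per 1M tokens (placeholder values only).
    input_rate: float
    output_rate: float
    cache_read_rate: float
    cache_write_rate: float


def estimate_cost(pricing, input_tokens=0, output_tokens=0,
                  cache_read_tokens=0, cache_write_tokens=0):
    """Return the estimated cost in USD for one request."""
    total = (input_tokens * pricing.input_rate
             + output_tokens * pricing.output_rate
             + cache_read_tokens * pricing.cache_read_rate
             + cache_write_tokens * pricing.cache_write_rate)
    return total / 1_000_000


rates = Pricing(input_rate=10.0, output_rate=30.0,
                cache_read_rate=1.0, cache_write_rate=12.5)

# First request writes a 100k-token prefix to the cache.
first = estimate_cost(rates, cache_write_tokens=100_000)
# A later matching request reads the same prefix from the cache.
later = estimate_cost(rates, cache_read_tokens=100_000)
print(first, later)
```

With these placeholder rates the cached read of the prefix costs far less than the initial write, which is the saving Prompt Caching is meant to deliver.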

10 Features That Incur API Costs

It’s not just chatting that costs money:

| # | Feature | Description |
| --- | --- | --- |
| 1 | Core model responses | Primary cost source |
| 2 | Media understanding | Audio/image/video processing |
| 3 | Memory embeddings | Semantic search (OpenAI/Gemini/Voyage/Mistral/Ollama) |
| 4 | Web search | Brave/Gemini/Grok/Kimi/Perplexity |
| 5 | Web fetch | Firecrawl (optional) |
| 6 | Provider status queries | /status querying usage endpoints |
| 7 | Compaction | Automatic session summarization |
| 8 | Model scanning | OpenRouter capability probing |
| 9 | Speech/Talk | ElevenLabs TTS |
| 10 | Skills | Third-party APIs from custom integrations |

Usage Tracking

OpenClaw directly queries provider usage endpoints, displaying actual consumption rather than estimates.

Supported providers: Anthropic, GitHub Copilot, Gemini CLI, OpenAI Codex, MiniMax, z.ai, and others. Usage is displayed only when the corresponding OAuth or API Key credentials are configured.

Ways to Save Tokens

  • /compact — Compress history from long conversations
  • Reduce tool output size
  • Lower imageMaxDimensionPx (for screenshot-heavy scenarios)
  • Streamline skill descriptions
  • Use smaller models for exploratory tasks

Overall

Three levels of model management:

  1. Stability: Auth rotation + Model Fallback + Thinking degradation → Ensure no interruptions
  2. Cost savings: Prompt Caching + Context Pruning + Heartbeat → Reduce duplicate tokens
  3. Visibility: Usage Tracking + /status + Cache Trace → Know where the money goes

Set up fallbacks and caching properly, and OpenClaw will handle most situations on its own.

References

This post is compiled from the following OpenClaw source documents: