Skip to content

Gemma on Cloudflare Workers AI: A Pragmatic Choice for Traditional Chinese Applications

Apr 28, 2026 1 min
TL;DR For running LLMs on Cloudflare Workers AI, gemma-3-12b-it follows Traditional Chinese instructions noticeably better than llama-3.1-8b-instruct. With Gemma 4 arriving in 2026, you get Vision, Function calling, and 256K context -- upgrade as needed.

🌏 中文版

Choosing an LLM isn’t about picking “the most powerful one” — it’s about picking “the one that works within your constraints.” nobodyclimb runs on Cloudflare Workers, and AI inference stays within the Cloudflare ecosystem — @cf/google/gemma-3-12b-it is the best option under these constraints.

2026-04 Update: Cloudflare Workers AI now offers @cf/google/gemma-4-26b-a4b-it, bringing a 256K context window, Vision, and Function calling support. See the Gemma 4 comparison section at the bottom of this post.

What Is Cloudflare Workers AI

Cloudflare Workers AI is Cloudflare’s inference service that lets you call hosted models directly from the Workers environment without managing GPU infrastructure. Billing is token-based.

Supported models span text generation, embedding, image generation, speech-to-text, and more. On the LLM side, mainstream open-source models like Llama, Mistral, Gemma, and Qwen are currently available.

// In the Workers environment, bindings work like this
const response = await env.AI.run("@cf/google/gemma-3-12b-it", {
  messages: [
    { role: "system", content: "You are an AI assistant for a Taiwan rock climbing community." },
    { role: "user", content: "What routes at Longdong are suitable for beginners?" }
  ],
  max_tokens: 1024,
  stream: true,
});

Compared to self-hosting an inference service, the benefits are clear: no GPU management, no model serving ops work, and it lives in the same environment as other Workers bindings (D1, KV, Vectorize).

Why gemma-3-12b-it, Not llama-3.1-8b-instruct

nobodyclimb initially used llama-3.1-8b-instruct. The main reasons for switching:

Traditional Chinese instruction following: Llama 3.1 8B’s Traditional Chinese output quality was inconsistent — it would occasionally mix in Simplified Chinese characters or ignore formatting instructions in the system prompt (e.g., “answers must include source links”). Gemma 3 is noticeably more reliable in this regard.

12B vs 8B: The parameter count gap is perceptible in the RAG Q&A use case. Gemma 3 12B utilizes context better — when given 5 retrieved documents, it can synthesize information more accurately instead of only using the first few.

Gemma 3’s multilingual training: Google included more comprehensive multilingual coverage in Gemma 3’s training data, with a higher proportion of Chinese (including Traditional Chinese) compared to Llama 3.1’s publicly known training configuration.

This isn’t to say Llama is bad — it’s that for the specific use case of Traditional Chinese RAG, gemma-3-12b-it is a better fit.

Basic Usage

Non-streaming (suitable for evaluation, background jobs):

const result = await env.AI.run("@cf/google/gemma-3-12b-it", {
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: userQuery }
  ],
  max_tokens: 512,
});

const answer = result.response; // string

Streaming (suitable for user interfaces):

const stream = await env.AI.run("@cf/google/gemma-3-12b-it", {
  messages: [...],
  stream: true,
});

// With Hono's streamSSE
return streamSSE(c, async (sseStream) => {
  for await (const chunk of stream) {
    if (chunk.response) {
      await sseStream.writeSSE({ data: chunk.response });
    }
  }
});

JSON output (suitable for structured tasks like judge, filter-build):

const result = await env.AI.run("@cf/google/gemma-3-12b-it", {
  messages: [
    {
      role: "system",
      content: "Reply in JSON format: { score: number, reason: string }"
    },
    { role: "user", content: `Evaluate the quality of this answer: ${answer}` }
  ],
  response_format: { type: "json_object" },
  max_tokens: 256,
});

const evaluation = JSON.parse(result.response);

Usage Scenarios in nobodyclimb

The system has multiple pipeline steps that require LLM calls:

StepTaskOutput Type
HyDEGenerate hypothetical answer documentsPlain text
multi-queryExpand query into multiple anglesJSON array
filter-buildExtract structured search conditionsJSON object
llm-generationFinal answer generationPlain text (streaming)
judgeEvaluate answer qualityJSON object
agenticDecisionDetermine if information is sufficientJSON boolean + reasoning

Same model, different prompt engineering, performing vastly different tasks. This is the strategy of having one sufficiently capable model handle everything, rather than selecting the smallest adequate model for each task — on Cloudflare Workers AI, this makes management simpler.

Limitations of Cloudflare Workers AI

Not spelling these out clearly will lead to painful pitfalls later:

CPU time limits: Workers have CPU time caps (30 seconds on the Paid plan, but AI calls don’t count toward CPU time — they count toward wall time). A pipeline with multiple LLM calls (HyDE + generation + judge) can collectively exceed Workers’ execution time limits. nobodyclimb’s solution is to make judge writes asynchronous (non-blocking to the main flow) and only enable HyDE for complex queries.

Opaque model versioning: Cloudflare manages model versions, and you can’t pin a specific checkpoint. Model behavior may change without notice. You need monitoring mechanisms to detect anomalies in output quality.

No fine-tuning: Hosted models on Workers AI currently can’t be fine-tuned. Domain adaptation relies entirely on prompt engineering and RAG.

Cold start latency: During low-traffic periods, the first call may have higher latency. Semantic caching can mitigate this (cache hits avoid LLM calls entirely).

Context window: gemma-3-12b-it on Cloudflare Workers AI has a context window of 8192 tokens according to official documentation. Be careful not to exceed the limit with long conversations or large numbers of retrieved documents.

Comparison with Other Options

gemma-3-12b-it (Workers AI)gemma-4-26b-a4b-it (Workers AI)OpenAI GPT-4o-miniSelf-hosted Ollama
Trad. Chinese qualityGoodGoodVery goodModel-dependent
Context window8K tokens256K tokens128K tokensModel-dependent
VisionNoYesYesModel-dependent
Function callingNoYesYesModel-dependent
Ops costZeroZeroZeroHigh
LatencyMediumFast (MoE active 4B)LowHardware-dependent
FlexibilityLowLowMediumHigh
Pricing structureToken-based$0.10/$0.30 per M tokensToken-basedFixed hardware cost

If Traditional Chinese quality is the top priority, the GPT-4o series is still stronger. But if you’re already in the Cloudflare ecosystem and don’t want to maintain another AI service account and API key, Workers AI is the smoothest choice.

Real-World Observations

After several months of use:

  • Traditional Chinese instruction following is more stable than Llama; formatting requirements in system prompts (citing sources, JSON output) are generally followed
  • Occasional hallucination issues are caught by the judge + self-reflection mechanism — groundedness below 0.5 triggers a retry
  • 12B inference speed isn’t fast; the first token in streaming typically takes 1-2 seconds, and a complete answer (300-500 characters) takes about 5-8 seconds
  • JSON output mode is stable; response_format: { type: "json_object" } rarely returns malformed output

Overall assessment: under the constraint of “staying within the Cloudflare ecosystem,” this is currently the best Traditional Chinese LLM option.

Gemma 4 (2026 Update)

In 2026, Cloudflare Workers AI launched @cf/google/gemma-4-26b-a4b-it. Several key upgrades are worth noting:

Architecture change: MoE
Gemma 4 adopts a Mixture-of-Experts architecture. Total parameters are 26B, but only about 4B are activated per inference (a4b = active 4 billion). Actual inference speed is faster than Gemma 3 12B while performing better on most tasks.

256K context window
Gemma 3 on Workers AI only had 8K. Gemma 4’s 256K is a massive leap, directly benefiting RAG scenarios that need to fit large volumes of documents.

Vision support
You can pass in images for visual understanding, suitable for applications that need to analyze screenshots or charts.

Function calling
Native tool calling support, more reliable than forcing JSON through prompts, ideal for agentic workflows.

// Gemma 4 usage is the same as Gemma 3, just swap the model ID
const response = await env.AI.run("@cf/google/gemma-4-26b-a4b-it", {
  messages: [
    { role: "system", content: "You are an AI assistant for a Taiwan rock climbing community." },
    { role: "user", content: "What routes at Longdong are suitable for beginners?" }
  ],
  max_tokens: 1024,
  stream: true,
});

When to upgrade to Gemma 4?

  • Need to process long documents or multiple context sources -> 256K directly solves Gemma 3’s 8K limitation
  • Need function calling for agentic tasks -> Gemma 4 has native support
  • Need image understanding -> Gemma 4 supports vision
  • Text-only RAG with limited budget -> Gemma 3 12B is still sufficient, and the pricing structure differs (Gemma 3 has no public pricing, Gemma 4 is $0.10/$0.30 per M tokens)

References