Skip to content

2026 LLM Inference Provider Free Tiers & Pricing: 40+ Services Ranked by Tier

May 9, 2026 1 min
TL;DR For side projects, toy demos, and RAG prototypes, nobody wants to swipe a credit card on day one. This is a verified roundup of 40+ LLM inference providers still operating as of 2026/05, tiered by whether free resources auto-replenish or are one-time grants. Each entry notes credit-card requirements, supported models, paid starting prices, and catches. Chinese-origin providers including Zhipu GLM (permanently free), Doubao (2M tokens/day), Kimi, DashScope, and the Ollama local option are all included.

🌏 中文版

For side projects, toy demos, and RAG prototypes, nobody wants to swipe a credit card on day one. The problem is there are too many LLM inference providers, pricing pages change too fast, formerly free options may have been killed, and formerly paid ones may have gone permanently free. This article tiers 40+ options still operating as of 2026/05 by the nature of their free resources, noting credit-card requirements, key supported models, paid starting prices, and the catch for each free tier.

All numbers below are cross-referenced directly from official pricing pages. Where something couldn’t be verified, it’s explicitly marked “unverified” — nothing is fabricated to fill the table.

How the Three Tiers Are Defined

The key distinction is whether free resources auto-replenish or are one-time / hard-capped:

  • Tier 1: Daily/per-minute auto-resetting quotas, generous enough for daily development (thousands to tens of thousands of requests per day), on the provider’s own inference infrastructure. Use as your primary API.
  • Tier 2: Small monthly credits, one-time signup credits, or strict rate limits. Fine for experimenting, trying models, or as a fallback; will hit walls as a daily driver.
  • Tier 3: Paid only, no ongoing free tier. Focus on per-token pricing.

Two additional standalone sections cover: Completely Free (no SLA, experimental) and Chinese-origin providers (verified free tiers).

Tier 1: Auto-Replenishing Daily Quotas

Cerebras Inference

Wafer-scale LPU, 1000-3000 tps speed, tied with Groq as the “fastest + most generous free tier” in Tier 1.

  • Free quota: 30 RPM, 900 RPH, 14,400 RPD, 60K TPM, 1M TPH, 1M TPD per model (GLM-4.7 is tighter: 10 RPM, 100 RPD)
  • No credit card required — sign up and get an API key
  • Popular models: gpt-oss-120b, Qwen3-235B-Instruct, Llama 3.1 8B, ZAI GLM-4.7
  • Paid starting price (Developer tier, requires $10 deposit): Llama 3.1 8B $0.10/$0.10, gpt-oss-120b $0.35/$0.75, Qwen3-235B $0.60/$1.20, GLM 4.7 $2.25/$2.75
  • Highlight: All major models get 14.4K RPD — the most consistently generous free quotas
  • Catch: Llama 3.1 8B and Qwen3-235B-Instruct will be deprecated on 2026-05-27

Groq

LPU at 500-1000 tps, broadest open model lineup, most models on the free tier (including speech, moderation, and agentic).

  • Free quota (varies by model, numbers taken directly from console.groq.com/docs/rate-limits):
    • llama-3.1-8b-instant: 30 RPM / 14.4K RPD / 6K TPM / 500K TPD
    • llama-3.3-70b-versatile: 30 RPM / 1K RPD / 12K TPM / 100K TPD
    • meta-llama/llama-4-scout-17b: 30 RPM / 1K RPD / 30K TPM / 500K TPD
    • openai/gpt-oss-120b / gpt-oss-20b: 30 RPM / 1K RPD / 8K TPM / 200K TPD
    • qwen/qwen3-32b: 60 RPM / 1K RPD / 6K TPM / 500K TPD
    • Plus Whisper, Llama Guard, Compound (agentic), and more
  • No credit card required
  • Paid starting price: Llama 3.3 70B $0.59/$0.79, gpt-oss-20b $0.075/$0.30, gpt-oss-120b $0.15/$0.60, cached input 50% off
  • Highlight: Broadest model lineup (including speech, moderation, agentic); Llama 3.1 8B at 14.4K RPD matches Cerebras
  • Catch: Heavy models only get 1K RPD — you’ll bottleneck on volume (upgrade to Developer to unlock more)

Cloudflare Workers AI

Broadest model catalog, included with the Workers Free plan.

  • Free quota: 10,000 Neurons/day (available on both Free and Paid accounts)
  • No credit card required
  • Popular models: Llama 3.3 70B, gpt-oss-20b/120b, Qwen3-30B, DeepSeek-R1-distill, Kimi K2.6, GLM-4.7-flash, Gemma 3
  • Paid starting price: $0.011 / 1,000 Neurons; Llama 3.3 70B fp8-fast $0.293/$2.253, gpt-oss-120b $0.35/$0.75, gpt-oss-20b $0.20/$0.30
  • Catch: Neurons conversion means daily free volume is small (Llama 3.3 70B roughly 37K input + 5K output tokens) — heavy models burn through it fast

Google AI Studio (Gemini API)

Gemini 3 series official pipeline, 1.2M context window included.

  • Free quota: Free tier is entirely free, no credit card required (specific RPM/RPD shown dynamically in the AI Studio UI; official public pages don’t list exact numbers)
  • Popular models: Gemini 3 Pro Preview (actual model ID: gemini-3.1-pro-preview), Gemini 3 Flash Preview, Gemini 2.5 Pro/Flash/Flash-Lite
  • Paid starting price: Gemini 2.5 Flash-Lite $0.10/$0.40, 2.5 Flash $0.30/$2.50, Gemini 3 Flash Preview $0.50/$3.00, Gemini 3 Pro Preview $2/$12 (<=200K context)
  • Catch: Free tier prompts and outputs are used for model training (officially documented) — production projects should upgrade to Tier 1 (requires credit card) to disable this

Tier 2: Small Monthly / One-Time / Strict Rate Limits

(a) Small Monthly Credits (gone once used up that month)

Hugging Face Inference Providers

  • Free $0.10/month, PRO $2/month, Team / Enterprise $2/seat/month
  • No credit card required (uses monthly credits); zero markup, routes to Cerebras / Groq / Together / Fireworks / SambaNova / Hyperbolic behind the scenes
  • Catch: Free $0.10 is minuscule; PRO is where it starts being usable

Vercel AI Gateway

  • $5/month credits (clock starts on first request)
  • Standard provider pricing, BYOK also zero markup
  • Catch: Once $5 is used up, you need to top up

Modal

  • Starter $30/month permanent free credits, including 100 containers + 10 GPU concurrency
  • No credit card required
  • Highlight: Serverless GPU to run your own vLLM/SGLang, billed per second (H100 ~$3.95/hr)
  • Catch: You deploy models yourself — this isn’t a ready-made token API

(b) One-Time Signup Credits

SambaNova Cloud

  • Sign up for $5 credits (valid 30 days); after credits expire, the Free tier persists (doesn’t disappear)
  • Free tier (no credit card required): DeepSeek-V3.1, Llama 3.3 70B, gpt-oss-120b each at 20 RPM / 20 RPD / 200K TPD
  • RDU chip, speed on par with Groq / Cerebras
  • Paid: Llama 3.3 70B $0.60/$1.20, gpt-oss-120b $0.22/$0.59, DeepSeek-V3.1 $0.15/$0.75
  • Catch: Developer tier (requires credit card) to unlock 60 RPM / 12K RPD

Inference.net

  • $25 one-time free credits
  • Claims 90% cheaper than OpenAI
  • Key models: Nemotron 3 Super $2.50/$5, Schematron series (specialized small models), Gemma 3
  • Catch: Model selection skews research-oriented

AI21 Jamba

  • $10 / 7-day trial, no credit card required
  • Jamba Mini $0.2/$0.4, Jamba Large $2/$8
  • Highlight: Jamba long context, Mamba architecture
  • Catch: Trial expires after 7 days

Baseten

  • New workspace gets $30 one-time free credits (per official changelog)
  • Basic plan $0/month, pay-as-you-go; DeepSeek V4 $1.74/$3.48, gpt-oss-120B $0.10/$0.50, Kimi K2.6 $1.00/$3.90
  • Highlight: Supports both Model API (token-based billing) and Dedicated GPU Deployment (per-minute billing, starting at T4 $0.01052/min)
  • Catch: Need to top up after $30 is spent; rate limits are low (Basic unverified at 15 RPM / 100K TPM)

(c) Strict Rate Limits (no large token quotas)

OpenRouter

  • :free models at 20 RPM; cumulative purchases <$10 -> 50 RPD; purchases >=$10 -> 1000 RPD
  • No credit card required for free models (DeepSeek-V3, Llama 3.3 70B, Qwen3, etc.)
  • Paid requests forwarded at provider cost, zero markup
  • Catch: :free models have worse context and throughput, may fall back, and prompts may be collected by providers

GitHub Models

  • Copilot Free/Pro: Low models 15 RPM / 150 RPD; High models 10 RPM / 50 RPD; Embedding 15 RPM / 150 RPD; most limited to 8K input / 4K output
  • The only legitimate free channel to try GPT-5 / o3 (also includes o4-mini, Llama, Phi, Mistral, DeepSeek-R1, Grok-3)
  • Catch: Quotas are very tight — only enough to dip your toes

Cohere Trial Key

  • 1,000 calls/month; Chat 20 RPM, Embed 2,000 inputs/min, Rerank 10 RPM
  • No credit card required; Command A, Embed, Rerank are well-suited for RAG
  • Catch: 1,000 calls/month runs out fast

(d) Quota Unclear but Confirmed Free Dev Tier

NVIDIA NIM (build.nvidia.com)

  • Sign up for 1,000 inference credits; providing a business email can unlock an additional 4,000 (5,000 total), along with a 90-day NVIDIA AI Enterprise free trial
  • Credits don’t expire; 40 RPM (can request increase to 200 RPM)
  • Broadest model lineup: Nemotron-3 Super 120B, DeepSeek V4, Llama 3.3 70B, Kimi K2, Qwen3.5 122B, gpt-oss, Gemma 4, GLM-5.1
  • Highlight: Official NVIDIA-optimized; enterprise version requires DGX Cloud entitlement
  • Catch: Credits are for development / prototyping, not production use

Nebius Token Factory (the company that acquired Tavily)

  • New accounts get $1 trial credit (valid 30 days); credit card required to complete onboarding
  • Models: gpt-oss-120B, Kimi-K2, Hermes-4-405B, GLM-4.5, Qwen3-Coder-480B, DeepSeek-R1-0528
  • Highlight: Sub-second latency, SOC2/HIPAA, US/EU regions
  • Catch: $1 is tiny — basically enough for one or two requests

Completely Free (No SLA, Experimental)

Pollinations.ai

  • Completely free, pollen auto-replenishes (Seed 0.15 pollen/hr, Flower 0.4 pollen/hr)
  • OpenAI-compatible API, no credit card required
  • Key models: Gemma 4 26B, Seedance 2.0 video, text embedding
  • Suitable for prototypes, not for SLA requirements

AI Horde

  • Completely free + anonymous access (API key 0000000000 works directly)
  • Community volunteer GPUs, ~441 tokens/sec, NLnet/NGI0 funded
  • Highlight: Contribute GPU to earn kudos for priority
  • Catch: Speed depends on current volunteer count, model availability fluctuates, absolutely never use in production

Ollama (Local Inference)

Local model runner — install it on your machine and run open-source LLMs; also offers a cloud tier for models too large for consumer hardware.

  • Local inference: Completely free and unlimited, runs on your own GPU/CPU, supports offline use
  • Cloud free tier: 1 concurrent model, limited GPU time (per session every 5 hours, weekly auto-reset every 7 days)
  • No credit card required (both local and cloud free tier)
  • Paid Pro $20/month ($200/year): 3 concurrent cloud models, 50x more cloud usage, private model uploads
  • Paid Max $100/month: 10 concurrent cloud models
  • Model library: Qwen3.5, Gemma 4, DeepSeek V4, Kimi K2.6, GLM-5.1, Mistral Medium 3.5, Llama series, and hundreds more
  • OpenAI-compatible REST API: Just change the base URL for a seamless switch from OpenAI; supports tool calling
  • Cloud-only models (too large for local): DeepSeek V4 Pro 684B MoE, Kimi K2.6, and other massive MoE models
  • Privacy: Neither local nor cloud prompts/responses are logged or used for training; cloud runs on NVIDIA Cloud (US/EU/Singapore), zero data retention
  • Catch: Cloud tier is limited by GPU time rather than token count — high concurrency requires a paid plan; only runs open models, no GPT / Claude

Chinese-Origin Providers (Verified Free Tiers)

Chinese-origin providers generally offer ongoing free tiers or aggressive promotions, but their pricing pages are notoriously hostile to scraping from outside China. Below are the ones where specific numbers were directly verified this round:

iFlytek Spark Lite (Xunfei)

  • Spark Lite model permanently free and unlimited
  • Individual verification grants 200K tokens; enterprise gets 1M tokens
  • Paid: Spark X2 CNY 2-3/M, X2 Flash CNY 1-2/M, Ultra CNY 0.8/M, Pro CNY 5/M
  • The most generous free tier among Chinese-origin providers; requires identity verification

Tencent Hunyuan (Tencent)

  • First activation grants 1M tokens valid for one year (shared across Hunyuan 2.0 Think/Instruct/T1/TurboS/a13b/Vision/embedding)
  • Hunyuan-lite completely free
  • Paid: HY 2.0 Think CNY 3.975/CNY 15.9 per M, Hunyuan-T1 CNY 1/CNY 4
  • Transparent and genuine free tier from a major tech company

Baidu Qianfan

  • Sign up for a CNY 20 voucher (platform-wide, no minimum spend, valid for 1 month)
  • Qwen3.5-2B inference free and unlimited; Qwen-Image-2512 temporarily free
  • Comprehensive model marketplace: DeepSeek-V4, ERNIE 5.0, ERNIE 4.5 Turbo, Kimi-K2.5, MiniMax-M2.1, Qwen3-VL-32B, GLM 5.1
  • Requires identity verification

Zhipu GLM (Zhipu AI)

Multiple Flash models are permanently free, making this one of the most generous free tiers among Chinese-origin providers.

  • Permanently free models: GLM-4-Flash (128K), GLM-4.7-Flash (200K), GLM-4.5-Flash, GLM-4V-Flash (multimodal vision), and more — no token cap, 30 concurrent limit
  • No credit card required; requires identity verification
  • New user bonus: 20M tokens (GLM-4.5-Air equivalent, market value CNY 58)
  • Paid pricing (CNY per million tokens): GLM-5.1 CNY 6/CNY 24, GLM-4.7 CNY 2/CNY 8, GLM-4.5 CNY 1/CNY 4, GLM-4.5-Air CNY 0.8/CNY 2-8, GLM-Z1-Air (reasoning) CNY 0.5/CNY 0.5
  • Highlight: Flash series covers text, multimodal, and reasoning — broadest permanently free coverage
  • Catch: open.bigmodel.cn access is unstable outside China; 30 concurrent is fine for development, production should upgrade to paid

Volcengine Doubao (ByteDance)

Two-layer free plan: model trial quota + 2M tokens/day collaboration reward.

  • Trial mode: Major models each grant 500K tokens (one-time), automatically activated on login
  • Collaboration reward program: 2M tokens/day, auto-reset (must be manually activated in the console; covers Doubao, Qwen, DeepSeek, Kimi, MiniMax, GLM, and more)
  • No credit card required; requires identity verification
  • Paid pricing (CNY per million tokens): Doubao-Seed-2.0-mini CNY 0.2/CNY 2.0, Seed-2.0-lite CNY 0.6/CNY 3.6, Seed-2.0-pro CNY 3.2/CNY 16 (<=32K context); Doubao-1.5-lite CNY 0.3/CNY 0.6, 1.5-pro CNY 0.8/CNY 2; DeepSeek-V3 CNY 2/CNY 8, R1 CNY 4/CNY 16
  • Catch: Collaboration reward must be manually activated to take effect; Seed series pricing tiers by context length, jumping significantly above 32K

Qwen DashScope (Alibaba Cloud Bailian)

New users get 1M tokens per model, valid for 90 days; the “70M tokens” figure is a marketing total, not a per-model quota.

  • New user free quota: Approximately 70 supported models each grant 1M tokens, valid for 90 days (not permanent); summing these yields the “70M tokens” marketing figure
  • No credit card required; requires identity verification (Alibaba Cloud account)
  • Paid pricing (CNY per million tokens, <=128K input): qwen-turbo CNY 0.3/CNY 0.6 (thinking mode output CNY 3), qwen-plus CNY 0.8/CNY 2 (thinking CNY 8), qwen-max CNY 2.4/CNY 9.6, qwen3-max (<=32K) CNY 2.5/CNY 10; Batch API 50% off across the board
  • Catch: Free quota vanishes after 90 days; pricing page is JS-rendered, requires a logged-in account from outside China to see full numbers

Moonshot Kimi Open Platform

No permanent free tier; new users get a CNY 15 trial voucher; K2.6 is the current flagship, K2 series will be decommissioned on 2026-05-25.

  • New users: CNY 15 free trial voucher (requires Chinese phone number), valid for 3 months; API returns 403 once depleted
  • K2 series (K2 0711 / K2 0905): Officially decommissioned on 2026-05-25; official migration path is K2.5 / K2.6
  • Paid pricing (CNY per million tokens): Kimi K2.6 input CNY 6.50 (cache hit CNY 1.10) / output CNY 27 (256K context); Kimi K2.5 CNY 4.00 (cache CNY 0.70) / CNY 21; Moonshot V1 8K $0.20/$2.00 (USD)
  • Catch: K2.6 is roughly 60% more expensive than K2.5; rate limit tiers unlock via cumulative top-ups; no ongoing free tier for international users

Tier 3: Paid Only (Cheap Per-Token)

ServiceFreePaid Starting PriceNotes
DeepInfraNoneLlama 3.1 8B $0.02/$0.05, Qwen3-235B-A22B-Instruct $0.071/$0.10, DeepSeek-V3.2 $0.26/$0.38 (cached $0.13)Among the cheapest per-token in the market
Novita AINoneDeepSeek-V4-Flash $0.14/$0.28, Llama 3.3 70B $0.135/$0.4, Qwen3-235B $0.09/$0.58, GLM 4.5 Air $0.13/$0.85Extremely comprehensive model catalog (including audio/video), very competitive pricing
Together AINone (minimum $5 deposit required, no automatic credits)gpt-oss-20B $0.05/$0.20, gpt-oss-120B $0.15/$0.60, Llama 3.3 70B $0.88/$0.88, DeepSeek-V3.1 $0.60/$1.70Broadest model selection; Startup Accelerator offers $15K-$50K credits on application
Fireworks AI$1 signup creditsCached input automatic 50% off, batch 50% offDetailed pricing on docs.fireworks.ai subdomain
DeepSeek PlatformNonev4-flash $0.14/$0.28 (cache hit $0.0028), v4-pro 75% off promotional pricing $0.435/$0.87 (promotion ends 2026-05-31, regular price $1.74/$3.48)Cheapest for their own flagship models
xAI GrokNo fixed free tiergrok-4.3 $1.25/$2.50, grok-4-1-fast $0.20/$0.50 (retiring 2026-05-15), grok-4.20 $1.25/$2.50”Share data for $25/month” not currently mentioned on docs/models page
Perplexity SonarNoneSonar $1/$1 (token) + Search API $5/1K req; Sonar Pro $3/$15; Deep Research $2/$8 + additional surchargesPrice includes built-in web search
ReplicateNo ongoing free tierBilled per secondNot cost-effective for LLMs; primarily an image/video platform
ChutesNo true free tier (minimum $3/month subscription)$3 (Base) / $10 (Plus) / $20 (Pro)Decentralized, TEE confidential inference, fastest to list SOTA OSS models
Mistral La PlateformeNone (Le Chat chat UI is free, API has no free tier)Large 3 $0.50/$1.50, Small 4 $0.15/$0.60, Codestral $0.30/$0.90, Medium 3.5 $1.50/$7.50, Magistral Medium $2/$5; batch 50% off across the boardCodestral has moved to paid (Premier); Ministral Edge series $0.10-$0.20 per M flat
HyperbolicNoneServerless pay-as-you-go starting ~$0.10/1M tokens; GPU on-demand starting $1.39/hr (H100/H200)Also offers per-hour GPU rental and reserved clusters (contact sales)
MiniMax / HailuoNone (subscription-based, starting $10/month)M2.7 $0.30/$1.20, M2.7-highspeed $0.60/$2.40; Starter Token Plan $10/month (1,500 req/5hr)Includes Hailuo 2.3 video generation (768P 6s from $0.19 Fast); Chinese model, global API
Featherless AINone (Agent plan has 3-day trial)Basic $10/month (<=15B models, unlimited tokens); Premium $25/month (any size); Agent $100/month+30,000+ Hugging Face models, flat-rate unlimited tokens; subscription-based, not per-token
Anthropic / OpenAIPrevious trial credit policies not verified on current pricing pagesClaude Haiku 4.5 $1/$5, GPT-5.4 mini $0.75/$4.50Paid only; trying via OpenRouter / Vercel Gateway is more cost-effective

Confirmed Shutdowns

  • 01.AI Yi: English API shut down on 2025-08-25; international version no longer operational

Side Projects / Toy Demos

Stack four providers as your primary setup — all free, no credit card required:

  • Cerebras: Run large models like Qwen3-235B, gpt-oss-120b at top speed
  • Groq: Run Llama 3.3 70B, Kimi K2, Whisper (speech) — broadest model lineup
  • Cloudflare Workers AI: Run RAG / embedding, integrated with Workers / D1 / Vectorize
  • Google AI Studio: Run Gemini 3 Flash for multimodal and long context experiments

Stack these four and it’s very hard to exhaust RPM/RPD limits during daily development.

Self-Hosted / Serverless GPU

  • Modal: $30/month permanent credits to run your own vLLM/SGLang
  • NVIDIA NIM: Free for dev (exact quota unclear), broadest model catalog, official optimizations

Fallback / Routing Convenience

OpenRouter :free + HF Inference Providers PRO + Vercel AI Gateway $5/month make the backup trio.

Production Paid (Cheapest Per-Token)

  • DeepInfra (per-token king, but no free tier)
  • Novita AI (includes audio/video, extremely competitive pricing)
  • Groq (best of both speed and price)
  • DeepSeek’s own v4-flash ($0.14/$0.28)

China Market

Stack four permanently free / daily free providers for minimum effort:

  • Zhipu GLM-4.7-Flash: Permanently free, 200K context, no token cap (30 concurrent)
  • iFlytek Spark Lite: Permanently free and unlimited
  • Volcengine Doubao Collaboration Reward: Manually activate for 2M tokens/day auto-reset — best for volume
  • Tencent Hunyuan-lite: Completely free + 1M tokens on first activation

New users can additionally stack: Qwen DashScope (1M per model / 90 days) + Baidu Qianfan (CNY 20 voucher + free Qwen) + Kimi (CNY 15 voucher) for enough credits to try out models.

Common Catches to Keep in Mind

  • Free tiers typically collect your prompts for training / evaluation / safety analysis — use paid keys for production projects
  • Model deprecation moves fast: 5/15 grok-4-1-fast retiring, 5/27 Cerebras Llama 3.1 8B / Qwen3-235B, 5/31 DeepSeek v4-pro discount ending — add these to your calendar
  • RPM/RPD caps are per API key / organization — using multiple accounts to circumvent limits typically violates ToS
  • “No credit card” does not equal “forever free”: All free tiers can be adjusted without notice — don’t skip feature flags

Overall, the good news in 2026 is that free resources are far more abundant than in 2024 — individual developers have no shortage of LLM APIs. The bad news is this market layer moves extremely fast, and any roundup from six months ago is likely already inaccurate. If you’re reading this six months after publication, I recommend clicking the official links below to verify again.


References