- How the Three Tiers Are Defined
- Tier 1: Auto-Replenishing Daily Quotas
- Tier 2: Small Monthly / One-Time / Strict Rate Limits
- Completely Free (No SLA, Experimental)
- Chinese-Origin Providers (Verified Free Tiers)
- Tier 3: Paid Only (Cheap Per-Token)
- Confirmed Shutdowns
- Recommended Combinations
- Common Catches to Keep in Mind
- References
🌏 中文版
For side projects, toy demos, and RAG prototypes, nobody wants to swipe a credit card on day one. The problem is there are too many LLM inference providers, pricing pages change too fast, formerly free options may have been killed, and formerly paid ones may have gone permanently free. This article tiers 40+ options still operating as of 2026/05 by the nature of their free resources, noting credit-card requirements, key supported models, paid starting prices, and the catch for each free tier.
All numbers below are cross-referenced directly from official pricing pages. Where something couldn’t be verified, it’s explicitly marked “unverified” — nothing is fabricated to fill the table.
How the Three Tiers Are Defined
The key distinction is whether free resources auto-replenish or are one-time / hard-capped:
- Tier 1: Daily/per-minute auto-resetting quotas, generous enough for daily development (thousands to tens of thousands of requests per day), on the provider’s own inference infrastructure. Use as your primary API.
- Tier 2: Small monthly credits, one-time signup credits, or strict rate limits. Fine for experimenting, trying models, or as a fallback; will hit walls as a daily driver.
- Tier 3: Paid only, no ongoing free tier. Focus on per-token pricing.
Two additional standalone sections cover: Completely Free (no SLA, experimental) and Chinese-origin providers (verified free tiers).
Tier 1: Auto-Replenishing Daily Quotas
Cerebras Inference
Wafer-scale LPU, 1000-3000 tps speed, tied with Groq as the “fastest + most generous free tier” in Tier 1.
- Free quota: 30 RPM, 900 RPH, 14,400 RPD, 60K TPM, 1M TPH, 1M TPD per model (GLM-4.7 is tighter: 10 RPM, 100 RPD)
- No credit card required — sign up and get an API key
- Popular models: gpt-oss-120b, Qwen3-235B-Instruct, Llama 3.1 8B, ZAI GLM-4.7
- Paid starting price (Developer tier, requires $10 deposit): Llama 3.1 8B $0.10/$0.10, gpt-oss-120b $0.35/$0.75, Qwen3-235B $0.60/$1.20, GLM 4.7 $2.25/$2.75
- Highlight: All major models get 14.4K RPD — the most consistently generous free quotas
- Catch: Llama 3.1 8B and Qwen3-235B-Instruct will be deprecated on 2026-05-27
Groq
LPU at 500-1000 tps, broadest open model lineup, most models on the free tier (including speech, moderation, and agentic).
- Free quota (varies by model, numbers taken directly from console.groq.com/docs/rate-limits):
llama-3.1-8b-instant: 30 RPM / 14.4K RPD / 6K TPM / 500K TPDllama-3.3-70b-versatile: 30 RPM / 1K RPD / 12K TPM / 100K TPDmeta-llama/llama-4-scout-17b: 30 RPM / 1K RPD / 30K TPM / 500K TPDopenai/gpt-oss-120b/gpt-oss-20b: 30 RPM / 1K RPD / 8K TPM / 200K TPDqwen/qwen3-32b: 60 RPM / 1K RPD / 6K TPM / 500K TPD- Plus Whisper, Llama Guard, Compound (agentic), and more
- No credit card required
- Paid starting price: Llama 3.3 70B $0.59/$0.79, gpt-oss-20b $0.075/$0.30, gpt-oss-120b $0.15/$0.60, cached input 50% off
- Highlight: Broadest model lineup (including speech, moderation, agentic); Llama 3.1 8B at 14.4K RPD matches Cerebras
- Catch: Heavy models only get 1K RPD — you’ll bottleneck on volume (upgrade to Developer to unlock more)
Cloudflare Workers AI
Broadest model catalog, included with the Workers Free plan.
- Free quota: 10,000 Neurons/day (available on both Free and Paid accounts)
- No credit card required
- Popular models: Llama 3.3 70B, gpt-oss-20b/120b, Qwen3-30B, DeepSeek-R1-distill, Kimi K2.6, GLM-4.7-flash, Gemma 3
- Paid starting price: $0.011 / 1,000 Neurons; Llama 3.3 70B fp8-fast $0.293/$2.253, gpt-oss-120b $0.35/$0.75, gpt-oss-20b $0.20/$0.30
- Catch: Neurons conversion means daily free volume is small (Llama 3.3 70B roughly 37K input + 5K output tokens) — heavy models burn through it fast
Google AI Studio (Gemini API)
Gemini 3 series official pipeline, 1.2M context window included.
- Free quota: Free tier is entirely free, no credit card required (specific RPM/RPD shown dynamically in the AI Studio UI; official public pages don’t list exact numbers)
- Popular models: Gemini 3 Pro Preview (actual model ID:
gemini-3.1-pro-preview), Gemini 3 Flash Preview, Gemini 2.5 Pro/Flash/Flash-Lite - Paid starting price: Gemini 2.5 Flash-Lite $0.10/$0.40, 2.5 Flash $0.30/$2.50, Gemini 3 Flash Preview $0.50/$3.00, Gemini 3 Pro Preview $2/$12 (<=200K context)
- Catch: Free tier prompts and outputs are used for model training (officially documented) — production projects should upgrade to Tier 1 (requires credit card) to disable this
Tier 2: Small Monthly / One-Time / Strict Rate Limits
(a) Small Monthly Credits (gone once used up that month)
Hugging Face Inference Providers
- Free $0.10/month, PRO $2/month, Team / Enterprise $2/seat/month
- No credit card required (uses monthly credits); zero markup, routes to Cerebras / Groq / Together / Fireworks / SambaNova / Hyperbolic behind the scenes
- Catch: Free $0.10 is minuscule; PRO is where it starts being usable
- $5/month credits (clock starts on first request)
- Standard provider pricing, BYOK also zero markup
- Catch: Once $5 is used up, you need to top up
- Starter $30/month permanent free credits, including 100 containers + 10 GPU concurrency
- No credit card required
- Highlight: Serverless GPU to run your own vLLM/SGLang, billed per second (H100 ~$3.95/hr)
- Catch: You deploy models yourself — this isn’t a ready-made token API
(b) One-Time Signup Credits
- Sign up for $5 credits (valid 30 days); after credits expire, the Free tier persists (doesn’t disappear)
- Free tier (no credit card required): DeepSeek-V3.1, Llama 3.3 70B, gpt-oss-120b each at 20 RPM / 20 RPD / 200K TPD
- RDU chip, speed on par with Groq / Cerebras
- Paid: Llama 3.3 70B $0.60/$1.20, gpt-oss-120b $0.22/$0.59, DeepSeek-V3.1 $0.15/$0.75
- Catch: Developer tier (requires credit card) to unlock 60 RPM / 12K RPD
- $25 one-time free credits
- Claims 90% cheaper than OpenAI
- Key models: Nemotron 3 Super $2.50/$5, Schematron series (specialized small models), Gemma 3
- Catch: Model selection skews research-oriented
- $10 / 7-day trial, no credit card required
- Jamba Mini $0.2/$0.4, Jamba Large $2/$8
- Highlight: Jamba long context, Mamba architecture
- Catch: Trial expires after 7 days
- New workspace gets $30 one-time free credits (per official changelog)
- Basic plan $0/month, pay-as-you-go; DeepSeek V4 $1.74/$3.48, gpt-oss-120B $0.10/$0.50, Kimi K2.6 $1.00/$3.90
- Highlight: Supports both Model API (token-based billing) and Dedicated GPU Deployment (per-minute billing, starting at T4 $0.01052/min)
- Catch: Need to top up after $30 is spent; rate limits are low (Basic unverified at 15 RPM / 100K TPM)
(c) Strict Rate Limits (no large token quotas)
:freemodels at 20 RPM; cumulative purchases <$10 -> 50 RPD; purchases >=$10 -> 1000 RPD- No credit card required for free models (DeepSeek-V3, Llama 3.3 70B, Qwen3, etc.)
- Paid requests forwarded at provider cost, zero markup
- Catch:
:freemodels have worse context and throughput, may fall back, and prompts may be collected by providers
- Copilot Free/Pro: Low models 15 RPM / 150 RPD; High models 10 RPM / 50 RPD; Embedding 15 RPM / 150 RPD; most limited to 8K input / 4K output
- The only legitimate free channel to try GPT-5 / o3 (also includes o4-mini, Llama, Phi, Mistral, DeepSeek-R1, Grok-3)
- Catch: Quotas are very tight — only enough to dip your toes
Cohere Trial Key
- 1,000 calls/month; Chat 20 RPM, Embed 2,000 inputs/min, Rerank 10 RPM
- No credit card required; Command A, Embed, Rerank are well-suited for RAG
- Catch: 1,000 calls/month runs out fast
(d) Quota Unclear but Confirmed Free Dev Tier
NVIDIA NIM (build.nvidia.com)
- Sign up for 1,000 inference credits; providing a business email can unlock an additional 4,000 (5,000 total), along with a 90-day NVIDIA AI Enterprise free trial
- Credits don’t expire; 40 RPM (can request increase to 200 RPM)
- Broadest model lineup: Nemotron-3 Super 120B, DeepSeek V4, Llama 3.3 70B, Kimi K2, Qwen3.5 122B, gpt-oss, Gemma 4, GLM-5.1
- Highlight: Official NVIDIA-optimized; enterprise version requires DGX Cloud entitlement
- Catch: Credits are for development / prototyping, not production use
Nebius Token Factory (the company that acquired Tavily)
- New accounts get $1 trial credit (valid 30 days); credit card required to complete onboarding
- Models: gpt-oss-120B, Kimi-K2, Hermes-4-405B, GLM-4.5, Qwen3-Coder-480B, DeepSeek-R1-0528
- Highlight: Sub-second latency, SOC2/HIPAA, US/EU regions
- Catch: $1 is tiny — basically enough for one or two requests
Completely Free (No SLA, Experimental)
Pollinations.ai
- Completely free, pollen auto-replenishes (Seed 0.15 pollen/hr, Flower 0.4 pollen/hr)
- OpenAI-compatible API, no credit card required
- Key models: Gemma 4 26B, Seedance 2.0 video, text embedding
- Suitable for prototypes, not for SLA requirements
AI Horde
- Completely free + anonymous access (API key
0000000000works directly) - Community volunteer GPUs, ~441 tokens/sec, NLnet/NGI0 funded
- Highlight: Contribute GPU to earn kudos for priority
- Catch: Speed depends on current volunteer count, model availability fluctuates, absolutely never use in production
Ollama (Local Inference)
Local model runner — install it on your machine and run open-source LLMs; also offers a cloud tier for models too large for consumer hardware.
- Local inference: Completely free and unlimited, runs on your own GPU/CPU, supports offline use
- Cloud free tier: 1 concurrent model, limited GPU time (per session every 5 hours, weekly auto-reset every 7 days)
- No credit card required (both local and cloud free tier)
- Paid Pro $20/month ($200/year): 3 concurrent cloud models, 50x more cloud usage, private model uploads
- Paid Max $100/month: 10 concurrent cloud models
- Model library: Qwen3.5, Gemma 4, DeepSeek V4, Kimi K2.6, GLM-5.1, Mistral Medium 3.5, Llama series, and hundreds more
- OpenAI-compatible REST API: Just change the base URL for a seamless switch from OpenAI; supports tool calling
- Cloud-only models (too large for local): DeepSeek V4 Pro 684B MoE, Kimi K2.6, and other massive MoE models
- Privacy: Neither local nor cloud prompts/responses are logged or used for training; cloud runs on NVIDIA Cloud (US/EU/Singapore), zero data retention
- Catch: Cloud tier is limited by GPU time rather than token count — high concurrency requires a paid plan; only runs open models, no GPT / Claude
Chinese-Origin Providers (Verified Free Tiers)
Chinese-origin providers generally offer ongoing free tiers or aggressive promotions, but their pricing pages are notoriously hostile to scraping from outside China. Below are the ones where specific numbers were directly verified this round:
iFlytek Spark Lite (Xunfei)
- Spark Lite model permanently free and unlimited
- Individual verification grants 200K tokens; enterprise gets 1M tokens
- Paid: Spark X2 CNY 2-3/M, X2 Flash CNY 1-2/M, Ultra CNY 0.8/M, Pro CNY 5/M
- The most generous free tier among Chinese-origin providers; requires identity verification
Tencent Hunyuan (Tencent)
- First activation grants 1M tokens valid for one year (shared across Hunyuan 2.0 Think/Instruct/T1/TurboS/a13b/Vision/embedding)
- Hunyuan-lite completely free
- Paid: HY 2.0 Think CNY 3.975/CNY 15.9 per M, Hunyuan-T1 CNY 1/CNY 4
- Transparent and genuine free tier from a major tech company
Baidu Qianfan
- Sign up for a CNY 20 voucher (platform-wide, no minimum spend, valid for 1 month)
- Qwen3.5-2B inference free and unlimited; Qwen-Image-2512 temporarily free
- Comprehensive model marketplace: DeepSeek-V4, ERNIE 5.0, ERNIE 4.5 Turbo, Kimi-K2.5, MiniMax-M2.1, Qwen3-VL-32B, GLM 5.1
- Requires identity verification
Zhipu GLM (Zhipu AI)
Multiple Flash models are permanently free, making this one of the most generous free tiers among Chinese-origin providers.
- Permanently free models: GLM-4-Flash (128K), GLM-4.7-Flash (200K), GLM-4.5-Flash, GLM-4V-Flash (multimodal vision), and more — no token cap, 30 concurrent limit
- No credit card required; requires identity verification
- New user bonus: 20M tokens (GLM-4.5-Air equivalent, market value CNY 58)
- Paid pricing (CNY per million tokens): GLM-5.1 CNY 6/CNY 24, GLM-4.7 CNY 2/CNY 8, GLM-4.5 CNY 1/CNY 4, GLM-4.5-Air CNY 0.8/CNY 2-8, GLM-Z1-Air (reasoning) CNY 0.5/CNY 0.5
- Highlight: Flash series covers text, multimodal, and reasoning — broadest permanently free coverage
- Catch: open.bigmodel.cn access is unstable outside China; 30 concurrent is fine for development, production should upgrade to paid
Volcengine Doubao (ByteDance)
Two-layer free plan: model trial quota + 2M tokens/day collaboration reward.
- Trial mode: Major models each grant 500K tokens (one-time), automatically activated on login
- Collaboration reward program: 2M tokens/day, auto-reset (must be manually activated in the console; covers Doubao, Qwen, DeepSeek, Kimi, MiniMax, GLM, and more)
- No credit card required; requires identity verification
- Paid pricing (CNY per million tokens): Doubao-Seed-2.0-mini CNY 0.2/CNY 2.0, Seed-2.0-lite CNY 0.6/CNY 3.6, Seed-2.0-pro CNY 3.2/CNY 16 (<=32K context); Doubao-1.5-lite CNY 0.3/CNY 0.6, 1.5-pro CNY 0.8/CNY 2; DeepSeek-V3 CNY 2/CNY 8, R1 CNY 4/CNY 16
- Catch: Collaboration reward must be manually activated to take effect; Seed series pricing tiers by context length, jumping significantly above 32K
Qwen DashScope (Alibaba Cloud Bailian)
New users get 1M tokens per model, valid for 90 days; the “70M tokens” figure is a marketing total, not a per-model quota.
- New user free quota: Approximately 70 supported models each grant 1M tokens, valid for 90 days (not permanent); summing these yields the “70M tokens” marketing figure
- No credit card required; requires identity verification (Alibaba Cloud account)
- Paid pricing (CNY per million tokens, <=128K input): qwen-turbo CNY 0.3/CNY 0.6 (thinking mode output CNY 3), qwen-plus CNY 0.8/CNY 2 (thinking CNY 8), qwen-max CNY 2.4/CNY 9.6, qwen3-max (<=32K) CNY 2.5/CNY 10; Batch API 50% off across the board
- Catch: Free quota vanishes after 90 days; pricing page is JS-rendered, requires a logged-in account from outside China to see full numbers
Moonshot Kimi Open Platform
No permanent free tier; new users get a CNY 15 trial voucher; K2.6 is the current flagship, K2 series will be decommissioned on 2026-05-25.
- New users: CNY 15 free trial voucher (requires Chinese phone number), valid for 3 months; API returns 403 once depleted
- K2 series (K2 0711 / K2 0905): Officially decommissioned on 2026-05-25; official migration path is K2.5 / K2.6
- Paid pricing (CNY per million tokens): Kimi K2.6 input CNY 6.50 (cache hit CNY 1.10) / output CNY 27 (256K context); Kimi K2.5 CNY 4.00 (cache CNY 0.70) / CNY 21; Moonshot V1 8K $0.20/$2.00 (USD)
- Catch: K2.6 is roughly 60% more expensive than K2.5; rate limit tiers unlock via cumulative top-ups; no ongoing free tier for international users
Tier 3: Paid Only (Cheap Per-Token)
| Service | Free | Paid Starting Price | Notes |
|---|---|---|---|
| DeepInfra | None | Llama 3.1 8B $0.02/$0.05, Qwen3-235B-A22B-Instruct $0.071/$0.10, DeepSeek-V3.2 $0.26/$0.38 (cached $0.13) | Among the cheapest per-token in the market |
| Novita AI | None | DeepSeek-V4-Flash $0.14/$0.28, Llama 3.3 70B $0.135/$0.4, Qwen3-235B $0.09/$0.58, GLM 4.5 Air $0.13/$0.85 | Extremely comprehensive model catalog (including audio/video), very competitive pricing |
| Together AI | None (minimum $5 deposit required, no automatic credits) | gpt-oss-20B $0.05/$0.20, gpt-oss-120B $0.15/$0.60, Llama 3.3 70B $0.88/$0.88, DeepSeek-V3.1 $0.60/$1.70 | Broadest model selection; Startup Accelerator offers $15K-$50K credits on application |
| Fireworks AI | $1 signup credits | Cached input automatic 50% off, batch 50% off | Detailed pricing on docs.fireworks.ai subdomain |
| DeepSeek Platform | None | v4-flash $0.14/$0.28 (cache hit $0.0028), v4-pro 75% off promotional pricing $0.435/$0.87 (promotion ends 2026-05-31, regular price $1.74/$3.48) | Cheapest for their own flagship models |
| xAI Grok | No fixed free tier | grok-4.3 $1.25/$2.50, grok-4-1-fast $0.20/$0.50 (retiring 2026-05-15), grok-4.20 $1.25/$2.50 | ”Share data for $25/month” not currently mentioned on docs/models page |
| Perplexity Sonar | None | Sonar $1/$1 (token) + Search API $5/1K req; Sonar Pro $3/$15; Deep Research $2/$8 + additional surcharges | Price includes built-in web search |
| Replicate | No ongoing free tier | Billed per second | Not cost-effective for LLMs; primarily an image/video platform |
| Chutes | No true free tier (minimum $3/month subscription) | $3 (Base) / $10 (Plus) / $20 (Pro) | Decentralized, TEE confidential inference, fastest to list SOTA OSS models |
| Mistral La Plateforme | None (Le Chat chat UI is free, API has no free tier) | Large 3 $0.50/$1.50, Small 4 $0.15/$0.60, Codestral $0.30/$0.90, Medium 3.5 $1.50/$7.50, Magistral Medium $2/$5; batch 50% off across the board | Codestral has moved to paid (Premier); Ministral Edge series $0.10-$0.20 per M flat |
| Hyperbolic | None | Serverless pay-as-you-go starting ~$0.10/1M tokens; GPU on-demand starting $1.39/hr (H100/H200) | Also offers per-hour GPU rental and reserved clusters (contact sales) |
| MiniMax / Hailuo | None (subscription-based, starting $10/month) | M2.7 $0.30/$1.20, M2.7-highspeed $0.60/$2.40; Starter Token Plan $10/month (1,500 req/5hr) | Includes Hailuo 2.3 video generation (768P 6s from $0.19 Fast); Chinese model, global API |
| Featherless AI | None (Agent plan has 3-day trial) | Basic $10/month (<=15B models, unlimited tokens); Premium $25/month (any size); Agent $100/month+ | 30,000+ Hugging Face models, flat-rate unlimited tokens; subscription-based, not per-token |
| Anthropic / OpenAI | Previous trial credit policies not verified on current pricing pages | Claude Haiku 4.5 $1/$5, GPT-5.4 mini $0.75/$4.50 | Paid only; trying via OpenRouter / Vercel Gateway is more cost-effective |
Confirmed Shutdowns
- 01.AI Yi: English API shut down on 2025-08-25; international version no longer operational
Recommended Combinations
Side Projects / Toy Demos
Stack four providers as your primary setup — all free, no credit card required:
- Cerebras: Run large models like Qwen3-235B, gpt-oss-120b at top speed
- Groq: Run Llama 3.3 70B, Kimi K2, Whisper (speech) — broadest model lineup
- Cloudflare Workers AI: Run RAG / embedding, integrated with Workers / D1 / Vectorize
- Google AI Studio: Run Gemini 3 Flash for multimodal and long context experiments
Stack these four and it’s very hard to exhaust RPM/RPD limits during daily development.
Self-Hosted / Serverless GPU
- Modal: $30/month permanent credits to run your own vLLM/SGLang
- NVIDIA NIM: Free for dev (exact quota unclear), broadest model catalog, official optimizations
Fallback / Routing Convenience
OpenRouter :free + HF Inference Providers PRO + Vercel AI Gateway $5/month make the backup trio.
Production Paid (Cheapest Per-Token)
- DeepInfra (per-token king, but no free tier)
- Novita AI (includes audio/video, extremely competitive pricing)
- Groq (best of both speed and price)
- DeepSeek’s own v4-flash ($0.14/$0.28)
China Market
Stack four permanently free / daily free providers for minimum effort:
- Zhipu GLM-4.7-Flash: Permanently free, 200K context, no token cap (30 concurrent)
- iFlytek Spark Lite: Permanently free and unlimited
- Volcengine Doubao Collaboration Reward: Manually activate for 2M tokens/day auto-reset — best for volume
- Tencent Hunyuan-lite: Completely free + 1M tokens on first activation
New users can additionally stack: Qwen DashScope (1M per model / 90 days) + Baidu Qianfan (CNY 20 voucher + free Qwen) + Kimi (CNY 15 voucher) for enough credits to try out models.
Common Catches to Keep in Mind
- Free tiers typically collect your prompts for training / evaluation / safety analysis — use paid keys for production projects
- Model deprecation moves fast: 5/15 grok-4-1-fast retiring, 5/27 Cerebras Llama 3.1 8B / Qwen3-235B, 5/31 DeepSeek v4-pro discount ending — add these to your calendar
- RPM/RPD caps are per API key / organization — using multiple accounts to circumvent limits typically violates ToS
- “No credit card” does not equal “forever free”: All free tiers can be adjusted without notice — don’t skip feature flags
Overall, the good news in 2026 is that free resources are far more abundant than in 2024 — individual developers have no shortage of LLM APIs. The bad news is this market layer moves extremely fast, and any roundup from six months ago is likely already inaccurate. If you’re reading this six months after publication, I recommend clicking the official links below to verify again.
References
- Cerebras Inference Rate Limits
- Cerebras Inference Pricing
- Groq Rate Limits
- Groq Pricing
- Cloudflare Workers AI Pricing
- Google Gemini API Pricing
- Google Gemini API Rate Limits
- OpenRouter API Limits
- GitHub Models Prototyping Limits
- Hugging Face Inference Providers Pricing
- Vercel AI Gateway Pricing
- Cohere Rate Limits
- Together AI Pricing
- Fireworks AI Pricing
- DeepInfra Pricing
- Novita AI Pricing
- DeepSeek API Pricing
- xAI Models
- Perplexity API Pricing
- SambaNova Cloud Pricing
- Modal Pricing
- NVIDIA build.nvidia.com
- Nebius Token Factory
- Inference.net Pricing
- AI21 Pricing
- Pollinations.ai
- AI Horde
- iFlytek Spark API
- Tencent Hunyuan Pricing
- Baidu Qianfan
- Ollama
- Mistral La Plateforme Pricing
- Hyperbolic Docs
- MiniMax Platform Pricing
- Featherless AI
- Baseten Pricing
- Zhipu GLM BigModel Pricing
- Volcengine Doubao Free Quota
- Moonshot Kimi API Pricing
- Qwen DashScope Model Pricing
Loading...