Skip to content

2026 Q1 Open-Source LLM Landscape: From Frontier Models to On-Device, a Complete Survey

Mar 31, 2026 1 min
TL;DR 2026 Q1 saw a full-blown open-source model explosion: on the LLM front, GLM-5, Kimi K2.5, and Qwen3.5 caught up with closed-source models; Embedding and Reranker are dominated by Qwen3 and BGE; speech has Voxtral TTS and Whisper V3; image has FLUX.2; and video has Wan 2.2 rivaling Sora. This is the complete navigation map.
Table of Contents
  1. 2026 Q1 Open-Source Model Release Timeline
  2. Frontier Open-Source Models (100B+)
    1. GLM-5 / GLM-5.1 (Zhipu AI) — Highest-Ranked Open-Source
    2. Kimi K2.5 (Moonshot AI) — 1T Parameters + Agent Swarm
    3. Qwen3.5-397B-A17B (Alibaba) — Multimodal Flagship
    4. MiniMax-M2.5 — SWE-bench Rivaling Claude Opus
    5. DeepSeek V3 / R1 — Still Important but Starting to Age
    6. Llama 4 Scout / Maverick (Meta) — Ultra-Long Context
    7. gpt-oss-120b (OpenAI) — First Open-Source Model
    8. Devstral 2 (Mistral) — Code-Specialized
    9. Frontier Tier Overview
  3. Mid-Tier Models (7B—70B)
    1. Qwen3.5 Medium Series (2026/02)
    2. Mistral Small 4 (2026/03)
    3. Gemma 3 (Google, 2025/03)
    4. Devstral Small 2 (Mistral, 2025/12)
    5. InternLM3-8B (Shanghai AI Lab, 2026/01)
    6. Other Widely Used Models
  4. Mobile Small Models (Below 7B)
    1. Qwen3.5 Small Series (2026/03)
    2. Gemma 3n (Google, 2025/05—07)
    3. Other Notable Small Models
  5. Embedding Models: The Foundation of RAG
  6. Reranker Models: Boosting Retrieval Precision
  7. Code Models: Specialized for Writing Code
  8. Speech Models: STT and TTS
    1. Speech-to-Text (STT)
    2. Text-to-Speech (TTS)
  9. Image Generation Models
  10. Video Generation Models
  11. Deployment and Inference: How to Run the Models
    1. Local Development -> Ollama
    2. Production Deployment -> vLLM
    3. Edge / Serverless -> Cloudflare Workers AI
    4. Deployment Quick Reference
  12. Leaderboard Status (2026/03)
    1. LMArena (formerly Chatbot Arena)
    2. Artificial Analysis Intelligence Index
  13. How to Choose a Model: Decision Framework
    1. Step 1 — Identify Your Use Case
    2. Step 2 — Identify Your Constraints
    3. Step 3 — Just Try It
  14. How to Track the Latest Models
    1. Leaderboards / Comparison Sites
    2. Real-Time Tracking
    3. Community
  15. The Big Picture
  16. References

🌏 中文版

The pace of open-source LLM progress in Q1 2026 is too fast for any single article to cover completely. In Q1 alone, over 15 major models were released — frontier-tier MoE models broke through 1T parameters, mid-tier efficiency improved dramatically (Qwen3.5-35B achieves better performance than the previous-gen 235B flagship with only 3B active parameters), and mobile small models were compressed down to 140M for on-device inference.

This article is a navigation map. For each category, I’ll lay out the current landscape and key metrics, with details linked to dedicated articles — those cover the full technical architecture, benchmark comparisons, and hands-on experience.

2026 Q1 Open-Source Model Release Timeline

Let’s start with the big picture. Here are the major open-source model releases from January to March 2026:

DateModelDeveloperHighlights
01/15InternLM3-8BShanghai AI Lab8B, 4x data efficiency vs Llama 3.1
01/27Kimi K2.5Moonshot AI1T/32B MoE, Agent Swarm, MIT
02/11GLM-5Zhipu AI745B/44B MoE, MIT, #1 open-source
02/11MiniMax-M2.5MiniMax230B/10B MoE, SWE-bench 80.2%
02/16Qwen3.5-397BAlibaba397B/17B MoE, native multimodal, Apache 2.0
02/24Qwen3.5 MediumAlibaba122B/35B/27B three sizes
FebTiny AyaCohere3.35B, 70+ languages
03/01Qwen3.5 SmallAlibaba0.8B—9B, mobile multimodal
03/15GLM-5-TurboZhipu AIAccelerated version of GLM-5
03/16Mistral Small 4Mistral AI119B/6.5B MoE, unified reasoning + multimodal + code
03/26Voxtral TTSMistral AIOpen-source speech synthesis, 9 languages
03/27GLM-5.1Zhipu AIImproved GLM-5, code ability approaching Claude Opus

Frontier Open-Source Models (100B+)

Nearly all frontier open-source models in 2026 use the Mixture of Experts (MoE) architecture — massive total parameter counts, but only a small fraction activated per inference, striking a balance between performance and cost. Chinese labs dominate this tier.

GLM-5 / GLM-5.1 (Zhipu AI) — Highest-Ranked Open-Source

745B total parameters, 44B active parameters (256 experts, 8 activated per token), MIT license. Trained entirely on Huawei Ascend chips, without a single NVIDIA GPU. Artificial Analysis Intelligence Index #1 open-source (50 points), SWE-bench Verified 77.8%, Humanity’s Last Exam 50.4%. API pricing around $1.00/$3.20 per M tokens.

GLM-5.1, released on March 27, further improved performance, with code ability approaching Claude Opus 4.6.

-> GLM-5 Deep Dive: Zhipu AI’s 744B Open-Source Model

Kimi K2.5 (Moonshot AI) — 1T Parameters + Agent Swarm

A 1T total / 32B active parameter MoE model, MIT license (attribution required for >100M MAU or >$20M monthly revenue). Native multimodal, with the biggest highlight being Agent Swarm — capable of coordinating 100 sub-agents simultaneously and issuing 1,500 tool calls. Code and math are open-source strongest on some benchmarks.

-> Kimi Deep Dive: Moonshot AI’s Long-Context AI Model

Qwen3.5-397B-A17B (Alibaba) — Multimodal Flagship

397B total / 17B active parameters, hybrid MoE (Gated Delta Networks + sparse MoE, 512 experts, 10 routed + 1 shared per token). 262K context (expandable to 1M), native multimodal (text/image/video), supports 201 languages, Apache 2.0 license.

After the Qwen3 series (235B flagship) in April 2025, Alibaba released the fully upgraded 3.5 series in just half a year — the evolution speed is remarkable.

MiniMax-M2.5 — SWE-bench Rivaling Claude Opus

230B total / 10B active parameters MoE model, Modified MIT license. SWE-bench Verified 80.2% (matching Claude Opus 4.6), Multi-SWE-Bench #1 (51.3%), at 1/20th the cost of Claude Opus. Trained with their proprietary Forge RL framework.

MiniMax is relatively low-profile, but M2.5’s code ability is among the very best in open-source models.

DeepSeek V3 / R1 — Still Important but Starting to Age

DeepSeek-V3 (671B/37B MoE, 2024/12) and R1 (reasoning-specialized, 2025/01) were open-sourced under MIT, with training cost of only $5.6M, sending shockwaves through the industry. V3.1 (2025/08) merged V3 and R1 capabilities with hybrid thinking mode. V3.2-Exp (2025/09) introduced Sparse Attention.

But as of 2026/03/31, neither R2 nor V4 have been released — the originally planned May 2025 schedule has been delayed multiple times. CEO Liang Wenfeng reportedly wasn’t satisfied with R2’s performance and may be retraining on Huawei chips.

Llama 4 Scout / Maverick (Meta) — Ultra-Long Context

Released April 2025. Scout (109B/17B MoE, 16 experts) has a 10M context window, Maverick (400B/17B MoE, 128 experts) has 1M context. Llama 4 Community License (open-weight but not fully open-source).

Multimodal (text + image input), multilingual support, but Chinese capability still lags behind Qwen and GLM.

gpt-oss-120b (OpenAI) — First Open-Source Model

Released August 2025, 117B total / 5.1B active parameters MoE, Apache 2.0 license. Reasoning capability close to o4-mini, can run on a single 80GB GPU. Also released gpt-oss-20b (21B/3.6B).

OpenAI releasing an open-source model is itself a historic event, even though it’s no longer the most capable.

Devstral 2 (Mistral) — Code-Specialized

123B dense model, 256K context, Modified MIT. SWE-bench Verified 72.2%, 7x more cost-efficient than Claude Sonnet. Designed specifically for code generation and agentic coding.

Frontier Tier Overview

ModelTotal ParamsActive ParamsArchitectureLicenseReleasedStrength
GLM-5.1745B44BMoEMIT2026/03#1 overall open-source
Kimi K2.51T32BMoEMIT*2026/01Agent Swarm
Qwen3.5-397B397B17BMoEApache 2.02026/02Multimodal + multilingual
MiniMax-M2.5230B10BMoEModified MIT2026/02Best SWE-bench
DeepSeek-V3.1671B37BMoEMIT2025/08Lowest cost
Llama 4 Maverick400B17BMoELlama 42025/041M context
Llama 4 Scout109B17BMoELlama 42025/0410M context
gpt-oss-120b117B5.1BMoEApache 2.02025/08OpenAI open-source
Devstral 2123B123BDenseModified MIT2025/12Code-specialized

Mid-Tier Models (7B—70B)

Not every application needs a frontier model. 7B—70B models can run on a single machine or a few GPUs, hitting the sweet spot for many production scenarios.

Qwen3.5 Medium Series (2026/02)

Alibaba released three sizes on February 24, all Apache 2.0 with native multimodal:

ModelTotal ParamsActive ParamsArchitectureHighlights
Qwen3.5-122B-A10B122B10BMoEBest agentic benchmark (BFCL-V4 72.2)
Qwen3.5-35B-A3B35B3BMoESurpasses previous-gen 235B flagship
Qwen3.5-27B27B27BDenseSWE-bench Verified 72.4, matching GPT-5 mini

Qwen3.5-35B-A3B is particularly noteworthy — 3B active parameters surpassing the previous-gen flagship with 22B active parameters represents a massive leap in architectural efficiency.

Mistral Small 4 (2026/03)

119B total / 6.5B active parameters MoE, 256K context, Apache 2.0. Unifies Magistral (reasoning), Pixtral (multimodal), and Devstral (code) into a single model with adjustable reasoning intensity.

Gemma 3 (Google, 2025/03)

Based on Gemini 2.0 technology, available in 1B/4B/12B/27B sizes, multimodal (text + image), 128K context. Traditional Chinese performance on Cloudflare Workers AI outperforms Llama.

-> Gemma 3 on Cloudflare Workers AI: A Pragmatic Choice for Traditional Chinese Applications

Devstral Small 2 (Mistral, 2025/12)

24B dense, Apache 2.0. SWE-bench Verified 68.0%, runs on consumer hardware. If you only need code capability and hardware is limited, this is the best value option.

InternLM3-8B (Shanghai AI Lab, 2026/01)

8B parameters, 4x data efficiency vs Llama 3.1 (trained on 4T tokens), integrates conversational and deep thinking modes, performance matching GPT-4o-mini.

Other Widely Used Models

  • Llama 3.1 70B / 3.3 70B (2024): Most mature English ecosystem, 128K context
  • Qwen3-32B / 14B / 8B (2025/04): Apache 2.0, strong in Chinese and multilingual

Mobile Small Models (Below 7B)

1B—4B parameter models, after quantization, can achieve usable inference speeds on regular smartphones. The most important Q1 2026 advances are the Qwen3.5 Small series and continuously improving inference frameworks.

Qwen3.5 Small Series (2026/03)

0.8B / 2B / 4B / 9B four sizes, hybrid architecture (Gated DeltaNet + MoE), 262K context, native multimodal (4B and above), 201 languages, Apache 2.0. The 9B version even surpasses the previous-gen Qwen3-30B.

The top choice for Traditional Chinese and multilingual on-device scenarios.

Gemma 3n (Google, 2025/05—07)

Designed specifically for mobile, Per-Layer Embeddings (PLE) let a 5B parameter model occupy only 2GB RAM. E2B/E4B two sizes, multimodal (text/image/audio/video), optimized in partnership with Qualcomm, MediaTek, and Samsung.

Other Notable Small Models

ModelParametersLicenseStrength
Llama 3.21B / 3BLlama CommunityMost mature English ecosystem, 128K context
Phi-4-mini-flash-reasoning3.8BMITMath reasoning, 10x throughput
SmolLM33BApache 2.0Fully open-source (including training data), 128K context
MobileLLM-R1140M—950MMITBest sub-billion reasoning
Tiny Aya3.35BCC-BY-NC70+ languages, edge devices
gpt-oss-20b21B / 3.6B activeApache 2.0OpenAI small open-source model

-> Mobile Small Models Complete Comparison: Choices and Constraints in 2026

Embedding Models: The Foundation of RAG

RAG requires not just generation models but also Embedding models to vectorize text. This domain changed significantly in 2026 — Qwen3-Embedding took #1 on MTEB multilingual, Jina v4 supports multimodal embeddings, and Nomic v2 became the first to apply MoE to embedding models.

ModelDeveloperParamsDimensionsMax TokensMultilingualLicenseStrength
Qwen3-Embedding-8BAlibaba8B716832K100+ languagesApache 2.0MTEB multilingual #1 (70.58)
BGE-M3BAAI568M10248K100+ languagesMITOnly model supporting dense + sparse + ColBERT tri-mode
Jina Embeddings v4Jina AI3.8B204832K30+ languagesCC-BY-NC-4.0Multimodal (text + image + PDF)
NV-Embed-v2NVIDIA7.85B409632KPrimarily EnglishCC-BY-NC-4.0High MTEB English score
Nomic Embed v2Nomic AI475M (MoE)768512~100 languagesApache 2.0First MoE Embedding, fully open-source
EmbeddingGemma-300MGoogle300M7682K100+ languagesGemmaEdge deployment, <200MB RAM

How to choose: For Traditional Chinese RAG, use BGE-M3 (MIT, tri-mode retrieval) or Qwen3-Embedding (highest accuracy). For multimodal embedding (images + PDF), use Jina v4. For edge devices, use EmbeddingGemma.

-> BGE-M3: Why This Embedding Model Fits Traditional Chinese RAG

Reranker Models: Boosting Retrieval Precision

Embedding handles recall; Reranker handles precision ranking. A good Reranker can dramatically improve RAG answer quality.

ModelDeveloperParamsLicenseStrength
Qwen3-Reranker (0.6B/4B/8B)Alibaba0.6B—8BApache 2.0Full pipeline with Qwen3-Embedding
BGE Reranker v2-m3BAAI568MMITPairs with BGE-M3, most permissive license
Jina Reranker v3Jina AI0.6BCC-BY-NC-4.0131K context, cross-document interaction
gte-reranker-modernbert-baseAlibaba149MApache 2.0149M matches 1.2B Nemotron

Most pragmatic combinations: BGE-M3 + BGE Reranker v2-m3 (all MIT) or Qwen3-Embedding + Qwen3-Reranker (all Apache 2.0).

-> Cross-Encoder Reranking: Getting the Most Relevant Documents to the Top -> ColBERT: The Third Path for Vector Search -> SPLADE: Smarter Sparse Vector Search Than BM25

Code Models: Specialized for Writing Code

Most frontier LLMs can write code, but some models are specifically optimized for coding.

ModelDeveloperParamsLicenseHighlights
Qwen3-Coder-480B-A35BAlibaba480B/35B MoEApache 2.0Open-source coding flagship
Qwen2.5-Coder-32BAlibaba32BApache 2.0HumanEval 92.7% (surpasses GPT-4o)
Devstral 2Mistral123B denseModified MITSWE-bench 72.2%, 256K context
Devstral Small 2Mistral24BApache 2.0SWE-bench 68%, runs on consumer hardware
StarCoder2-15BBigCode15BOpenRAIL-M600+ programming languages, widest coverage
DeepSeek-Coder-V2DeepSeek236B/21B MoEDeepSeek License128K context, code + math

Speech Models: STT and TTS

Speech-to-Text (STT)

ModelDeveloperParamsWERLanguagesLicense
Canary Qwen 2.5BNVIDIA2.5B5.63%EnglishCC-BY-4.0
Granite Speech 3.3 8BIBM~9B5.85%Multilingual + translationApache 2.0
Whisper Large V3OpenAI1.55B7.4%99+ languagesMIT
Whisper Large V3 TurboOpenAI809M7.75%99+ languagesMIT
Parakeet TDT 1.1BNVIDIA1.1B~8%EnglishCC-BY-4.0

For multilingual, go with Whisper V3 (MIT, 99+ languages). For highest English accuracy, use Canary Qwen. For real-time streaming, use Parakeet TDT (>2000x real-time speed).

Text-to-Speech (TTS)

ModelDeveloperParamsLanguagesLicenseStrength
Voxtral TTSMistral4B9 languagesApache 2.03-second voice cloning, 70ms latency, released 2026/03
Kokorohexgrad82MMultilingualApache 2.0High-quality synthesis at 82M, runs on CPU
Fish Speech V1.5Fish AudioChinese/English multilingual300K hours training, DualAR architecture
Parler TTSHugging Face~600MEnglishApache 2.0Prompt-controllable tone and style

Voxtral TTS is the standout release of 2026/03 — 3 seconds of audio is all it needs to clone a voice, supports streaming, Apache 2.0 license.

Image Generation Models

ModelDeveloperParamsLicenseStrength
FLUX.2 DevBlack Forest Labs32BOpen weights (non-commercial)Best text rendering and characters
FLUX.2 Klein 4BBlack Forest Labs4BApache 2.0Consumer GPU instant generation
Stable Diffusion 3.5 LargeStability AI8.1BFree for revenue <$1MRuns on 12GB VRAM
SDXLStability AI~3.5BCreativeML Open RAIL-MMost mature ecosystem (LoRA, ControlNet)

Video Generation Models

ModelDeveloperParamsLicenseStrength
Wan 2.2Alibaba5B/14BApache 2.0Cinema-grade quality, 5B runs on consumer GPU
HunyuanVideoTencent13BTencent LicenseQuality rivaling Runway Gen-3
Open-Sora 2.0HPC-AI Tech11BOpen-sourceTraining cost only $200K, approaching OpenAI Sora
Mochi 1Genmo10BApache 2.0Commercial-friendly

Open-source video generation made the most dramatic progress in 2026 — Wan 2.2 and HunyuanVideo quality directly rivals Sora and Veo.

Deployment and Inference: How to Run the Models

After choosing a model, you still need to choose how to run it. The 2026 inference ecosystem is quite mature, but optimal choices vary significantly by scenario.

Local Development -> Ollama

Download and launch models with a single command, Docker-style CLI + OpenAI-compatible API. Ideal for personal development, prototyping, and offline use. Not suitable for high-concurrency production environments.

-> Ollama Complete Guide: Run LLMs Locally with One Command

Production Deployment -> vLLM

PagedAttention + continuous batching + prefix caching — currently the most mainstream open-source LLM inference engine. Ideal for API services requiring high throughput.

-> vLLM: From PagedAttention to Production-Grade LLM Inference Engine

Edge / Serverless -> Cloudflare Workers AI

If you don’t want to manage GPUs, Cloudflare Workers AI provides zero-ops inference services. Model selection is limited, but Gemma 3 12B outperforms Llama for Traditional Chinese.

-> Gemma 3 on Cloudflare Workers AI: A Pragmatic Choice for Traditional Chinese Applications

Deployment Quick Reference

What do you need?
├── Local experiments / prototyping -> Ollama
├── Production API service         -> vLLM (self-managed GPU) or Cloudflare Workers AI (managed)
├── Mobile app                     -> llama.cpp + GGUF or Google AI Edge
└── Offline / privacy              -> Ollama or on-device models

Leaderboard Status (2026/03)

The gap between open-source and closed-source models is closing rapidly, though frontier closed-source models still lead.

LMArena (formerly Chatbot Arena)

5,632,160 votes, 333 models. Top 9 are all closed-source (Claude Opus 4.6 leads at 1504 Elo), with GLM-5 family and Kimi K2.5 as the highest-ranked open-source models.

Artificial Analysis Intelligence Index

314 models. Highest closed-source score: 57 (Gemini 3.1 Pro Preview), highest open-source score: 50 (GLM-5 Reasoning) — the gap has narrowed from 20+ points a year ago to just 7 points.

Fastest model: Mercury 2 (789.2 tok/s). Cheapest model: Gemma 3n E4B ($0.03/M tokens).

How to Choose a Model: Decision Framework

Step 1 — Identify Your Use Case

ScenarioRecommended Direction
General conversation / ChatbotGLM-5.1, Kimi K2.5, Qwen3.5-397B
Code generation / Agentic CodingMiniMax-M2.5, GLM-5.1, Devstral 2
Traditional Chinese RAGGemma 3 12B (generation) + BGE-M3 (Embedding)
Multilingual applicationsQwen3.5 series (201 languages)
Mobile offlineGemma 3n, Qwen 3.5 Small
Math / ReasoningPhi-4-mini (mobile), DeepSeek-R1 (cloud)
Ultra-long textLlama 4 Scout (10M context), Kimi K2.5
Extremely low budgetCloudflare Workers AI (free tier) + Gemma 3

Step 2 — Identify Your Constraints

  • GPU resources: No GPU -> Ollama CPU mode or Cloudflare Workers AI; Have GPU -> vLLM
  • Language needs: Traditional Chinese -> Qwen or Gemma first; English -> Llama or Phi first
  • License needs: Commercial use -> MIT / Apache 2.0 (GLM-5, Qwen3.5, SmolLM3); watch out for Llama and Modified MIT restrictions
  • Privacy needs: Can’t send to cloud -> local deployment or on-device

Step 3 — Just Try It

Benchmarks and real-world performance don’t always align. Running a test with your own data is more useful than reading any leaderboard. Ollama lets you evaluate a model in five minutes — the barrier is too low to have any excuse not to try.

How to Track the Latest Models

Open-source models iterate extremely fast, and by the time this article is published, there might already be something new next week. Here are the channels I regularly follow:

Leaderboards / Comparison Sites

  • Artificial Analysis: Independent measurement, 72-hour update cycle, 314+ models, includes speed (tokens/sec) and price comparisons, filterable by model size — especially useful for tracking cost-effectiveness
  • LiveBench: Monthly new questions from latest arXiv papers and news, avoids benchmark gaming, covers math, code, and reasoning
  • LMArena (formerly Chatbot Arena): Crowd-sourced blind A/B comparisons producing Elo ratings. Closer to “how it actually feels to use” than benchmarks
  • LiveBench: Monthly new questions from latest arXiv papers and news, avoids benchmark gaming

Real-Time Tracking

Community

  • r/LocalLLaMA: The most active local model community on Reddit; first-hand benchmarks, quantized versions, and hands-on reviews mostly originate here
  • The comment sections on Hugging Face Daily Papers also frequently feature in-depth discussions between model authors and the community

Recommended strategy: Track new releases with LLM Stats -> check community reaction on Hugging Face Trending -> compare numbers on Artificial Analysis / LMArena -> read hands-on feedback on r/LocalLLaMA. But ultimately, you still need to test with your own data — benchmarks and real-world performance don’t always align.

The Big Picture

Five structural trends defined the open-source LLM landscape in Q1 2026:

  1. MoE dominates: Nearly all frontier models are MoE, with active parameters controlled between 10—44B; inference cost no longer scales linearly with total parameter count
  2. Chinese labs lead: GLM-5, Kimi K2.5, Qwen3.5, MiniMax-M2.5, DeepSeek — 4 out of the top 5 frontier open-source models come from China
  3. Native multimodal: Qwen3.5, Gemma 3n, Kimi K2.5 all integrate vision from pre-training, no longer bolted-on adapters
  4. MIT/Apache 2.0 as standard: Frontier open-source model licenses are becoming increasingly permissive, dramatically lowering the commercial use barrier
  5. Open-source catching up with closed-source: GLM-5 scored 50 on Artificial Analysis, closed-source highest is 57 — a year ago the gap was over 20 points

The biggest structural shift: Model selection decisions are moving from “open-source vs closed-source” to “self-hosted vs managed.” Technical capability is no longer the bottleneck — operational capability is.


References

In-site articles:

External resources: