- What Is Ollama
- Core Capabilities at a Glance
- Supported Models
- Recent New Features (2025-2026)
- Installation
- Environment Variables and Advanced Configuration
- CLI Commands
- API
- Hardware Requirements
- Modelfile Customization
- Importing Custom Models
- Comparison with Other Solutions
- Ecosystem
- Limitations and Caveats
- Debugging and Troubleshooting
- The Big Picture
- References
When running LLMs locally, the most common hurdles are: converting model formats, manually allocating GPU memory, and picking quantization parameters yourself. Ollama wraps all of this up, letting you download and launch a model with a single command. This post is a comprehensive look at Ollama’s design, usage, and real-world limitations.
What Is Ollama
Ollama is an open-source platform (MIT license) for running large language models on local machines. Under the hood it uses the llama.cpp inference engine, with a Docker-style CLI and REST API layered on top.
Core design philosophy: manage models like you manage containers. Model weights, configuration, and runtime settings are bundled into a single package defined by a Modelfile. Model layers are cached like container image layers, so layers shared between models are downloaded only once.
```shell
# This single line does three things: downloads the model, configures GPU, and starts an interactive chat
ollama run llama3.2
```
As of 2026 Q1, Ollama sees 52 million monthly downloads and has over 100,000 stars on GitHub.
Core Capabilities at a Glance
Ollama is more than just a CLI tool — it’s a complete local LLM runtime platform:
- One-command model management: `ollama run`, `ollama pull`, `ollama rm`
- Automatic GPU detection: NVIDIA CUDA, AMD ROCm, and Apple Metal are all auto-detected
- Automatic VRAM management: multiple models can be loaded simultaneously, overflowing to RAM when VRAM is exceeded
- OpenAI-compatible API: `localhost:11434/v1/` can directly replace an OpenAI endpoint
- Modelfile system: configuration files similar to Dockerfiles
- Multimodal: supports vision models (Gemma 3, Llama 3.2 Vision, LLaVA)
- Structured output: JSON Schema-constrained response formats
- Tool calling: function calling support
- Embeddings: built-in embedding endpoint
Supported Models
The full list is at ollama.com/library. Here are the highlights:
General chat: Llama 3.1/3.2/4 (Meta), Mistral/Mixtral (Mistral AI), Qwen 2.5/3 (Alibaba), Gemma 2/3 (Google), Phi-3/4 (Microsoft), GPT-OSS (OpenAI open-source models)
Reasoning: DeepSeek R1, DeepSeek-v3.1 (various distilled sizes)
Code: Qwen 2.5-Coder, CodeLlama, Qwen3-Coder
Vision: Gemma 3 (officially recommended), Llama 3.2 Vision, LLaVA
Embeddings: embeddinggemma, qwen3-embedding, all-minilm (official top three picks)
Models not in the official library can be manually imported as long as they’re in GGUF format.
Recent New Features (2025-2026)
Ollama has added several noteworthy features over the past year:
Thinking/Reasoning Mode
Ollama supports a thinking mode for models such as Qwen 3, DeepSeek R1, DeepSeek-v3.1, and GPT-OSS. Responses are split into two fields, thinking (the reasoning process) and content (the final answer), and you can choose to show or hide the reasoning chain.
```shell
# Enable thinking (on by default for compatible models)
ollama run deepseek-r1 --think "How many r's in strawberry?"
# Hide reasoning, show only the answer
ollama run deepseek-r1 --hidethinking "Explain quantum entanglement"
# Toggle in interactive mode
>>> /set think
>>> /set nothink
```
GPT-OSS is special — thinking isn’t a boolean but has levels (low/medium/high):
```shell
ollama run gpt-oss --think=low "Simple question"
```
At the API level, adding "think": true to a chat request returns the reasoning in message.thinking, separate from the final answer in message.content. Generate requests accept the same flag.
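As a rough illustration of the API-level flag, the sketch below builds a non-streaming chat request with think set and pulls the two fields apart. It uses only the standard library. The helper names (build_chat_payload, split_reasoning, ask) are hypothetical, and passing a level string instead of a boolean for GPT-OSS is an assumption based on the CLI behavior described above.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local address used in this post


def build_chat_payload(model, prompt, think=True):
    """Build a non-streaming /api/chat request body with the think flag set."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Boolean per the text above; a level string such as "low" for
        # GPT-OSS is an assumption, not something this post confirms.
        "think": think,
        "stream": False,
    }


def split_reasoning(response):
    """Return (thinking, content) from a parsed /api/chat response dict."""
    message = response.get("message", {})
    return message.get("thinking", ""), message.get("content", "")


def ask(model, prompt, think=True):
    """Send the request to a locally running Ollama server."""
    body = json.dumps(build_chat_payload(model, prompt, think)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return split_reasoning(json.load(reply))
```

With a server running, `thinking, answer = ask("deepseek-r1", "How many r's in strawberry?")` would return the reasoning and the final answer as separate strings.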
Tool Calling (Three Modes)
Ollama’s tool calling goes beyond single invocations, supporting three modes:
- Single — The model calls one tool; you execute it and feed the result back
- Parallel — The model calls multiple tools at once; you execute all of them and return results together
- Agent Loop — Multi-turn loop where the model decides when to call tools and when to stop
The Python SDK lets you pass function objects directly to the tools parameter, which are automatically parsed into schemas. JavaScript requires manually defining JSON Schemas.
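The agent loop mode can be sketched in a few lines. This is a minimal illustration rather than the SDK's implementation: chat_fn is a hypothetical stand-in for whatever sends the conversation to the model, and the tool_calls and tool_name message fields are assumptions about the message shape.

```python
import json


def run_agent_loop(chat_fn, messages, tools, max_turns=5):
    """Call the model until it answers without requesting a tool.

    chat_fn(messages) is a hypothetical callable returning a message dict;
    tools maps tool names to plain Python functions.
    """
    messages = list(messages)  # do not mutate the caller's history
    for _ in range(max_turns):
        reply = chat_fn(messages)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content", ""), messages  # the model chose to stop
        # One entry corresponds to the single mode, several to the parallel mode.
        for call in calls:
            name = call["function"]["name"]
            args = call["function"].get("arguments", {})
            if name in tools:
                result = tools[name](**args)
            else:
                result = f"unknown tool: {name}"
            messages.append(
                {"role": "tool", "tool_name": name, "content": json.dumps(result)}
            )
    raise RuntimeError("agent loop hit max_turns without a final answer")
```

The max_turns cap is a design choice of this sketch: without it, a model that keeps requesting tools would loop forever.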
Structured Output (JSON Schema)
Beyond format: "json", which returns arbitrary JSON, you can now pass a full JSON Schema to constrain the response format:
```python
from ollama import chat
from pydantic import BaseModel


class Country(BaseModel):
    name: str
    capital: str
    languages: list[str]


# Pass the JSON Schema derived from the Pydantic model as the response format
response = chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me about Taiwan'}],
    format=Country.model_json_schema(),
)
# Validate the model's JSON reply against the same schema
country = Country.model_validate_json(response.message.content)
```
On the JavaScript side, use Zod + zodToJsonSchema() for the same effect. Vision models also support structured output — you can use a schema to constrain the fields of image descriptions.
Web Search (Cloud Feature)
Note: This is not a local feature. It requires an Ollama account and API key (obtainable from ollama.com/settings/keys).
Two cloud APIs are provided:
- `POST https://ollama.com/api/web_search`: runs a search query and returns titles, URLs, and summaries
- `POST https://ollama.com/api/web_fetch`: fetches the full content of a specific URL
```python
import ollama

response = ollama.web_search("Ollama latest version")
```
Combined with models like Qwen 3, you can build a search agent: the model autonomously decides when to search, when to fetch, and when to answer. The official recommendation is to use models with 32K+ context for search agents. Integration with tools like Cline and Codex is also available via MCP Server.
Ollama Cloud
Ollama Cloud is a hosted service launched in September 2025:
- Pro: $20/month
- Max: $100/month
It suits those who don’t want to manage hardware but still want to use the Ollama ecosystem. Cloud models run at full context capacity. However, public documentation on rate limits, per-token billing, and enterprise SLAs is still lacking, and the service is at an early stage.
TUI Interactive Interface + AI Tool Launcher (0.18.3)
This is the biggest positioning shift. Starting from 0.18, typing ollama in the terminal with no arguments opens an interactive TUI menu:
```
Ollama 0.18.3

▸ Run a model
    Start an interactive chat with a model
  Launch Claude Code
    Anthropic's coding tool with subagents
  Launch Codex
    OpenAI's open-source coding agent
  Launch OpenClaw
    Personal AI with 100+ skills
  Launch Visual Studio Code
    Microsoft's open-source AI code editor
  Launch Cline (not installed)
    Install with: npm install -g cline

↑/↓ navigate • enter launch • → configure • esc quit
```
Ollama is no longer just a “local LLM runner” — it has become a unified entry point for AI development tools. The official documentation lists 18 integrated tools: Claude Code, Codex, Cline, OpenClaw, VS Code, JetBrains, Xcode, Zed, Roo Code, OpenCode, Droid, Pi, Goose, Marimo, n8n, NemoClaw, Onyx, and more. Tools that aren’t installed display installation commands.
This is a smart design move — Ollama is already the default choice for developers running local LLMs, and turning itself into an AI tool launcher means it’s competing for the entry point position in developer workflows.
Installation
macOS
```shell
# Or download .dmg from ollama.com/download
brew install ollama
```
Apple Silicon automatically enables Metal GPU acceleration with no extra configuration.
Linux
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
Automatically installs the binary and sets up a systemd service.
Windows
```shell
winget install Ollama.Ollama
```
Docker
```shell
# CPU
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# NVIDIA GPU
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# AMD GPU
docker run -d --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm
```
macOS Docker Desktop does not support GPU passthrough — running the Docker version falls back to CPU. It’s recommended to install natively on macOS.
Environment Variables and Advanced Configuration
Ollama’s behavior is almost entirely controlled by environment variables. These are the ones you’ll eventually need:
Core Settings
| Variable | Purpose | Default |
|---|---|---|
| `OLLAMA_HOST` | Bind address and port | 127.0.0.1:11434 |
| `OLLAMA_MODELS` | Model storage path | ~/.ollama/models (macOS), /usr/share/ollama/.ollama/models (Linux) |
| `OLLAMA_ORIGINS` | CORS allowed origins | 127.0.0.1, 0.0.0.0 |
| `OLLAMA_NO_CLOUD` | Disable cloud features | Not set |
| `HTTPS_PROXY` | Proxy for model downloads | Not set |
Performance Tuning
| Variable | Purpose | Default |
|---|---|---|
| `OLLAMA_CONTEXT_LENGTH` | Global context window size | Auto-determined by VRAM |
| `OLLAMA_NUM_PARALLEL` | Max parallel requests per model | 1 |
| `OLLAMA_MAX_LOADED_MODELS` | Models loaded in memory simultaneously | GPU count × 3 (or 3 in CPU mode) |
| `OLLAMA_MAX_QUEUE` | Request queue limit; returns 503 when exceeded | 512 |
| `OLLAMA_KEEP_ALIVE` | How long a model stays in memory after idle | 5m |
| `OLLAMA_FLASH_ATTENTION` | Enable Flash Attention (saves memory) | Not enabled |
| `OLLAMA_KV_CACHE_TYPE` | KV cache quantization type | f16 (options: q8_0 for half memory, q4_0 for quarter) |
OLLAMA_KEEP_ALIVE accepts several formats: duration strings such as "10m" or "24h", 0 (unload immediately after use), or a negative value (never unload). The memory cost of OLLAMA_NUM_PARALLEL scales with parallel count × context length, so setting it too high will exhaust memory.
Platform-Specific Configuration
```shell
# macOS: set via launchctl, then restart the app to take effect
launchctl setenv OLLAMA_CONTEXT_LENGTH 64000
# Linux: edit the systemd service
sudo systemctl edit ollama.service
# Add under the [Service] section:
# Environment="OLLAMA_CONTEXT_LENGTH=64000"
sudo systemctl daemon-reload && sudo systemctl restart ollama
# Windows: System Settings → Environment Variables, then restart the app
```
CLI Commands
Commands you’ll use day to day:
```shell
ollama serve                 # Start server (port 11434)
ollama run llama3.2          # Download + start interactive chat
ollama run llama3.2 "Explain the TCP three-way handshake"  # One-shot question
ollama pull qwen2.5:14b      # Download only, don't launch
ollama list                  # List downloaded models
ollama ps                    # See which models are in memory
ollama show llama3.2         # Model info (architecture, quantization, license)
ollama rm mistral            # Delete a model
ollama stop llama3.2         # Unload from memory
```
You can adjust parameters in real time during interactive mode:
```
>>> /set parameter temperature 0.8
>>> /set system "You are a senior backend engineer"
>>> /set think      # Enable reasoning mode
>>> /set nothink    # Disable reasoning mode
>>> /show info
>>> /bye
```
API
Ollama provides two API sets at localhost:11434.
Native API
```shell
# Multi-turn chat
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "What is RAG?"}],
  "stream": false
}'
# Generate embeddings
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Ollama is a local LLM platform"
}'
```
Other endpoints: /api/generate (text generation), /api/tags (list models), /api/pull (download models), /api/show (model info), /api/ps (running models).
Performance Metrics in API Responses
Responses from /api/chat and /api/generate include performance data, with durations in nanoseconds:
```json
{
  "total_duration": 5589157167,
  "load_duration": 3013701500,
  "prompt_eval_count": 46,
  "prompt_eval_duration": 1160282000,
  "eval_count": 113,
  "eval_duration": 1325948000
}
```
To calculate token generation speed: eval_count / eval_duration × 10^9 = tokens/sec. The example above gives 113 / 1.326 ≈ 85 tok/s. These fields also help locate bottlenecks: if load_duration dominates total_duration, the model is being unloaded and reloaded frequently, so consider increasing OLLAMA_KEEP_ALIVE.
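The arithmetic can be wrapped in a small helper. The sketch below is illustrative only: summarize_metrics is a hypothetical name, and the input is the sample response shown above. For that sample, loading accounts for roughly 54% of total_duration, which is the pattern that points toward a longer OLLAMA_KEEP_ALIVE.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def summarize_metrics(response):
    """Derive throughput figures from the count and duration fields of a response."""
    generation = response["eval_count"] / response["eval_duration"] * NANOSECONDS_PER_SECOND
    prompt = response["prompt_eval_count"] / response["prompt_eval_duration"] * NANOSECONDS_PER_SECOND
    return {
        "generation_tok_per_s": generation,  # output tokens per second
        "prompt_tok_per_s": prompt,          # prompt tokens processed per second
        "load_share": response["load_duration"] / response["total_duration"],
    }


sample = {
    "total_duration": 5589157167,
    "load_duration": 3013701500,
    "prompt_eval_count": 46,
    "prompt_eval_duration": 1160282000,
    "eval_count": 113,
    "eval_duration": 1325948000,
}
summary = summarize_metrics(sample)
print(f"generation: {summary['generation_tok_per_s']:.1f} tok/s")
print(f"load share: {summary['load_share']:.0%}")
```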
Advanced Parameters: The options Object
Both /api/chat and /api/generate support an options object that lets you override model parameters at the per-request level:
```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Write a poem"}],
  "stream": false,
  "options": {
    "temperature": 1.2,
    "top_p": 0.95,
    "num_ctx": 8192,
    "seed": 42,
    "repeat_penalty": 1.2
  },
  "keep_alive": "30m"
}'
```
seed combined with a fixed temperature produces reproducible output — useful for testing and debugging.
OpenAI-Compatible Endpoints
This is one of Ollama’s most practical features. Any code using the OpenAI SDK can switch to a local model just by changing base_url:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",  # Any string; Ollama doesn't validate
)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a binary search in Python"}],
)
print(response.choices[0].message.content)
```
Supported endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models. Streaming, function calling, and structured output are all supported.
Hardware Requirements
Memory Reference Table (4-bit Quantization)
| Model Size | Required RAM/VRAM |
|---|---|
| 7B | ~4-5 GB |
| 13B | ~8-9 GB |
| 30B | ~16-20 GB |
| 70B | ~40+ GB |
Reserve 2-3 GB for the OS. Higher-precision formats (Q8, FP16) increase memory requirements by roughly 2-4x compared with 4-bit.
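The table follows from simple arithmetic: parameter count times bits per weight, plus some headroom. The sketch below is a back-of-the-envelope estimate, not how Ollama computes anything. The 1.2 overhead factor is an assumption chosen so the 4-bit results land near the table, and real quantized files use slightly more than their nominal bit width.

```python
def estimate_model_memory_gb(params_billions, bits_per_weight=4, overhead=1.2):
    """Rough memory estimate: weight bytes times an assumed overhead factor."""
    # 1e9 parameters * bits / 8 bits per byte gives gigabytes directly
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead


for size in (7, 13, 30, 70):
    print(f"{size}B at 4-bit: ~{estimate_model_memory_gb(size):.1f} GB")
```

Passing bits_per_weight=8 or bits_per_weight=16 doubles or quadruples the 4-bit estimate, which is where the 2-4x figure comes from.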
GPU Support
NVIDIA (most complete): CUDA compute capability 5.0+. The RTX 4090 (24 GB) is the top consumer choice; the RTX 4060 (8 GB) is the budget option.
AMD: Supported on Linux via ROCm; experimental on Windows. The RX 7900 XTX (24 GB) works well, but some GPUs may need the HSA_OVERRIDE_GFX_VERSION environment variable.
Apple Silicon: Metal API is automatically enabled. Thanks to the unified memory architecture, the GPU draws on system RAM instead of a separate pool of VRAM, so M-series chips with 32 GB+ memory provide an excellent local LLM experience.
Context Length and VRAM Relationship
Ollama automatically determines context window size based on VRAM:
| Available VRAM | Default Context Length |
|---|---|
| < 24 GB | 4,000 tokens |
| 24-48 GB | 32,000 tokens |
| > 48 GB | 256,000 tokens |
For web search, agent tasks, and coding tool scenarios, the official recommendation is at least 64,000 tokens. Manual configuration:
```shell
# Global setting
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
# Per-request level (via options.num_ctx)
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [...],
  "options": {"num_ctx": 64000}
}'
```
Doubling context length doubles KV cache memory. Combining with OLLAMA_KV_CACHE_TYPE=q8_0 lets you run the same context at half the memory, at the cost of slightly reduced precision. q4_0 is even more efficient (quarter memory), but the quality impact is more noticeable.
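To see why context length and cache type matter so much, here is a rough KV cache estimate using the standard formula: keys and values for every layer, KV head, and token. Everything specific is an assumption: the layer and head counts describe a hypothetical model, and the bytes per element simply mirror the "half" and "quarter" figures above (real quantized formats carry a little extra for scale factors).

```python
# Approximate bytes per cached element, following the half/quarter description above
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 1.0, "q4_0": 0.5}


def kv_cache_gb(num_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16", parallel=1):
    """Approximate KV cache size in GB for a given context length."""
    # Factor 2 covers keys and values; parallel slots each need their own context
    elements = 2 * n_layers * n_kv_heads * head_dim * num_ctx * parallel
    return elements * BYTES_PER_ELEMENT[cache_type] / 1e9


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128
for ctx in (8192, 32000, 64000):
    f16 = kv_cache_gb(ctx, 32, 8, 128)
    q8 = kv_cache_gb(ctx, 32, 8, 128, cache_type="q8_0")
    print(f"num_ctx={ctx}: f16 ~{f16:.2f} GB, q8_0 ~{q8:.2f} GB")
```

The parallel argument reflects the earlier note that OLLAMA_NUM_PARALLEL multiplies the context memory.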
When VRAM Isn’t Enough
Ollama automatically spills some layers to system RAM. The upside is that it doesn’t crash; the downside is that inference can run 5-30x slower.
```shell
# Check GPU/CPU allocation status
ollama ps
# NAME        SIZE      PROCESSOR    CONTEXT
# llama3.2    4.9 GB    100% GPU     8192
```
PROCESSOR showing 100% GPU is the ideal state. If you see 50% GPU / 50% CPU, it means the model is partially running on CPU with significantly reduced performance. Solutions: use a smaller model, reduce context length, enable KV cache quantization, or upgrade hardware.
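The same check can be scripted against the /api/ps endpoint mentioned earlier. This is a sketch under assumptions: it expects each entry in the response's models list to carry name, size, and size_vram fields, which should be verified against the API reference for your version. The helper names are hypothetical.

```python
def gpu_share(model_entry):
    """Fraction of a loaded model resident in VRAM, from one /api/ps entry."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def find_spilled_models(ps_response, threshold=0.999):
    """Names of loaded models that are not (almost) fully on the GPU."""
    return [
        entry["name"]
        for entry in ps_response.get("models", [])
        if gpu_share(entry) < threshold
    ]


# Hand-written sample in the assumed response shape
sample = {
    "models": [
        {"name": "llama3.2", "size": 4900000000, "size_vram": 4900000000},
        {"name": "qwen3", "size": 9000000000, "size_vram": 4500000000},
    ]
}
print(find_spilled_models(sample))
```

Running this periodically is one way to notice when a larger num_ctx has pushed a model into a GPU/CPU split.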
Modelfile Customization
Modelfile is one of Ollama’s killer features. The syntax resembles Dockerfile:
```
FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
SYSTEM """You are a senior software engineer. Always include code examples in your answers and explain your reasoning step by step."""
```
```shell
# Create a custom model
ollama create my-code-assistant -f ./Modelfile
# Use it
ollama run my-code-assistant
# View any model's Modelfile
ollama show --modelfile llama3.2
```
All Directives
| Directive | Purpose |
|---|---|
| `FROM` | Base model (required). Can be a model name, local GGUF path, or safetensors directory |
| `SYSTEM` | System prompt, injected into the template’s {{ .System }} |
| `PARAMETER` | Inference parameters (see full list below) |
| `TEMPLATE` | Custom prompt template (Go template syntax, variables: {{ .System }}, {{ .Prompt }}, {{ .Response }}) |
| `ADAPTER` | Apply a LoRA adapter (safetensors directory or GGUF file) |
| `MESSAGE` | Pre-fill conversation history, specifying role (system/user/assistant) to guide model behavior |
| `LICENSE` | Declare license terms |
| `REQUIRES` | Specify minimum Ollama version (e.g., REQUIRES 0.14.0) |
Complete Parameter Table
| Parameter | Description | Default |
|---|---|---|
| `temperature` | Creativity; higher = more random | 0.8 |
| `num_ctx` | Context window size (tokens) | 2048 |
| `num_predict` | Max generated tokens (-1 = unlimited) | -1 |
| `top_k` | Limit candidate tokens; lower = more deterministic | 40 |
| `top_p` | Nucleus sampling threshold | 0.9 |
| `min_p` | Minimum probability threshold | 0.0 |
| `repeat_penalty` | Repetition penalty | 1.1 |
| `repeat_last_n` | Lookback window for repetition detection | 64 |
| `seed` | Random seed (with fixed temperature, enables reproducible output) | 0 |
| `stop` | Stop sequences (can be set multiple times) | none |
Advanced Example: Pre-filling Conversations with MESSAGE
```
FROM llama3.2
SYSTEM """You are a Taiwan labor law consultant, specializing in the Labor Standards Act. Answer in Traditional Chinese."""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
# Pre-fill few-shot examples with MESSAGE
MESSAGE user "How is overtime pay calculated?"
MESSAGE assistant "According to Article 24 of the Labor Standards Act, for extended working hours, the first 2 hours are paid at 1/3 above the regular hourly wage, and the next 2 hours at 2/3 above."
```
Modelfile lets you create different model configurations for different purposes — one specialized for coding, one for translation, one for RAG — without re-downloading model weights. You’re just applying different settings on top of the same base model.
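If you maintain several such variants, generating the Modelfiles from one place keeps them consistent. The sketch below is one possible approach, not an Ollama feature: render_modelfile and the variant names are hypothetical, and it only emits the FROM, PARAMETER, and SYSTEM directives described above.

```python
def render_modelfile(base, system, parameters):
    """Render Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {key} {value}" for key, value in parameters.items()]
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


# Hypothetical variants sharing the same base weights
variants = {
    "my-code-assistant": ("You are a senior software engineer.", {"temperature": 0.2}),
    "my-translator": ("Translate the user's text into English.", {"temperature": 0.3}),
}

for name, (system, parameters) in variants.items():
    print(f"# {name}")
    print(render_modelfile("llama3.2", system, parameters))
```

Each rendered text can be saved to a file and passed to `ollama create <name> -f <file>`. Since every variant starts from the same base, the weights are not downloaded again.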
Importing Custom Models
For models not in the Ollama library, there are three import methods:
GGUF Files (Most Common)
Community-quantized GGUF files on HuggingFace can be used directly:
```
# Modelfile
FROM ./my-model-q4_K_M.gguf
SYSTEM "Your system prompt"
```
```shell
ollama create my-model -f Modelfile
ollama run my-model
```
Safetensors (Full Models or Adapters)
Import a full model directly:
```
FROM /path/to/safetensors/directory
```
Or import a LoRA adapter (output from fine-tuning with tools like Unsloth or MLX):
```
FROM llama3.2
ADAPTER /path/to/adapter/directory
```
Important: The FROM base model must be exactly the same one you used when training the adapter, otherwise results will be unpredictable.
Quantization
You can quantize when importing FP16/FP32 models:
```shell
ollama create --quantize q4_K_M my-model -f Modelfile
```
Supported quantization types: q4_K_S, q4_K_M (recommended — good balance of quality and size), q8_0.
Sharing Models
```shell
ollama cp my-model myuser/my-model
ollama push myuser/my-model
# Others can then run: ollama run myuser/my-model
```
Comparison with Other Solutions
| | Ollama | llama.cpp | LM Studio | vLLM |
|---|---|---|---|---|
| Interface | CLI + REST API | Pure CLI | GUI + API | Server API |
| Installation | One command | Requires compilation | Installer | pip install |
| Open source | MIT | MIT | No (free) | Apache 2.0 |
| Best for | Developers, API integration | Maximum performance control | Beginners, GUI preference | Production high-throughput |
| GPU management | Automatic | Fully manual | GUI controls | Auto-optimized |
How to choose?
- Want fastest setup + API-driven development → Ollama
- Want a GUI you can point and click → LM Studio
- Need maximum performance control and customization → llama.cpp
- Need production deployment with high concurrency → vLLM
Both Ollama and LM Studio use llama.cpp under the hood. Ollama wins on automatic VRAM management and developer-friendly APIs; LM Studio wins on UI and model discovery experience.
Ecosystem
Ollama’s official documentation lists 18 integrated tools, and the community ecosystem is already quite mature:
| Category | Tools |
|---|---|
| Web UI | OpenWebUI (the most ChatGPT-like local interface) |
| AI Coding | Claude Code, Codex, Cline, Roo Code, OpenCode, Droid, Pi, Goose |
| IDE | VS Code, JetBrains, Xcode, Zed |
| Automation | n8n, Marimo |
| Personal Assistant | OpenClaw, NemoClaw, Onyx |
| RAG Frameworks | LangChain, LlamaIndex |
Common pairings: Ollama + OpenWebUI for a local chat interface, Ollama + LangChain for local RAG, Ollama + Claude Code/Codex for using local models as coding agents. The TUI launcher in version 0.18 makes switching between these tools even more seamless.
Limitations and Caveats
Not a Production Solution
Ollama is designed for local development and experimentation, not production deployment. There’s no built-in load balancing, horizontal scaling, or observability. Request queuing is silent: requests beyond capacity wait in the queue with no warning, so latency climbs unnoticed until the OLLAMA_MAX_QUEUE limit is reached and the server starts returning 503.
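Because the server offers no backpressure signal beyond the 503 listed for OLLAMA_MAX_QUEUE in the settings table, clients that push it hard usually need their own retry logic. Below is a minimal sketch of exponential backoff. The send callable is a hypothetical stand-in for your HTTP call, assumed to return a (status, body) pair, and the delays are arbitrary starting points.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry send() while it reports HTTP 503, doubling the delay each time.

    send is a hypothetical zero-argument callable returning (status, body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return status, body  # still 503 after the final attempt
```

The sleep parameter is injectable so the function can be exercised in tests without real waiting.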
Security Is a Major Concern
There’s no authentication by default. If you set OLLAMA_HOST to 0.0.0.0, the API is open to everyone. In January 2026, reports identified 175,000 exposed Ollama servers being exploited. Any non-localhost deployment must include a reverse proxy + authentication.
Model Quality Ceiling
Open-source models running locally still generally can’t match cloud APIs like Claude or GPT-4o on complex reasoning tasks. 7B models are suitable for simple tasks; 70B approaches cloud quality, but hardware requirements scale accordingly.
Other Limitations
- Limited control over quantization: pulled models use whatever quantization Ollama selects, and --quantize on import supports only the few types listed above
- Models are stored in a proprietary blob format, making cross-tool sharing inconvenient (unlike using GGUF directly)
- Inference only, no fine-tuning (though you can apply LoRA adapters)
- No built-in GUI — requires third-party frontends like OpenWebUI
- AMD GPU support is less mature than NVIDIA
Debugging and Troubleshooting
Where Are the Logs
```shell
# macOS
cat ~/.ollama/logs/server.log
# Linux (systemd)
journalctl -u ollama --no-pager --follow
# Docker
docker logs ollama
# Windows
# %LOCALAPPDATA%\Ollama (server log)
# %HOMEPATH%\.ollama (models and config)
```
To enable debug mode on Windows, set $env:OLLAMA_DEBUG="1" in PowerShell before launching Ollama.
GPU Not Detected
NVIDIA: Verify nvidia-smi runs. In Docker, test with docker run --gpus all ubuntu nvidia-smi. If the UVM driver isn’t loaded: sudo nvidia-modprobe -u. Advanced diagnostics: CUDA_ERROR_LEVEL=50.
AMD: The user must be in the video and render groups to access /dev/kfd. Docker containers need --group-add with the corresponding GID. ROCm versions below v6 may cause timeouts — upgrade to v7.
Common Issues
Model gradually slows down: Check the PROCESSOR column with ollama ps. If it changes from 100% GPU to a GPU/CPU mix, memory is insufficient and spilling has started. Reduce num_ctx or switch to a smaller model.
Overriding automatic library selection: to bypass auto-detection, force a specific LLM library, for example OLLAMA_LLM_LIBRARY="cpu_avx2" ollama serve. Among the CPU libraries, the priority order is cpu_avx2 > cpu_avx > cpu (the most compatible; it also works under macOS Rosetta).
GPU not working in Docker: Check /etc/docker/daemon.json and confirm it has "exec-opts": ["native.cgroupdriver=cgroupfs"].
/tmp mounted as noexec: Set OLLAMA_TMPDIR to point to another directory.
Installing a specific version:
```shell
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.5.7 sh
```
The Big Picture
Ollama’s core trade-off is clear: give up fine-grained control in exchange for developer experience. You lose llama.cpp’s low-level tuning and gain one-command model execution, an OpenAI-compatible API, and automatic GPU management.
From 2025 to 2026, Ollama’s positioning has expanded from “local LLM runner” to “unified entry point for AI developers.” Thinking mode, tool calling agent loops, structured output, web search, TUI launcher — these features combined have transformed it from a simple inference tool into a development platform.
Good use cases: local development and testing of LLM applications, cost-saving prototype development, privacy-sensitive offline usage, experimenting with RAG frameworks, importing custom fine-tuned models for inference.
Not-so-good use cases: high-concurrency production environments (use vLLM), maximum performance tuning needed (use llama.cpp), non-technical users (use LM Studio).
If you’re a developer who wants to run LLMs locally for development and testing, Ollama is currently the lowest-friction option.
References
- Ollama Official Documentation — Complete reference for Ollama CLI, API, and Modelfile
- Ollama GitHub Repository — Source code, issue tracker, and release notes
- llama.cpp GitHub Repository — Ollama’s underlying inference engine, a C++ implementation supporting GGUF format
- Searching for Best Practices in Retrieval-Augmented Generation — Wang et al. (2024), research on best practices for local LLMs with RAG
- OpenWebUI GitHub Repository — The most commonly paired open-source Web UI for Ollama
- vLLM Documentation — Official vLLM docs, for comparison as a production alternative to Ollama
- Meta Llama Official Page — Meta’s official Llama model licensing and technical details
- Ollama — Modelfile Documentation — Complete Modelfile syntax specification, including all directives and parameter descriptions