Technology Radar

Ollama

inference
This item has not been updated in the last three editions of the Radar. If it appeared in one of the more recent editions, it is likely still pertinent; if it dates back further, it may no longer be relevant and our current assessment could differ. Unfortunately, we have limited capacity to revisit items from past Radar editions.
Adopt

Ollama is the standard tool for running large language models locally on your own hardware — a single command downloads a model and starts an OpenAI-compatible API server. It's the fastest path from "I want a local model" to "I have a working local model."

Buy vs Build

Ollama is a "build" tool (you run it yourself), but it abstracts away most of the complexity, so the effort is closer to "buy": ollama pull llama3.3 downloads and configures everything. There is no commercial hosted version.

Why It's in Adopt

Ollama is the de facto standard for local model development in 2026:

  • One command to run any model: ollama run llama3.3 downloads and starts chatting
  • Always-on API server: ollama serve runs a local OpenAI-compatible API at http://localhost:11434 — drop-in replacement for the OpenAI API in development
  • Tool calling: Full support for function/tool calling in supported models (Llama 3.1+, Mistral, Qwen 2.5)
  • MCP integration: Works with Model Context Protocol tools, enabling agentic workflows on local models
  • Cross-platform: Mac (Apple Silicon optimised), Linux, Windows
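Tool calling in Ollama's chat API follows the OpenAI function-calling schema. A minimal sketch of a tool definition; the weather function and its parameters are purely illustrative:

```python
def make_tool(name: str, description: str, parameters: dict) -> dict:
    """Wrap a JSON-Schema parameter spec in the OpenAI-style tool format
    accepted by Ollama's chat API for models that support tool calling."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": parameters,
        },
    }

# Illustrative tool: lets the model request a get_weather(city) call.
weather_tool = make_tool(
    "get_weather",
    "Look up the current weather for a city",
    {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
)
```

A list of such definitions goes in the request's tools field; when the model wants a function invoked, its reply contains tool_calls entries for your code to execute.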

Why Engineering Managers Care

Cost control during development: developers burning OpenAI credits by running tests against real APIs gets expensive. Ollama lets developers use local models for the bulk of development work, where cloud-level quality isn't needed, reserving cloud credits for production testing.
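One way to realise this split is to make the API base URL and model configurable, so the same client code targets local Ollama by default and a cloud provider in CI or production. A sketch; the environment variable names are assumptions, not an Ollama convention:

```python
import os

def resolve_llm_backend() -> dict:
    """Pick the LLM backend for the current environment.

    LLM_BASE_URL and LLM_MODEL are hypothetical env vars: unset, they
    default to a local Ollama server and a local model; set, they point
    the same client code at a cloud provider.
    """
    base_url = os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1")
    model = os.environ.get("LLM_MODEL", "llama3.3")
    return {"base_url": base_url, "model": model}
```

Since Ollama's OpenAI-compatible endpoint ignores the API key, OpenAI-style clients can typically be pointed at it with just a base-URL change and a placeholder key.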

Data privacy: Source code, proprietary documents, and customer data never leave your network. Relevant for regulated industries or when working with sensitive IP.

Offline capability: Agents and tools work without internet access.

Performance on Apple Silicon

On M-series Macs with 64GB of RAM, Ollama can run a quantised Llama 3.3 70B at roughly 15-25 tokens/second, fast enough for interactive use; 8B and 14B models typically exceed 60 tokens/second. Exact throughput depends on the chip generation and quantisation level.
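Throughput figures like these are easy to check on your own hardware: Ollama's native /api/generate response reports eval_count (tokens generated) and eval_duration (in nanoseconds), from which tokens/second follows directly. A small sketch with illustrative numbers:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Compute generation throughput from the eval_count and eval_duration
    fields in an Ollama /api/generate response (duration is nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)

# Illustrative numbers: 480 tokens generated in 8 seconds -> 60 tokens/second,
# in the range quoted above for 8B models on Apple Silicon.
print(tokens_per_second(480, 8_000_000_000))
```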

Getting Started

# Install (macOS)
brew install ollama

# Download and run a model
ollama run llama3.3

# Or use the API from code
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3", "messages": [{"role": "user", "content": "Hello"}]}'
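The same request can be made from Python with only the standard library; a minimal sketch mirroring the curl call above (the model name and prompt are just examples):

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending the request requires a running Ollama server:
# with urllib.request.urlopen(build_chat_request("llama3.3", "Hello")) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```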

Key Characteristics

Property      Value
Platforms     macOS, Linux, Windows
API format    OpenAI-compatible
Model format  GGUF (quantised models)
License       MIT
Provider      Ollama Inc.