Technology Radar
This item was not updated in the last three editions of the Radar. If it appeared in one of the more recent editions, it is likely still relevant; if it is older, its relevance may have changed and our current assessment could differ. Unfortunately, we don't have the capacity to continuously revisit items from past Radar editions.
Trial

Groq is a cloud inference provider that runs open-source models at exceptional speeds using purpose-built Language Processing Units (LPUs) — often 4x faster than GPU-based services. It's the right choice when latency or throughput (tokens per second) is the bottleneck.

Buy vs Build

Groq is a pure buy: you call their API (OpenAI-compatible), pay per token, and don't manage any infrastructure. There is no self-hosted version of the LPU chip.

Why It's in Trial

Groq's LPU architecture is fundamentally different from GPU-based inference: while GPUs are optimised for parallel matrix math (training), LPUs are optimised for the sequential token generation pattern that defines LLM inference. The result:

  • 276 tokens/second for Llama 3.3 70B — the fastest available in benchmarks
  • Sub-100ms first-token latency for interactive applications
  • No cold start delays (models are pre-loaded and always hot)
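Because autoregressive decoding is serial — token N+1 can't be computed until token N exists — per-token latency adds up linearly, so the decode rate maps directly to wall-clock response time. A minimal back-of-the-envelope sketch (the 276 tok/s figure is Groq's benchmark above; the ~70 tok/s GPU baseline is an illustrative stand-in for the rough 4x gap, not a measured number):

```python
# Why tokens-per-second dominates response time: decoding is serial,
# so wall-clock time is simply token count divided by decode rate.

def response_time(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds to generate num_tokens at a fixed decode rate."""
    return num_tokens / tokens_per_second

# A 500-token answer at 276 tok/s vs an illustrative ~70 tok/s GPU service:
groq_time = response_time(500, 276)
gpu_time = response_time(500, 70)
print(f"{groq_time:.1f}s vs {gpu_time:.1f}s")  # 1.8s vs 7.1s
```

The same arithmetic explains why the speedup is invisible for a one-line answer but obvious for a long completion.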

When Groq Matters

Most AI applications aren't bottlenecked on inference speed — users are patient enough to wait 3-5 seconds. Groq becomes the right tool when:

  • Real-time voice/chat applications: Response latency is user-facing
  • High-throughput pipelines: Processing thousands of documents or code files quickly
  • Agentic loops: Multi-step agent workflows where each step calls the LLM — latency compounds
  • Developer experience: Faster models mean faster iteration during development
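The agentic-loop point is worth making concrete: every step in the loop blocks on an LLM call, so per-call latency multiplies by step count. A quick sketch with illustrative numbers (neither figure is a benchmark):

```python
# In an agent loop, each step waits on a model call, so per-call
# latency compounds linearly with the number of steps.

def agent_wall_time(steps: int, seconds_per_call: float) -> float:
    """Total time an agent spends waiting on the model across a run."""
    return steps * seconds_per_call

# A 12-step workflow at 3 s per call vs 0.5 s per call:
print(agent_wall_time(12, 3.0))  # 36.0 seconds of waiting
print(agent_wall_time(12, 0.5))  # 6.0 seconds of waiting
```

A 3-second call that feels fine in a chat UI turns into half a minute of dead time once an agent chains a dozen of them.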

Supported Models

Llama 4, Llama 3.3, Qwen, Gemma, DeepSeek, Mistral, and more. Groq doesn't serve Anthropic or OpenAI models — only open-weight models. For Claude or GPT-4, use the native APIs.

Getting Started

from groq import Groq

client = Groq()  # reads the GROQ_API_KEY environment variable
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain rate limiting"}],
)
print(response.choices[0].message.content)
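Since the API is OpenAI-compatible, it also supports streaming, which is how you benefit from the sub-100ms first-token latency in interactive UIs. The sketch below shows how to measure time-to-first-token (TTFT); a stand-in generator simulates the token stream so it runs offline, but the measurement pattern is the same when iterating real streamed chunks:

```python
import time

# Sketch: measuring time-to-first-token (TTFT). With a live client you
# would pass stream=True and iterate response chunks; fake_stream is a
# stand-in so this snippet runs without a network call or API key.

def fake_stream(tokens):
    for tok in tokens:
        time.sleep(0.01)  # simulated per-token network/decode delay
        yield tok

start = time.perf_counter()
first_token_at = None
text = []
for tok in fake_stream(["Rate ", "limiting ", "caps ", "request ", "volume."]):
    if first_token_at is None:
        first_token_at = time.perf_counter() - start  # TTFT
    text.append(tok)

print(f"TTFT: {first_token_at * 1000:.0f} ms")
print("".join(text))
```

For user-facing chat, TTFT usually matters more than total generation time: users tolerate a long answer as long as it starts appearing immediately.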

Key Characteristics

Property    Value
Speed       Up to 276 tokens/second (Llama 3.3 70B)
Models      Open-weight only (Llama, Qwen, Mistral, etc.)
API format  OpenAI-compatible
Free tier   Generous free tier with rate limits
Provider    Groq Inc.
Website     groq.com
Docs        console.groq.com/docs

Further Reading