Groq
Groq is a cloud inference provider that runs open-weight models at exceptional speed on purpose-built Language Processing Units (LPUs) — often 4x faster than GPU-based services. It's the right choice when latency or tokens-per-second is the bottleneck.
Buy vs Build
Groq is a pure buy: you call their API (OpenAI-compatible), pay per token, and don't manage any infrastructure. There is no self-hosted version of the LPU chip.
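Because the API is OpenAI-compatible, a request is just an OpenAI-format POST to Groq's endpoint. A minimal sketch of what that request looks like (the base URL below is Groq's published compatibility endpoint; treat the exact path as an assumption, and `build_chat_request` is a hypothetical helper for illustration):

```python
import json
import os

# Groq's OpenAI-compatible base URL (assumption based on its published docs).
GROQ_BASE_URL = "https://api.groq.com/openai/v1"

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-format chat-completion request for Groq's endpoint."""
    return {
        "url": f"{GROQ_BASE_URL}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {os.environ.get('GROQ_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

req = build_chat_request("llama-3.3-70b-versatile", "ping")
```

Any OpenAI client library can therefore talk to Groq by overriding its base URL, which is what "pure buy" means in practice: the integration surface is an API you already know.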
Why It's in Trial
Groq's LPU architecture is fundamentally different from GPU-based inference: while GPUs are optimised for parallel matrix math (training), LPUs are optimised for the sequential token generation pattern that defines LLM inference. The result:
- 276 tokens/second for Llama 3.3 70B — among the fastest results in independent inference benchmarks
- Sub-100ms first-token latency for interactive applications
- No cold start delays (models are pre-loaded and always hot)
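To put those numbers in wall-clock terms, here is a rough estimate of end-to-end response time for a typical answer. The 70 tokens/second GPU baseline is an illustrative assumption, not a measured figure:

```python
def response_time(n_tokens: int, tokens_per_second: float,
                  first_token_latency: float = 0.1) -> float:
    """Estimate wall-clock seconds for a streamed response:
    time to first token plus generation time for the tokens."""
    return first_token_latency + n_tokens / tokens_per_second

# A 500-token answer at the quoted 276 tok/s vs an illustrative 70 tok/s GPU service:
groq_s = response_time(500, 276)  # roughly 1.9 s
gpu_s = response_time(500, 70)    # roughly 7.2 s
```

The gap is the difference between a response that feels instant and one the user visibly waits for.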
When Groq Matters
Most AI applications aren't bottlenecked on inference speed — users are patient enough to wait 3-5 seconds. Groq becomes the right tool when:
- Real-time voice/chat applications: Response latency is user-facing
- High-throughput pipelines: Processing thousands of documents or code files quickly
- Agentic loops: Multi-step agent workflows where each step calls the LLM — latency compounds
- Developer experience: Faster models mean faster iteration during development
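The compounding effect in agentic loops is easy to quantify: a sequential agent multiplies per-call latency by the number of steps. A sketch with illustrative numbers:

```python
def agent_wall_clock(steps: int, llm_latency_s: float,
                     tool_latency_s: float = 0.0) -> float:
    """Total wall-clock time for a sequential agent loop where every
    step makes one LLM call and optionally one tool call."""
    return steps * (llm_latency_s + tool_latency_s)

# A ten-step agent: 3 s per model call vs 0.4 s per model call.
slow = agent_wall_clock(10, 3.0)  # 30.0 s (a noticeable wait)
fast = agent_wall_clock(10, 0.4)  # 4.0 s (feels interactive)
```

A speedup that is merely nice for a single chat turn becomes the difference between usable and unusable once ten calls are chained.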
Supported Models
Llama 4, Llama 3.3, Qwen, Gemma, DeepSeek, Mistral, and more. Groq doesn't serve Anthropic or OpenAI models — only open-weight models. For Claude or GPT-4, use the native APIs.
Getting Started
```python
from groq import Groq

client = Groq()  # reads the GROQ_API_KEY environment variable

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain rate limiting"}],
)
print(response.choices[0].message.content)
```
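Groq's speed is easiest to see with streaming. The helper below measures time-to-first-chunk and chunks per second (a proxy for tokens per second when the server streams roughly one token per chunk) from any iterator in the OpenAI delta format, which is the shape `stream=True` yields from the SDK. The simulated stream at the bottom is a stand-in for a real API call, so this sketch runs without a key:

```python
import time
from types import SimpleNamespace

def measure_stream(chunks):
    """Consume an OpenAI-style streamed response and return
    (first_chunk_latency_s, chunks_per_second, full_text)."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:
            if first is None:
                first = time.perf_counter() - start
            parts.append(delta)
    elapsed = time.perf_counter() - start
    rate = len(parts) / elapsed if elapsed > 0 else 0.0
    return first, rate, "".join(parts)

def fake_chunk(text):
    """Stand-in for an SDK streaming chunk (real use: pass the iterator
    returned by client.chat.completions.create(..., stream=True))."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

words = ["Rate ", "limiting ", "caps ", "request ", "throughput."]
latency, rate, text = measure_stream(fake_chunk(w) for w in words)
```

Against the real API, the same `measure_stream` call gives a quick sanity check of the sub-100ms first-token and tokens-per-second claims on your own workload.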
Key Characteristics
| Property | Value |
|---|---|
| Speed | Up to 276 tokens/second (Llama 3.3 70B) |
| Models | Open-weight only (Llama, Qwen, Mistral, etc.) |
| API format | OpenAI-compatible |
| Free tier | Generous free tier with rate limits |
| Provider | Groq Inc. |
| Website | groq.com |
| Docs | console.groq.com/docs |