Technology Radar
This item was not updated in the last three editions of the Radar. If it appeared in one of the more recent editions, it is likely still relevant; if it is older, its relevance may have changed and our current assessment could differ. Unfortunately, we don't have the capacity to continuously revisit items from past Radar editions.
Trial

Groq is a cloud inference provider that runs open-source models at exceptional speeds using purpose-built Language Processing Units (LPUs) — often 4x faster than GPU-based services. It's the right choice when latency or throughput (tokens per second) is the bottleneck.

Buy vs Build

Groq is a pure buy: you call their API (OpenAI-compatible), pay per token, and don't manage any infrastructure. There is no self-hosted version of the LPU chip.

Why It's in Trial

Groq's LPU architecture is fundamentally different from GPU-based inference: while GPUs are optimised for parallel matrix math (training), LPUs are optimised for the sequential token generation pattern that defines LLM inference. The result:

  • 276 tokens/second for Llama 3.3 70B — the fastest available in benchmarks
  • Sub-100ms first-token latency for interactive applications
  • No cold start delays (models are pre-loaded and always hot)
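Because autoregressive decoding is serial — token N+1 can't be computed until token N exists — per-token latency adds up linearly, so the decode rate maps directly to wall-clock response time. A minimal back-of-the-envelope sketch (the 276 tok/s figure is Groq's benchmark above; the ~70 tok/s GPU baseline is an illustrative stand-in for the rough 4x gap, not a measured number):

```python
# Why tokens-per-second dominates response time: decoding is serial,
# so wall-clock time is simply token count divided by decode rate.

def response_time(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds to generate num_tokens at a fixed decode rate."""
    return num_tokens / tokens_per_second

# A 500-token answer at 276 tok/s vs an illustrative ~70 tok/s GPU service:
groq_time = response_time(500, 276)
gpu_time = response_time(500, 70)
print(f"{groq_time:.1f}s vs {gpu_time:.1f}s")  # 1.8s vs 7.1s
```

The same arithmetic explains why the speedup is invisible for a one-line answer but obvious for a long completion.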

When Groq Matters

Most AI applications aren't bottlenecked on inference speed — users are patient enough to wait 3-5 seconds. Groq becomes the right tool when:

  • Real-time voice/chat applications: Response latency is user-facing
  • High-throughput pipelines: Processing thousands of documents or code files quickly
  • Agentic loops: Multi-step agent workflows where each step calls the LLM — latency compounds
  • Developer experience: Faster models mean faster iteration during development
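The agentic-loop point is worth making concrete: every step in the loop blocks on an LLM call, so per-call latency multiplies by step count. A quick sketch with illustrative numbers (neither figure is a benchmark):

```python
# In an agent loop, each step waits on a model call, so per-call
# latency compounds linearly with the number of steps.

def agent_wall_time(steps: int, seconds_per_call: float) -> float:
    """Total time an agent spends waiting on the model across a run."""
    return steps * seconds_per_call

# A 12-step workflow at 3 s per call vs 0.5 s per call:
print(agent_wall_time(12, 3.0))  # 36.0 seconds of waiting
print(agent_wall_time(12, 0.5))  # 6.0 seconds of waiting
```

A 3-second call that feels fine in a chat UI turns into half a minute of dead time once an agent chains a dozen of them.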

Supported Models

Llama 4, Llama 3.3, Qwen, Gemma, DeepSeek, Mistral, and more. Groq doesn't serve Anthropic or OpenAI models — only open-weight models. For Claude or GPT-4, use the native APIs.

Getting Started

from groq import Groq

client = Groq()  # reads the GROQ_API_KEY environment variable
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain rate limiting"}],
)
print(response.choices[0].message.content)
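Since the API is OpenAI-compatible, it also supports streaming, which is how you benefit from the sub-100ms first-token latency in interactive UIs. The sketch below shows how to measure time-to-first-token (TTFT); a stand-in generator simulates the token stream so it runs offline, but the measurement pattern is the same when iterating real streamed chunks:

```python
import time

# Sketch: measuring time-to-first-token (TTFT). With a live client you
# would pass stream=True and iterate response chunks; fake_stream is a
# stand-in so this snippet runs without a network call or API key.

def fake_stream(tokens):
    for tok in tokens:
        time.sleep(0.01)  # simulated per-token network/decode delay
        yield tok

start = time.perf_counter()
first_token_at = None
text = []
for tok in fake_stream(["Rate ", "limiting ", "caps ", "request ", "volume."]):
    if first_token_at is None:
        first_token_at = time.perf_counter() - start  # TTFT
    text.append(tok)

print(f"TTFT: {first_token_at * 1000:.0f} ms")
print("".join(text))
```

For user-facing chat, TTFT usually matters more than total generation time: users tolerate a long answer as long as it starts appearing immediately.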

Key Characteristics

Property    Value
Speed       Up to 276 tokens/second (Llama 3.3 70B)
Models      Open-weight only (Llama, Qwen, Mistral, etc.)
API format  OpenAI-compatible
Free tier   Generous free tier with rate limits
Provider    Groq Inc.
Website     groq.com
Docs        console.groq.com/docs

Further Reading