Groq

Jun 2026

Trial

Groq is a cloud inference provider that runs open-weight models on purpose-built Language Processing Units (LPUs), often about 4x faster than GPU-based services. It's the right choice when latency or tokens per second is the bottleneck.

Buy vs Build

Groq is a pure buy: you call their API (OpenAI-compatible), pay per token, and don't manage any infrastructure. There is no self-hosted version of the LPU chip.
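
Because the API is OpenAI-compatible, a request has the same shape as an OpenAI chat completion call. The sketch below builds such a request with the standard library only and does not send it. The base URL is Groq's OpenAI-compatible endpoint as we understand it from the docs; the helper name `build_chat_request` and the placeholder key are illustrative assumptions, not part of any SDK.

```python
import json
import os
import urllib.request

# Groq's OpenAI-compatible base URL; confirm the current value in console.groq.com/docs.
GROQ_BASE_URL = "https://api.groq.com/openai/v1"


def build_chat_request(prompt, model="llama-3.3-70b-versatile", api_key=None):
    """Build an OpenAI-style chat completion request (hypothetical helper)."""
    api_key = api_key or os.environ.get("GROQ_API_KEY", "")
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{GROQ_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Build the request with a placeholder key; nothing is sent over the network here.
request = build_chat_request("Explain rate limiting", api_key="placeholder-key")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

In practice the same compatibility means an existing OpenAI client can usually be pointed at this base URL instead of hand-building requests.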

Why It's in Trial

Groq's LPU architecture is fundamentally different from GPU-based inference: while GPUs are optimised for parallel matrix math (training), LPUs are optimised for the sequential token generation pattern that defines LLM inference. The result:

276 tokens/second for Llama 3.3 70B, the fastest reported in benchmarks at the time of writing
Sub-100ms first-token latency for interactive applications
No cold start delays (models are pre-loaded and always hot)
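
To put those figures in user-facing terms, a rough estimate of wall-clock response time is the first-token latency plus the output length divided by throughput. The sketch below uses the throughput quoted above and 0.1 s as an upper bound for the sub-100ms first-token latency; the 500-token response length and the function name are assumptions for illustration.

```python
def estimated_response_seconds(output_tokens, tokens_per_second, first_token_seconds):
    """Rough wall-clock estimate: wait for the first token, then stream the rest."""
    return first_token_seconds + output_tokens / tokens_per_second


# 276 tokens/s and 0.1 s first-token latency come from the figures above;
# the 500-token response length is an assumed example value.
seconds = estimated_response_seconds(
    output_tokens=500, tokens_per_second=276, first_token_seconds=0.1
)
print(f"{seconds:.2f} s")  # prints "1.91 s"
```

This ignores network round trips and queueing, so real responses will take somewhat longer.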

When Groq Matters

Most AI applications aren't bottlenecked on inference speed — users are patient enough to wait 3-5 seconds. Groq becomes the right tool when:

Real-time voice/chat applications: Response latency is user-facing
High-throughput pipelines: Processing thousands of documents or code files quickly
Agentic loops: Multi-step agent workflows where each step calls the LLM, so per-call latency compounds across steps
Developer experience: Faster models mean faster iteration during development
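
The compounding effect in agentic loops is simple arithmetic: sequential calls add their latencies. A minimal sketch, assuming a 10-step agent, a 4-second call (the midpoint of the 3-5 second wait mentioned above), and a hypothetical 1-second call for comparison; none of these are measured Groq figures.

```python
def loop_latency_seconds(steps, seconds_per_call):
    """Total LLM wait time for an agent that makes one sequential call per step."""
    return steps * seconds_per_call


STEPS = 10        # assumed number of sequential agent steps
SLOW_CALL = 4.0   # midpoint of the 3-5 second wait mentioned above
FAST_CALL = 1.0   # hypothetical faster per-call latency, not a measured figure

slow_total = loop_latency_seconds(STEPS, SLOW_CALL)
fast_total = loop_latency_seconds(STEPS, FAST_CALL)
print(f"slow: {slow_total:.0f} s, fast: {fast_total:.0f} s")  # slow: 40 s, fast: 10 s
```

A wait that is tolerable for a single chat reply becomes most of a minute once it is repeated at every step.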

Supported Models

Llama 4, Llama 3.3, Qwen, Gemma, DeepSeek, Mistral, and more. Groq serves only open-weight models, so proprietary models such as Claude or GPT-4 are not available. For those, use the providers' native APIs.

Getting Started

from groq import Groq

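# The client reads the API key from the GROQ_API_KEY environment variable
client = Groq()
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain rate limiting"}],
)

# Print the generated reply
print(response.choices[0].message.content)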
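
For latency-sensitive uses, streaming the reply is usually what makes the speed visible to users. The sketch below assumes the SDK mirrors the OpenAI client, so that `stream=True` yields chunks with text at `chunk.choices[0].delta.content`; check the docs before relying on this. `read_stream`, `open_groq_stream`, and `fake_chunk` are hypothetical helper names, and the demo runs offline against stand-in chunks.

```python
import time
from types import SimpleNamespace


def read_stream(open_stream):
    """Consume a chat stream; return (full_text, seconds_until_first_text).

    `open_stream` is a zero-argument callable that issues the request, so the
    timer also covers the time spent waiting for the stream to open.
    """
    start = time.perf_counter()
    first_text_seconds = None
    parts = []
    for chunk in open_stream():
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text:
            if first_text_seconds is None:
                first_text_seconds = time.perf_counter() - start
            parts.append(text)
    return "".join(parts), first_text_seconds


def open_groq_stream():
    """Issue a streaming request; needs the groq package and GROQ_API_KEY."""
    from groq import Groq  # imported lazily so read_stream works without the SDK

    return Groq().chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Explain rate limiting"}],
        stream=True,
    )


def fake_chunk(text):
    """Stand-in chunk with the same attribute path as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Offline demo with stand-in chunks; pass open_groq_stream instead for a real call.
demo_chunks = [fake_chunk("Hello"), fake_chunk(None), fake_chunk(" world")]
text, first_text_seconds = read_stream(lambda: iter(demo_chunks))
print(text)  # prints "Hello world"
```

Calling `read_stream(open_groq_stream)` would run the same loop against the live API and give a rough time-to-first-token measurement from the client's point of view.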

Key Characteristics

Property	Value
License	Proprietary SaaS
Models	Open-weight only (Llama, Qwen, Mistral, etc.)
Pricing	Free tier / Pay-per-token
Provider	Groq Inc.
Website	groq.com
Docs	console.groq.com/docs