Kreuzberg

Mar 2026

Assess

Kreuzberg is a polyglot document intelligence framework with a Rust core. It extracts text, metadata, and structured information from PDFs, Office documents, images, and other files, covering 88+ formats in total. It is designed for RAG pipelines and agent workflows, with async-first Python bindings, multiple OCR backends, and an MCP server mode.

Why It's in Assess

RAG pipeline enabler: Agents that need to read documents — PDFs, DOCX, spreadsheets, images with OCR — need a text extraction layer. Kreuzberg fills this gap with a lightweight, local-first approach (no API calls, no cloud dependencies).
MCP server mode: Can run as an MCP server, making document extraction directly available to MCP-capable agents — a natural fit for agentic workflows.
Performance advantage: The Rust core is reported to run 10–50x faster than Python-only alternatives, and the installation footprint is about 71MB versus 1GB+ for Docling.
Broad format support: 88+ formats, including PDF, DOCX, XLSX, PPTX, EPUB, HTML, and images (OCR via Tesseract, EasyOCR, or PaddleOCR).
Still maturing: Development is active (v3.3.0), but the ecosystem is young. Evaluate it for your RAG pipeline, keeping in mind that its connection to coding agents specifically is indirect.
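
To make the "text extraction layer" role concrete, here is a minimal sketch of where such a layer sits in a RAG ingestion step: documents go in, overlapping text chunks come out, ready for embedding. It deliberately does not import Kreuzberg. The `extract` parameter is a hypothetical stand-in for whatever extraction call you evaluate (the assumption is only that it maps a file path to plain text), and `Chunk`, `chunk_text`, and `ingest` are illustrative names, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Chunk:
    source: str  # path of the document the text came from
    index: int   # position of the chunk within that document
    text: str


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> List[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


def ingest(
    paths: Iterable[str],
    extract: Callable[[str], str],  # hypothetical: path -> extracted plain text
    size: int = 800,
    overlap: int = 100,
) -> List[Chunk]:
    """Run the extraction layer over each path and chunk the resulting text."""
    chunks: List[Chunk] = []
    for path in paths:
        text = extract(path)
        for i, piece in enumerate(chunk_text(text, size, overlap)):
            chunks.append(Chunk(source=path, index=i, text=piece))
    return chunks
```

Passing the extractor in as a callable keeps the ingestion code independent of the extraction library, which makes it easy to compare Kreuzberg against an alternative during an assessment.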

Key Capabilities

Feature	Detail
Text extraction	PDF (native + OCR), Office (DOCX, XLSX, PPTX), EPUB, HTML, RTF, CSV, images
OCR backends	Tesseract, EasyOCR, PaddleOCR
Table extraction	Structured table output from PDFs and spreadsheets
Async/await	Non-blocking processing via anyio and worker processes
MCP server	Runs as a Model Context Protocol server for agent integration
Multi-language	Python, Rust, TypeScript, Ruby, Go, Java, C#, PHP, Elixir, R, C
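
The async/await row above matters most when many documents are processed at once. The following is a rough sketch of bounded-concurrency batch extraction, written against the standard library's asyncio rather than anyio to keep it dependency-free. The `extract` argument is a hypothetical stand-in for an async extraction function (assumed to take a path and return text), and `extract_many` is an illustrative name, not a Kreuzberg API.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Union


async def extract_many(
    paths: List[str],
    extract: Callable[[str], Awaitable[str]],  # hypothetical async extractor
    max_concurrency: int = 4,
) -> Dict[str, Union[str, Exception]]:
    """Extract all paths concurrently, at most `max_concurrency` at a time."""
    limiter = asyncio.Semaphore(max_concurrency)

    async def one(path: str) -> Union[str, Exception]:
        async with limiter:
            try:
                return await extract(path)
            except Exception as exc:
                # Return the error instead of raising so one bad file
                # does not abort the rest of the batch.
                return exc

    results = await asyncio.gather(*(one(p) for p in paths))
    return dict(zip(paths, results))
```

Callers can then separate successes from failures by checking `isinstance(value, Exception)` on each entry. The concurrency cap is worth tuning when OCR is involved, since OCR backends are typically CPU- and memory-heavy.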

Key Characteristics

Property	Value
Website	docs.kreuzberg.dev
GitHub	kreuzberg-dev/kreuzberg
PyPI	kreuzberg
Core language	Rust (with bindings for Python and other languages)
License	MIT
Install size	~71MB
Formats	88+
Dependencies	Tesseract OCR, Pandoc (system)