Technology Radar
Assess

Kreuzberg is a polyglot document intelligence framework with a Rust core that extracts text, metadata, and structured information from 88+ formats, including PDFs, Office documents, and images. It is designed for RAG pipelines and agent workflows, with async-first Python bindings, multiple OCR backends, and an MCP server mode.

Why It's in Assess

  • RAG pipeline enabler: Agents that need to read documents — PDFs, DOCX, spreadsheets, images with OCR — need a text extraction layer. Kreuzberg fills this gap with a lightweight, local-first approach (no API calls, no cloud dependencies).
  • MCP server mode: Can run as an MCP server, making document extraction directly available to MCP-capable agents — a natural fit for agentic workflows.
  • Performance advantage: The Rust core delivers a 10–50x speed improvement over Python-only alternatives, and the installation footprint is about 71MB versus Docling's 1GB+.
  • Broad format support: 88+ formats, including PDF, DOCX, XLSX, PPTX, EPUB, HTML, and images (OCR via Tesseract, EasyOCR, or PaddleOCR backends).
  • Still maturing: Development is active (v3.3.0), but the ecosystem is young. Evaluate it for your RAG pipeline, keeping in mind that its connection to coding agents specifically is indirect.
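
To make the "text extraction layer" role concrete, here is a minimal sketch of where such a layer sits in a RAG ingestion step: documents go in, overlapping text chunks come out, ready for embedding. The extractor is passed in as a plain callable; `fake_extract`, `Chunk`, `chunk_text`, and `ingest` are hypothetical names used for illustration only and are not part of Kreuzberg's API. In a real evaluation, the callable would wrap whatever extraction function the library exposes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Chunk:
    source: str  # path or identifier of the originating document
    index: int   # position of the chunk within that document
    text: str


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def ingest(paths: Iterable[str], extract: Callable[[str], str]) -> List[Chunk]:
    """Run the extractor over each document and chunk the resulting text."""
    chunks: List[Chunk] = []
    for path in paths:
        text = extract(path)
        for i, piece in enumerate(chunk_text(text)):
            chunks.append(Chunk(source=path, index=i, text=piece))
    return chunks


def fake_extract(path: str) -> str:
    # Hypothetical stand-in for a real extraction call (assumption, not library API).
    return f"Contents of {path}. " * 30


if __name__ == "__main__":
    for chunk in ingest(["report.pdf", "notes.docx"], fake_extract):
        print(chunk.source, chunk.index, len(chunk.text))
```

Keeping the extractor behind a callable makes it straightforward to swap candidates (Kreuzberg, Docling, or a Python-only library) while comparing them during an assessment.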

Key Capabilities

Feature          | Detail
Text extraction  | PDF (native + OCR), Office (DOCX, XLSX, PPTX), EPUB, HTML, RTF, CSV, images
OCR backends     | Tesseract, EasyOCR, PaddleOCR
Table extraction | Structured table output from PDFs and spreadsheets
Async/await      | Non-blocking processing via anyio and worker processes
MCP server       | Runs as a Model Context Protocol server for agent integration
Multi-language   | Python, Rust, TypeScript, Ruby, Go, Java, C#, PHP, Elixir, R, C
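
The non-blocking processing model can be illustrated with a small sketch that extracts several documents concurrently while capping how many run at once. It uses the standard library's asyncio rather than anyio so that it runs without extra dependencies. `extract_all` and `fake_extract_async` are hypothetical names, and the fake extractor stands in for a real async extraction call (an assumption, not Kreuzberg's API).

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple


async def extract_all(
    paths: Iterable[str],
    extract: Callable[[str], Awaitable[str]],
    max_concurrent: int = 4,
) -> Dict[str, str]:
    """Extract every path concurrently, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(path: str) -> Tuple[str, str]:
        async with semaphore:  # wait for a free slot before extracting
            return path, await extract(path)

    # gather preserves input order, so results line up with paths.
    results = await asyncio.gather(*(worker(p) for p in paths))
    return dict(results)


async def fake_extract_async(path: str) -> str:
    # Hypothetical stand-in: simulates I/O-bound extraction work.
    await asyncio.sleep(0.01)
    return f"text from {path}"


if __name__ == "__main__":
    docs = ["a.pdf", "b.docx", "c.png"]
    texts = asyncio.run(extract_all(docs, fake_extract_async, max_concurrent=2))
    for name, text in texts.items():
        print(name, "->", text)
```

Bounding concurrency matters most when OCR is involved, since each in-flight document may occupy a worker process and a significant amount of memory.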

Key Characteristics

Property      | Value
Website       | docs.kreuzberg.dev
GitHub        | kreuzberg-dev/kreuzberg
PyPI          | kreuzberg
Core language | Rust (with Python bindings)
License       | MIT
Install size  | ~71MB
Formats       | 88+
Dependencies  | Tesseract OCR, Pandoc (system)
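
Because Tesseract OCR and Pandoc are listed as system dependencies, an evaluation environment can be checked up front. The sketch below looks for the `tesseract` and `pandoc` executables on the PATH using only the standard library. The helper names `find_system_tools` and `missing_tools` are hypothetical, and whether a given Kreuzberg version actually needs both tools is an assumption to verify against its documentation.

```python
import shutil
from typing import Dict, Iterable, List, Optional


def find_system_tools(
    names: Iterable[str] = ("tesseract", "pandoc"),
) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(found: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of tools that could not be located, sorted."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_system_tools()
    for name, path in found.items():
        print(f"{name}: {path or 'not found'}")
    missing = missing_tools(found)
    if missing:
        print("Install before evaluating:", ", ".join(missing))
```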