Assess
Kreuzberg is a polyglot document intelligence framework with a Rust core. It extracts text, metadata, and structured information from PDFs, Office documents, images, and other files across 88+ formats. It is designed for RAG pipelines and agent workflows, with async-first Python bindings, multiple OCR backends, and an MCP server mode.
Why It's in Assess
- RAG pipeline enabler: Agents that read documents (PDFs, DOCX, spreadsheets, images via OCR) need a text extraction layer. Kreuzberg fills this gap with a lightweight, local-first approach: no API calls and no cloud dependencies.
- MCP server mode: Can run as an MCP server, making document extraction directly available to MCP-capable agents — a natural fit for agentic workflows.
- Performance advantage: The Rust core delivers a claimed 10–50x speedup over Python-only alternatives. The installation footprint is about 71MB, versus 1GB+ for Docling.
- Broad format support: 88+ formats, including PDF, DOCX, XLSX, PPTX, EPUB, HTML, and images (OCR via Tesseract, EasyOCR, or PaddleOCR).
- Still maturing: The project is under active development (v3.3.0), but the ecosystem is young. Hence the Assess rating: evaluate it for your RAG pipeline, but note that its connection to coding agents specifically is indirect.
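As a rough illustration of how a local extraction layer might feed a RAG pipeline, the sketch below extracts several files concurrently with a bounded number of in-flight tasks. The `kreuzberg_extractor` wrapper assumes the package exposes an async `extract_file` function whose result has a `content` attribute; verify this against the version you install. `ExtractedDoc`, `extract_all`, and `max_concurrency` are hypothetical names used only for illustration.

```python
import asyncio
from dataclasses import dataclass
from pathlib import Path
from typing import Awaitable, Callable, Iterable, List

# An extractor is any async callable that maps a file path to plain text.
Extractor = Callable[[Path], Awaitable[str]]


@dataclass
class ExtractedDoc:
    """Hypothetical container pairing a source path with its extracted text."""
    path: str
    text: str


async def kreuzberg_extractor(path: Path) -> str:
    # Assumption: kreuzberg provides an async `extract_file` returning an
    # object with a `.content` string. Imported lazily so this module still
    # loads when the package is not installed.
    from kreuzberg import extract_file

    result = await extract_file(path)
    return result.content


async def extract_all(
    paths: Iterable[str],
    extractor: Extractor,
    max_concurrency: int = 4,
) -> List[ExtractedDoc]:
    """Run `extractor` over every path, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def extract_one(path: Path) -> ExtractedDoc:
        async with semaphore:
            return ExtractedDoc(path=str(path), text=await extractor(path))

    # gather() returns results in the same order as the input paths.
    return list(await asyncio.gather(*(extract_one(Path(p)) for p in paths)))
```

A call such as `asyncio.run(extract_all(["report.pdf", "notes.docx"], kreuzberg_extractor))` would then yield one `ExtractedDoc` per input file, ready for chunking and embedding. Passing the extractor as a parameter keeps the pipeline testable without any OCR or system dependencies installed.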
Key Capabilities
| Feature | Detail |
|---|---|
| Text extraction | PDF (native + OCR), Office (DOCX, XLSX, PPTX), EPUB, HTML, RTF, CSV, images |
| OCR backends | Tesseract, EasyOCR, PaddleOCR |
| Table extraction | Structured table output from PDFs and spreadsheets |
| Async/await | Non-blocking processing via anyio and worker processes |
| MCP server | Runs as a Model Context Protocol server for agent integration |
| Multi-language | Python, Rust, TypeScript, Ruby, Go, Java, C#, PHP, Elixir, R, C |
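For context on the MCP server row: MCP is built on JSON-RPC 2.0, and a client invokes a server-side tool by sending a `tools/call` request. The sketch below builds such a request in Python. The tool name `extract_document` and its `path` argument are hypothetical placeholders, not Kreuzberg's documented tool schema; a real client would discover the actual tool names and argument shapes via `tools/list`.

```python
import json
from itertools import count
from typing import Any, Dict

# Monotonically increasing JSON-RPC request ids, starting at 1.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and argument; check the server's `tools/list` output.
request = build_tool_call("extract_document", {"path": "report.pdf"})
print(json.dumps(request, indent=2))
```

In practice an MCP client library handles this framing and the transport, so agent code rarely constructs these messages by hand; the sketch only shows what travels over the wire.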
Key Characteristics
| Property | Value |
|---|---|
| Website | docs.kreuzberg.dev |
| GitHub | kreuzberg-dev/kreuzberg |
| PyPI | kreuzberg |
| Core language | Rust (with Python bindings) |
| License | MIT |
| Install size | ~71MB |
| Formats | 88+ |
| Dependencies | Tesseract OCR, Pandoc (system) |