Kreuzberg

Polyglot Document Intelligence

About Kreuzberg

A polyglot document intelligence library with a Rust core and bindings for Python, TypeScript, Go, and more. Kreuzberg extracts text and structured data from documents like PDFs, images, and office files for data ingestion pipelines.

Key Features

1. Python library for extracting text from PDFs, images, Office documents, and HTML
2. Async-first API for non-blocking document processing in Python services
3. Uses Tesseract OCR for image-based text extraction
4. Returns structured text with metadata about the source document
5. Minimal configuration: sensible defaults for common document types

How Python Data Engineers Use Kreuzberg

Python data engineers use Kreuzberg to build document ingestion pipelines that extract text from uploaded PDFs, scanned images, and Office files. The async API fits FastAPI-based document processing services: an endpoint accepts a file upload, Kreuzberg extracts the text asynchronously, and the pipeline stores the result in a search index or warehouse for downstream analysis.
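The upload, extract, and store flow described above can be sketched roughly as follows. This is a minimal illustration, not Kreuzberg's actual API: `Extracted`, `ingest`, and `fake_extract` are hypothetical names, the in-memory `index` dict stands in for a search index or warehouse, and the extractor is injected so the example runs with only the standard library.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable


@dataclass
class Extracted:
    """Hypothetical result shape: extracted text plus source metadata."""
    content: str
    metadata: dict = field(default_factory=dict)


# Any async callable taking (raw bytes, MIME type) and returning Extracted.
Extractor = Callable[[bytes, str], Awaitable[Extracted]]


async def ingest(data: bytes, mime_type: str, doc_id: str,
                 extract: Extractor, index: dict) -> Extracted:
    """Extract text from one upload and store it under doc_id."""
    result = await extract(data, mime_type)  # awaiting keeps the event loop free
    index[doc_id] = {"text": result.content, "metadata": result.metadata}
    return result


async def fake_extract(data: bytes, mime_type: str) -> Extracted:
    """Stand-in extractor (assumption): decodes bytes instead of parsing a document."""
    await asyncio.sleep(0)  # yield control, as a real async extractor would
    return Extracted(data.decode("utf-8"), {"mime_type": mime_type})


async def main() -> dict:
    index: dict = {}  # stand-in for a search index or warehouse table
    uploads = {"a": b"first document", "b": b"second document"}
    # Process all uploads concurrently on the same event loop.
    await asyncio.gather(*(
        ingest(raw, "text/plain", doc_id, fake_extract, index)
        for doc_id, raw in uploads.items()
    ))
    return index


index = asyncio.run(main())
```

In a FastAPI service, `ingest` would be awaited inside the upload endpoint with the bytes read from the uploaded file, and `fake_extract` would be replaced by a thin wrapper around Kreuzberg's async extraction call. The exact function name and result fields should be taken from the Kreuzberg documentation rather than from this sketch.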

Frequently Asked Questions

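What is Kreuzberg used for?

Kreuzberg is used to extract text and structured data from documents such as PDFs, images, and office files for data ingestion pipelines.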

Is Kreuzberg free to use?

Yes, Kreuzberg is free to use.

What category does Kreuzberg belong to?

Kreuzberg is listed under the Data Ingestion category on Python Data Engineering.

Similar Data Ingestion Tools

3 tools

Tool	Description	Pricing	Rating
AWS Data Wrangler	AWS Data Utility Belt for Python	Free	★ 4.3
CsvPath Framework (new)	Delimited Data Preboarding	Free	★ 3.7
Mage.AI (new)	Data Pipeline Tool	Freemium	★ 4.6