Data Ingestion
Database to Data Lake ETL
★ 3.5
Open-Source Change Data Capture Platform
★ 4.7
db2lake: pip install db2lake
Debezium: N/A (Java-based Kafka connector)
Python data engineers use db2lake to bootstrap data lake migration projects: it extracts historical data from relational databases and writes it as partitioned Parquet files to S3 or HDFS. Once the initial migration is done, incremental extractions keep the lake in sync, and PySpark or DuckDB pipelines take over for ongoing processing.
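The full-load-then-incremental pattern described above can be sketched with the standard library alone. This is not db2lake's API; the function names (`extract_since`, `partition_by_date`), the `orders` table, and the `updated_at` watermark column are all hypothetical, and an in-memory SQLite database stands in for the source system. A real pipeline would write each partition as a Parquet file rather than keep it in a dict.

```python
import sqlite3
from collections import defaultdict


def extract_since(conn, table, watermark_col, last_seen):
    """Return rows whose watermark column is strictly greater than last_seen."""
    # Table and column names are interpolated for brevity; they are trusted
    # constants in this sketch, not user input.
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE {watermark_col} > ? ORDER BY {watermark_col}",
        (last_seen,),
    )
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]


def partition_by_date(rows, ts_col):
    """Group rows under Hive-style partition keys such as 'dt=2024-01-01'."""
    parts = defaultdict(list)
    for row in rows:
        parts[f"dt={row[ts_col][:10]}"].append(row)
    return dict(parts)


# Hypothetical source table standing in for a relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2024-01-01T10:00:00"),
        (2, "2024-01-01T12:00:00"),
        (3, "2024-01-02T09:00:00"),
    ],
)

# Initial historical load: an empty watermark matches every row.
full = extract_since(conn, "orders", "updated_at", "")
watermark = max(r["updated_at"] for r in full)

# A new row arrives; the incremental run picks up only rows past the watermark.
conn.execute("INSERT INTO orders VALUES (4, '2024-01-03T08:00:00')")
delta = extract_since(conn, "orders", "updated_at", watermark)

partitions = partition_by_date(full + delta, "updated_at")
```

ISO 8601 timestamps sort lexicographically, which is why a plain string comparison works as the watermark here. A strictly-greater-than watermark can miss rows that share the last seen timestamp, so production jobs often re-read a small overlap window and deduplicate by primary key.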
Python data engineers typically run Debezium as the CDC producer and write Python consumers for the change streams it generates. After deploying Debezium connectors via Docker Compose or Kubernetes, Python services consume CDC events from Kafka topics using confluent-kafka or kafka-python. Each event carries the before and after row images for a database change, which the consumer writes as Parquet to S3 or applies as upserts to a data warehouse. For teams without Kafka, Debezium Server can sink directly to AWS Kinesis or Redis Streams, both of which have mature Python client libraries (boto3, redis-py), keeping the Python integration straightforward.
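As a rough illustration of the consumer side, the sketch below maps one Debezium change-event value to an upsert or delete action. It assumes the JSON converter and the standard Debezium envelope fields (`before`, `after`, `op`); the function name `to_action` and the sample row are hypothetical, and the Kafka polling loop is left out so the example runs without a broker.

```python
import json


def to_action(raw_value):
    """Map one Debezium change-event value to (action, row).

    Returns None for tombstones (null values) and for operations this
    sketch does not handle.
    """
    if raw_value is None:
        return None  # tombstone record emitted after a delete
    event = json.loads(raw_value)
    if event is None:
        return None
    # With schemas enabled the envelope sits under "payload";
    # with schemas disabled the envelope is the top-level object.
    payload = event.get("payload", event)
    op = payload["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        return ("upsert", payload["after"])
    if op == "d":  # delete: only the before image is populated
        return ("delete", payload["before"])
    return None


# Hypothetical update event for a row in an "orders" table.
sample_update = json.dumps(
    {
        "payload": {
            "op": "u",
            "before": {"id": 1, "status": "new"},
            "after": {"id": 1, "status": "paid"},
        }
    }
)
sample_delete = json.dumps(
    {"op": "d", "before": {"id": 1, "status": "paid"}, "after": None}
)

update_action = to_action(sample_update)
delete_action = to_action(sample_delete)
```

In a confluent-kafka consumer, `raw_value` would typically be the result of `msg.value()` on a message returned by `Consumer.poll()`; `json.loads` accepts the bytes directly. Whether the before image is fully populated depends on the source database configuration, so a warehouse upsert should key on the primary key rather than on the complete before row.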
Individual Tool Pages