Data Lake Tools & Datasets for Python Data Engineering

Discover 7 tools tagged with Data Lake for Python data engineering.

Data lake tools manage the storage, cataloguing, versioning, and governance of large repositories of raw data in cloud object stores such as Amazon S3 and Google Cloud Storage. Python data engineers use open table formats and frameworks such as Apache Iceberg, Delta Lake, and Apache Hudi to add ACID transactions, schema evolution, and time-travel queries to data lake tables.
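
To make atomic commits, schema evolution, and time travel concrete, here is a minimal in-memory sketch in plain Python. It is a toy model only: `ToyTable`, `commit`, and `read` are hypothetical names invented for this illustration and are not the API of Iceberg, Delta Lake, or Hudi, which track snapshots through metadata files in object storage.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for a lake table: an append-only list of snapshots."""
    snapshots: list = field(default_factory=list)

    def commit(self, rows):
        # A commit publishes a complete new snapshot in one step, so a reader
        # sees either the previous version or the new one, never a partial write.
        self.snapshots.append([dict(r) for r in rows])
        return len(self.snapshots) - 1  # snapshot id

    def read(self, snapshot_id=None, columns=None):
        # Time travel: read any earlier snapshot by id; the default is the latest.
        rows = self.snapshots[-1 if snapshot_id is None else snapshot_id]
        if columns is None:
            return [dict(r) for r in rows]
        # Schema evolution: columns missing from older rows read back as None.
        return [{c: r.get(c) for c in columns} for r in rows]


table = ToyTable()
v0 = table.commit([{"id": 1, "city": "Oslo"}])
v1 = table.commit([
    {"id": 1, "city": "Oslo", "country": "NO"},  # new "country" column
    {"id": 2, "city": "Lima", "country": "PE"},
])

latest = table.read()
as_of_v0 = table.read(snapshot_id=v0, columns=["id", "city", "country"])
```

Reading `as_of_v0` returns the table as it was after the first commit, with the later-added `country` column filled with `None`. Real table formats store far more per snapshot (file manifests, statistics, schema ids), so treat this only as a mental model.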

Tools (7)

Featured

Apache Hudi

Incremental Data Processing Framework

An open-source framework for managing storage for real-time data processing on top of data lakes. Hudi provides record-level insert, update, and delete capabilities along with change streams, enabling incremental data pipelines on large-scale datasets.

Free

◆4.4
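
The record-level upserts and incremental reads described above can be sketched in plain Python. This is a toy model under stated assumptions, not Hudi's API: `ToyUpsertTable` and its methods are hypothetical names, and commit times are simple integers rather than Hudi's timeline instants.

```python
class ToyUpsertTable:
    """Hypothetical keyed table that stamps each record with its commit time."""

    def __init__(self, key="id"):
        self.key = key
        self.records = {}      # record key -> (commit_time, row)
        self.commit_time = 0

    def upsert(self, rows):
        # Insert new keys and overwrite existing ones in a single commit.
        self.commit_time += 1
        for row in rows:
            self.records[row[self.key]] = (self.commit_time, dict(row))
        return self.commit_time

    def delete(self, keys):
        # Record-level delete; this toy does not surface deletes incrementally.
        self.commit_time += 1
        for k in keys:
            self.records.pop(k, None)
        return self.commit_time

    def snapshot(self):
        # Full read of the current state, ordered by record key.
        rows = (row for _, row in self.records.values())
        return sorted(rows, key=lambda r: r[self.key])

    def incremental(self, since):
        # Incremental read: only records written after commit `since`.
        rows = (row for t, row in self.records.values() if t > since)
        return sorted(rows, key=lambda r: r[self.key])


orders = ToyUpsertTable()
c1 = orders.upsert([{"id": 1, "status": "new"}, {"id": 2, "status": "new"}])
c2 = orders.upsert([{"id": 2, "status": "shipped"}, {"id": 3, "status": "new"}])
changed = orders.incremental(since=c1)
c3 = orders.delete([1])
current = orders.snapshot()
```

A downstream pipeline that remembers the last commit it processed (`c1` here) only needs to read `changed` instead of rescanning the whole table, which is the idea behind incremental pipelines.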

Apache Gravitino

Unified Metadata Management

An open-source, unified metadata management platform for data lakes, data warehouses, and external catalogs. Gravitino provides a single point of access for managing metadata across diverse data sources, simplifying governance and discovery.

Free

◆4

db2lake

Database to Data Lake ETL

A lightweight Node.js ETL framework for moving data from databases to data lakes and data warehouses. db2lake provides simple configuration-driven extraction with support for incremental loads and multiple output formats.

Free

◆3.5

Featured

lakeFS

Git-Like Data Lake Versioning

An open-source platform that delivers resilience and manageability to object-storage-based data lakes. lakeFS provides git-like branching, merging, and versioning for data, enabling safe experimentation and CI/CD workflows for data pipelines.

Freemium

◆4.5
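
As a rough illustration of git-like branching and merging for data, the sketch below models branches as maps from object paths to version labels. Everything here is hypothetical (`ToyRepo`, the paths, the version strings) and is not the lakeFS client API; the merge is a naive overwrite with no conflict detection or delete handling.

```python
class ToyRepo:
    """Hypothetical repository: each branch maps object paths to version labels."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # Branching copies pointers to objects, not the underlying data.
        self.branches[name] = dict(self.branches[source])

    def put(self, branch, path, version):
        # Writes land on one branch only; other branches are unaffected.
        self.branches[branch][path] = version

    def merge(self, source, dest):
        # Naive merge: the source branch's pointers overwrite the destination's.
        self.branches[dest].update(self.branches[source])


repo = ToyRepo()
repo.put("main", "sales/2024.parquet", "v1")

repo.create_branch("experiment")
repo.put("experiment", "sales/2024.parquet", "v2")
repo.put("experiment", "sales/2025.parquet", "v1")

main_before_merge = dict(repo.branches["main"])  # still only 2024 at v1
repo.merge("experiment", "main")
main_after_merge = dict(repo.branches["main"])
```

Until the merge runs, readers of `main` keep seeing the original data, which is what makes it safe to test a pipeline change on an isolated branch first.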

Project Nessie

Transactional Data Lake Catalog

A transactional catalog for data lakes with git-like semantics. Nessie works with Apache Iceberg tables to provide multi-table transactions, branching, tagging, and time-travel queries across your data lake.

Free

◆4.3

Ilum

Data Lakehouse Platform

A modular data lakehouse platform that simplifies the management and monitoring of Apache Spark clusters. Ilum provides a unified interface for running Spark jobs, managing data pipelines, and monitoring cluster health in lakehouse architectures.

Freemium

◆3.9

FlightPath Data

Data Lake Bronze Layer Gateway

A gateway to a data lake's bronze layer that handles raw data ingestion and landing. FlightPath provides a managed entry point for data flowing into your data lake, ensuring consistent formatting and quality at the ingestion stage.

Freemium

◆3.7
