Data Lake Management
Git-Like Data Lake Versioning
★ 4.5
Transactional Data Lake Catalog
★ 4.3
pip install lakefs
pip install pynessie
Python data engineers use lakeFS to apply software engineering practices to data lake management. A pipeline writes to a lakeFS branch, data quality tests run against the branch, and the Python SDK merges the branch to main only on test success. This prevents bad pipeline outputs from reaching production consumers, the same guarantee that Git branches provide for code changes.
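The write, test, and merge flow described above can be sketched as a small gating function. This is a minimal sketch, not the lakeFS project's own recipe: the function name `write_validate_merge`, the `write` callback, and the `checks` list are hypothetical. It assumes a repository object shaped like the one the high-level `lakefs` package is understood to return from `lakefs.repository(...)`, with `branch(name).create(source_reference=...)`, `commit(message=...)`, and `merge_into(...)`.

```python
from typing import Callable, List, Sequence


def write_validate_merge(
    repo,
    branch_name: str,
    write: Callable[[object], None],
    checks: Sequence[Callable[[object], bool]],
    target: str = "main",
) -> List[str]:
    """Write to an isolated branch and merge it only if every check passes.

    `repo` is assumed to behave like a lakefs Repository object.
    Returns the names of the failed checks (an empty list means merged).
    """
    # Branch off the target so pipeline output stays invisible to consumers.
    branch = repo.branch(branch_name).create(source_reference=target)

    # Hypothetical callback: the pipeline writes its output to the branch.
    write(branch)
    branch.commit(message=f"pipeline output on {branch_name}")

    # Run every data quality check against the branch, not against target.
    failed = [
        getattr(check, "__name__", repr(check))
        for check in checks
        if not check(branch)
    ]
    if failed:
        # Leave the branch unmerged so the bad output can be inspected.
        return failed

    branch.merge_into(repo.branch(target))
    return []
```

With the real SDK, `repo` would presumably come from `lakefs.repository("example-repo")` (a placeholder repository name), and `write` might upload objects with `branch.object(path).upload(data=...)`. Because the function only relies on duck typing, it can be exercised with a stub repository in unit tests before being pointed at a live lakeFS server.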
Python data engineers configure PySpark to use Project Nessie as the Iceberg catalog — enabling table branching within Spark jobs. An engineer creates a Nessie branch, runs a PySpark transformation that modifies multiple Iceberg tables, validates the results, then merges the branch to main — providing atomic multi-table updates with full rollback capability.
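The Nessie workflow above can be sketched in two parts: the Spark settings that register Nessie as an Iceberg catalog, and a helper that runs transformations on a branch and merges only after validation. This is a sketch under stated assumptions. The function names, the catalog name `nessie`, and the `validate` callback are hypothetical. The configuration keys and the `CREATE BRANCH`, `USE REFERENCE`, and `MERGE BRANCH` statements follow the Nessie Spark SQL extensions as commonly documented. The Iceberg and Nessie runtime jars must also be on the Spark classpath, for example through `spark.jars.packages`, which is omitted here because the versions depend on the deployment.

```python
from typing import Callable, Dict, Sequence


def nessie_catalog_conf(
    uri: str, warehouse: str, ref: str = "main", catalog: str = "nessie"
) -> Dict[str, str]:
    """Build Spark settings that register Nessie as an Iceberg catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": ",".join([
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        ]),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,
        f"{prefix}.ref": ref,
        f"{prefix}.warehouse": warehouse,
    }


def build_session(conf: Dict[str, str]):
    """Create a SparkSession from the settings (imports pyspark lazily)."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("nessie-branch-demo")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def run_on_branch(
    spark,
    branch: str,
    transforms: Sequence[str],
    validate: Callable[[object], bool],
    catalog: str = "nessie",
    target: str = "main",
) -> bool:
    """Apply SQL transforms on a Nessie branch; merge only if valid."""
    spark.sql(f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM {target}")
    spark.sql(f"USE REFERENCE {branch} IN {catalog}")

    # Every statement below changes tables on the branch only.
    for statement in transforms:
        spark.sql(statement)

    # Hypothetical validation hook; skipping the merge leaves target untouched.
    if not validate(spark):
        return False

    # One merge publishes all table changes from the branch together.
    spark.sql(f"MERGE BRANCH {branch} INTO {target} IN {catalog}")
    return True
```

A session could then be created with `build_session(nessie_catalog_conf("http://localhost:19120/api/v1", "s3a://example-bucket/warehouse"))`, where both the URI and the warehouse path are placeholders. When validation fails, the target branch has not changed, so rolling back amounts to dropping or abandoning the working branch.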
Individual Tool Pages