248+ tools · 128+ datasets · 33 projects · updated daily

The reference stack for Python data engineers.

Every tool, dataset and project that matters: curated, verified, and stripped of marketing noise. Search, don’t scroll.



Trusted by 1,000+ data engineers

// most popular

Popular Python Data Engineering Tools


Pandas

etl-frameworks

Foundational library for data manipulation and analysis in Python. Provides fast, flexible, and expressive data structures (DataFrames) designed for working with structured, tabular, and time series data. Essential tool for data wrangling with comprehensive features for indexing, grouping, merging, and filtering.

Free · ◆ 4.9

PySpark

etl-frameworks

Python API for Apache Spark, enabling scalable and efficient data processing. Particularly useful for ETL processes involving large datasets that need parallel processing across a cluster.

Free · ◆ 4.8

dbt (Data Build Tool)

etl-frameworks

Open-source transformation tool enabling data analysts and engineers to transform, test, and document data in the warehouse. Focuses on the transform part of ETL with SQL templating and Python scripting.

Freemium · ◆ 4.9

View all projects →

$ python python_development_environment_setup.py · beginner

Python Development Environment Setup

Get your machine set up for data engineering the right way: Python, virtual environments, VS Code, and Git.

python

$ python docker_for_data_engineering.py · beginner

Docker for Data Engineering

Containerize your pipelines so they run the same everywhere. Covers Docker, Compose, and common DE patterns.

docker

$ python dlt_weather_api_pipeline.py · beginner

Weather Data Pipeline with dlt

Pull weather data from a REST API and land it in DuckDB using dlt. A clean intro to the modern EL pattern.

dlt

$ python prefect_pokemon_etl.py · beginner

Pokemon ETL Pipeline with Prefect

Build a full ETL pipeline with Prefect orchestration. Fun dataset, real-world pipeline skills.

prefect

$ python pandas_sales_data_analysis.py · beginner

Sales Data Analysis with Pandas

Load CSV files, clean messy data, and answer business questions with Pandas. Classic starter project.

pandas

$ python cerberus_flexible_validation.py · beginner

Flexible Data Validation with Cerberus

Schema validation for Python dicts with minimal boilerplate. Good for validating API responses and configs.

cerberus

// jump to a category

Python Data Engineering Categories

View all 29 categories →

Data/Schema Validation

7 tools

Database Migration Tools

// common questions

Frequently Asked Questions

What is Python data engineering?

Python data engineering is the practice of designing, building, and maintaining systems that collect, store, process, and analyze large volumes of data using Python and its rich ecosystem of tools. Data engineers use Python to create robust data pipelines, automate ETL workflows, manage databases, and ensure data quality for analytics and machine learning applications.

This directory helps you find and learn the essential Python tools for data engineering, from ORMs like SQLAlchemy to big data frameworks like PySpark, and from orchestration tools like Apache Airflow to data quality libraries like Great Expectations. We’ve curated 248+ production-ready tools, 128+ free datasets, and 33 hands-on projects.

What tools do I need to get started with Python data engineering?

Start with the essentials: Python 3.8+, a code editor like VS Code, and version control with Git. For data manipulation, learn Pandas and NumPy. For databases, start with SQLAlchemy (ORM) and PostgreSQL. As you progress, explore orchestration tools like Apache Airflow, ETL frameworks like dbt, and big data tools like PySpark. Our Getting Started category has everything you need.

What’s the difference between ETL and data wrangling?

ETL (Extract, Transform, Load) is a complete data integration process: extracting from sources, transforming for analytics, and loading into warehouses. Apache Spark can run all three steps, dbt covers the transform step inside the warehouse, and Airflow schedules and orchestrates the steps. Data wrangling focuses specifically on cleaning and preparing messy data: handling missing values, normalizing formats, and reshaping datasets. Pandas and Polars excel at data wrangling. Think of wrangling as the “Transform” step within ETL.
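To make the relationship concrete, here is a minimal sketch of the three ETL steps, with the wrangling confined to the transform function. It uses only the Python standard library so it runs anywhere; in practice the transform step would typically use Pandas or Polars, and the load target would be a real warehouse. The sample data, the `orders` table, and the rule that a missing amount becomes 0 are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: inconsistent casing, stray whitespace, a missing amount.
RAW_CSV = """order_id,region,amount
1, North ,100.50
2,south,
3,NORTH,20
"""

def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform (the wrangling step): normalize formats, handle missing values."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        cleaned.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().lower(),
            # Assumption for this sketch: a missing amount is treated as 0.
            "amount": float(amount) if amount else 0.0,
        })
    return cleaned

def load(rows, conn):
    """Load: write cleaned rows to a table (in-memory SQLite stands in for a warehouse)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, region TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :region, :amount)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

Without the wrangling in `transform`, the three spellings of the region would be grouped separately and the empty amount would fail to parse, which is why cleaning sits between extraction and loading.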

Which Python ORM should I use for my project?

It depends on your framework: Use Django ORM if you’re building with Django. Choose SQLAlchemy for maximum flexibility and complex queries, especially with Flask or standalone applications. For lightweight projects, Peewee offers simplicity. For async applications with FastAPI, consider Tortoise ORM or the encode ORM. Check our ORMs category for detailed comparisons.

How do I learn Python data engineering as a beginner?

Follow this learning path: (1) Master Python fundamentals and SQL, (2) Learn Pandas for data manipulation, (3) Understand databases with PostgreSQL and SQLAlchemy, (4) Build ETL pipelines with simple tools like Python scripts, (5) Learn orchestration with Apache Airflow, (6) Explore big data with PySpark. Most importantly, learn by doing: check our 33 hands-on projects designed for beginners to advanced practitioners.

What’s the difference between batch and stream processing?

Batch processing handles large volumes of data at scheduled intervals (hourly, daily), like processing yesterday’s sales data each morning. Tools: Apache Spark, dbt, Pandas. Stream processing handles data in real time as it arrives, like processing credit card transactions instantly for fraud detection. Tools: Apache Kafka, Apache Flink, Apache Spark Streaming. Choose batch for historical analysis and reporting, and stream for real-time alerts and immediate insights. Many modern systems use both.
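The contrast can be sketched in plain Python, with no streaming framework involved. The batch function sees the whole collected dataset and produces a report; the stream function is a generator that reacts to each event as it is consumed. The event data and the 500.0 alert threshold are hypothetical, and a real system would read events from something like a Kafka topic instead of a list.

```python
from collections import defaultdict

# Hypothetical card transactions: (card_id, amount).
EVENTS = [("card-a", 40.0), ("card-b", 15.0), ("card-a", 980.0), ("card-b", 20.0)]
ALERT_THRESHOLD = 500.0  # assumed cutoff, chosen only for this illustration

def batch_totals(events):
    """Batch: process the full, already-collected dataset in one pass."""
    totals = defaultdict(float)
    for card_id, amount in events:
        totals[card_id] += amount
    return dict(totals)

def stream_alerts(events):
    """Stream: inspect each event as it arrives and emit alerts immediately."""
    for card_id, amount in events:
        if amount > ALERT_THRESHOLD:
            yield f"ALERT {card_id}: {amount:.2f}"

# Batch: run once over the accumulated data (for example, a nightly job).
daily_report = batch_totals(EVENTS)

# Stream: iter() stands in for an unbounded source of incoming events.
alerts = list(stream_alerts(iter(EVENTS)))

print(daily_report)
print(alerts)
```

The batch result is only available after all events are in, while the generator yields its alert as soon as the large transaction is seen, which is the trade-off between reporting and real-time reaction described above.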

Are all the tools in this directory free and open-source?

We feature a mix: many tools are free and open-source (Pandas, Apache Airflow, SQLAlchemy), while others offer freemium models (cloud platforms like AWS, Azure, GCP) or enterprise pricing (Databricks, Snowflake). Each tool listing clearly indicates its pricing model. We focus on production-ready tools used by real data engineering teams, regardless of licensing. Filter by the “free” or “opensource” tags to see only free options.
