Every tool, dataset and project that matters: curated, verified, and stripped of marketing noise. Search, don’t scroll.
// most popular
etl-frameworks
Foundational library for data manipulation and analysis in Python. Provides fast, flexible, and expressive data structures (DataFrames) designed for working with structured, tabular, and time series data. Essential tool for data wrangling with comprehensive features for indexing, grouping, merging, and filtering.
etl-frameworks
Python API for Apache Spark, enabling scalable and efficient data processing. Particularly useful for ETL processes involving large datasets that need parallel processing across a cluster.
etl-frameworks
Open-source transformation tool enabling data analysts and engineers to transform, test, and document data in the warehouse. Focuses on the transform part of ETL with SQL templating and Python scripting.
// practice with real data
Current weather, forecasts, and historical data for any location. Free tier gets you 1,000 calls/day.
Space imagery, Mars rover photos, and asteroid data from NASA. One of the most fun free APIs out there.
Tweets, profiles, trends, and engagement metrics. Good for NLP and social analytics projects.
Repos, commits, PRs, and org data. Perfect for developer analytics and open source research.
Posts, comments, and upvotes via PRAW. Great starting point for sentiment analysis and NLP.
Article content, summaries, and page views. Massive corpus, completely free and open.
// learn by building
Get your machine set up for data engineering the right way: Python, virtual environments, VS Code, and Git.
Containerize your pipelines so they run the same everywhere. Covers Docker, Compose, and common DE patterns.
Pull weather data from a REST API and land it in DuckDB using dlt. A clean intro to the modern EL pattern.
Build a full ETL pipeline with Prefect orchestration. Fun dataset, real-world pipeline skills.
Load CSV files, clean messy data, and answer business questions with Pandas. Classic starter project.
Schema validation for Python dicts with minimal boilerplate. Good for validating API responses and configs.
// common questions
Python data engineering is the practice of designing, building, and maintaining systems that collect, store, process, and analyze large volumes of data using Python and its rich ecosystem of tools. Data engineers use Python to create robust data pipelines, automate ETL workflows, manage databases, and ensure data quality for analytics and machine learning applications.
This directory helps you discover and learn the essential Python tools for data engineering, from ORMs like SQLAlchemy to big data frameworks like PySpark, and from orchestration tools like Apache Airflow to data quality libraries like Great Expectations. We’ve curated 245+ production-ready tools, 128+ free datasets, and 33 hands-on projects to accelerate your data engineering journey.
Start with the essentials: Python 3.8+, a code editor like VS Code, and version control with Git. For data manipulation, learn Pandas and NumPy. For databases, start with SQLAlchemy (ORM) and PostgreSQL. As you progress, explore orchestration tools like Apache Airflow, ETL frameworks like dbt, and big data tools like PySpark. Our Getting Started category has everything you need.
ETL (Extract, Transform, Load) is a complete data integration process: extracting from sources, transforming for analytics, and loading into warehouses. Tools like Apache Spark, dbt, and Airflow each cover part of an ETL pipeline: Spark for large-scale processing, dbt for in-warehouse transformation, and Airflow for orchestration. Data wrangling focuses specifically on cleaning and preparing messy data: handling missing values, normalizing formats, and reshaping datasets. Pandas and Polars excel at data wrangling. Think of wrangling as the “Transform” step within ETL.
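To make the relationship concrete, here is a minimal sketch of an ETL run in which the transform step is the wrangling. It uses only the Python standard library so it runs anywhere; the sample CSV, the `orders` table, and the function names are assumptions made up for illustration, and SQLite stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: inconsistent casing, stray whitespace, a missing amount.
RAW_CSV = """order_id,customer,amount,order_date
1, Alice ,19.99,2024-01-05
2,BOB,,2024-01-06
3,carol,5.50,2024-01-06
"""


def extract(text):
    """Extract: read rows from the source (here an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform (the wrangling step): trim, normalize case, handle missing values."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().title(),
            # Missing amounts become None (NULL in SQL) rather than a guessed value.
            "amount": float(amount) if amount else None,
            "order_date": row["order_date"].strip(),
        })
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a table (SQLite stands in for a warehouse)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :customer, :amount, :order_date)",
        rows,
    )
    conn.commit()


def run_pipeline(conn):
    """Run extract, transform, and load in order."""
    load(transform(extract(RAW_CSV)), conn)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_pipeline(connection)
    for record in connection.execute("SELECT * FROM orders ORDER BY order_id"):
        print(record)
```

In a real pipeline the `transform` function is where Pandas or Polars would usually take over; the structure of the three steps stays the same.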
The right ORM depends on your framework. Use Django ORM if you’re building with Django. Choose SQLAlchemy for maximum flexibility and complex queries, especially with Flask or standalone applications. For lightweight projects, Peewee offers simplicity. For async applications with FastAPI, consider Tortoise ORM or the ORM package from Encode. Check our ORMs category for detailed comparisons.
To learn Python data engineering, follow this path: (1) Master Python fundamentals and SQL, (2) Learn Pandas for data manipulation, (3) Understand databases with PostgreSQL and SQLAlchemy, (4) Build ETL pipelines with simple tools such as plain Python scripts, (5) Learn orchestration with Apache Airflow, (6) Explore big data with PySpark. Most importantly, learn by doing: check our 33 hands-on projects, which range from beginner to advanced.
Batch processing handles large volumes of data at scheduled intervals (hourly, daily), like processing yesterday’s sales data each morning. Typical tools are Apache Spark, dbt, and Pandas. Stream processing handles data in real time as it arrives, like checking credit card transactions for fraud the moment they occur. Typical tools are Apache Kafka, Apache Flink, and Apache Spark Streaming. Choose batch for historical analysis and reporting, and stream for real-time alerts and immediate insights. Many modern systems use both.
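The difference can be sketched in a few lines of plain Python. This is an illustration only, not how Spark, Kafka, or Flink are used in practice: the transaction records, the alert threshold, and the function names are assumptions, and a generator stands in for a live event feed.

```python
from collections import defaultdict

# Hypothetical transaction records: (day, card_id, amount).
TRANSACTIONS = [
    ("2024-01-05", "card-1", 20.0),
    ("2024-01-05", "card-2", 950.0),
    ("2024-01-06", "card-1", 35.0),
    ("2024-01-06", "card-1", 5000.0),
]


def batch_daily_totals(records):
    """Batch: the full dataset is available up front and processed in one pass."""
    totals = defaultdict(float)
    for day, _card, amount in records:
        totals[day] += amount
    return dict(totals)


def stream_alerts(events, threshold=1000.0):
    """Stream: handle each event as it arrives and emit alerts immediately."""
    for day, card, amount in events:
        if amount > threshold:
            yield f"ALERT {card} {amount:.2f} on {day}"


def event_source():
    """Stand-in for a live feed such as a message queue consumer."""
    yield from TRANSACTIONS


if __name__ == "__main__":
    print(batch_daily_totals(TRANSACTIONS))
    for alert in stream_alerts(event_source()):
        print(alert)
```

The batch function cannot report anything until it has seen every record, while the streaming generator yields an alert as soon as the offending event passes through it.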
The directory features a mix of pricing models: many tools are free and open-source (Pandas, Apache Airflow, SQLAlchemy), while others offer freemium models (cloud platforms like AWS, Azure, GCP) or enterprise pricing (Databricks, Snowflake). Each tool listing clearly indicates its pricing model. We focus on production-ready tools used by real data engineering teams, regardless of licensing. Filter by the “free” or “opensource” tags to see only free options.