// learn by building
33 hands-on projects to take you from first pipeline to production-grade systems. Work through them in order or jump to what you need.
// start here
Build your first pipelines. Each project assumes only basic Python and introduces one tool or concept at a time.
Get your machine set up for data engineering the right way: Python, virtual environments, VS Code, and Git.
Containerize your pipelines so they run the same everywhere. Covers Docker, Compose, and common DE patterns.
Pull weather data from a REST API and land it in DuckDB using dlt. A clean intro to the modern EL pattern.
Build a full ETL pipeline with Prefect orchestration. Fun dataset, real-world pipeline skills.
Load CSV files, clean messy data, and answer business questions with Pandas. Classic starter project.
Schema validation for Python dicts with minimal boilerplate. Good for validating API responses and configs.
Discover Peewee, a small and expressive ORM perfect for simpler database applications. Learn to define models, query databases, and perform CRUD operations with minimal boilerplate code - ideal for scripts, small applications, and rapid prototyping.
Explore Django's built-in migration system to manage database schema changes seamlessly. This project walks through creating models, generating migrations, adding fields, creating new models, and using Django's ORM to interact with migrated data - perfect for understanding Django's approach to schema evolution.
Master Flask-Migrate, the Flask extension that integrates Alembic for database migrations. Learn to manage schema changes in Flask applications, add fields to models, create new tables, and maintain database version control - essential for Flask-based data applications.
Create publication-quality visualizations using Matplotlib's powerful plotting capabilities. Learn to build custom charts, control plot aesthetics, create subplots, and export figures in multiple formats - the foundation of data visualization in Python.
Master statistical visualization with Seaborn's high-level interface. Learn to create attractive distribution plots, regression visualizations, and categorical comparisons with minimal code - perfect for rapid data exploration and analysis.
Create a simple, flexible REST API using Flask's minimalist approach. Perfect for learning API fundamentals, microservices, or rapid prototyping. Learn routing, request handling, and how to structure lightweight data services.
// level up
Combine multiple tools, add orchestration and testing, and build systems that look like real production work.
Build a complete ETL pipeline using PySpark to process e-commerce data, including sales analysis, customer segmentation, and product performance metrics. Learn how to leverage Spark's distributed processing capabilities for large-scale data transformations.
Build a complete dbt project with staging models, core business logic, dashboard models, and tests to transform e-commerce data in a PostgreSQL warehouse. Master the modern data transformation workflow used by data teams worldwide.
Build a data pipeline with Dagster to fetch stock data from Yahoo Finance, calculate moving averages, and store results in a database. Learn Dagster's functional approach with ops, jobs, and schedules while working with real financial data.
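As a taste of the calculation step in that project, here is a minimal sketch of a simple moving average in plain Python. It is independent of Dagster and Yahoo Finance: the function name `moving_average`, the window size, and the sample prices are all hypothetical, chosen only for illustration, and the project itself may compute the averages differently.

```python
from collections import deque
from typing import Iterable, List, Optional


def moving_average(prices: Iterable[float], window: int) -> List[Optional[float]]:
    """Return the simple moving average at each position.

    Positions that do not yet have `window` values available yield None.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")

    recent = deque(maxlen=window)  # holds at most the last `window` prices
    running_total = 0.0
    averages: List[Optional[float]] = []

    for price in prices:
        if len(recent) == window:
            # The oldest value is about to be evicted by the append below.
            running_total -= recent[0]
        recent.append(price)
        running_total += price
        averages.append(running_total / window if len(recent) == window else None)

    return averages


# Hypothetical closing prices, made up for illustration only.
closing_prices = [10.0, 11.0, 12.0, 13.0, 14.0]
print(moving_average(closing_prices, window=3))
# [None, None, 11.0, 12.0, 13.0]
```

In an orchestrated pipeline, a function like this would typically sit between the step that fetches prices and the step that writes results to the database.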
Learn to process datasets larger than memory using Dask's parallel computing capabilities. This project demonstrates how to read multiple log files, perform distributed aggregations, and efficiently process data that would not fit in memory with standard Pandas.
Process and analyze time-series sensor data using NumPy's powerful array operations. Learn to perform statistical analysis, smooth data with rolling averages, detect anomalies, and visualize results - essential skills for IoT and monitoring applications.
Explore why Polars outperforms Pandas for file-based ETL above 1 GB. Understand the structural differences between Pandas' eager single-threaded execution and Polars' lazy multi-core evaluation, study evidence from benchmarks and real production migrations (94x on PDS-H, 17.5x at DB Systel), and apply a practical decision framework, including a hybrid approach for ML pipelines.
Build a robust financial transaction validation system using Pydantic's powerful type annotations and custom validators. Learn to validate complex nested data, enforce business rules, handle decimal precision for money, and create type-safe data models perfect for FastAPI applications.
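To show why decimal precision matters for money, here is a small sketch using only the standard library's `decimal` module, not Pydantic itself. The helper name `to_cents` and the sample amounts are hypothetical; the project may round or validate differently.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats cannot represent most decimal fractions exactly.
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False

# Decimal values built from strings keep exact decimal digits.
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total == Decimal("0.30"))  # True


def to_cents(amount: str) -> Decimal:
    """Parse a money string and round it to two decimal places (hypothetical helper)."""
    return Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(to_cents("19.995"))  # 20.00
```

Pydantic models can declare fields of type `Decimal`, so the same exact arithmetic carries over once amounts are parsed into a model.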
Master complex data transformation and validation using Marshmallow schemas. Learn to serialize Python objects to JSON, deserialize and validate incoming data, handle nested relationships, and implement custom validation logic essential for robust API development.
Learn the most popular Python ORM by building a complete database application with SQLAlchemy. Master model definitions, database sessions, CRUD operations, and query building. Essential skills for any data engineer working with relational databases in Python.
Build database-backed applications using Django's powerful built-in ORM. Learn to define models, use Django's admin interface, perform queries with the QuerySet API, and leverage Django's migrations system - perfect for full-stack data applications.
Learn how to use Alembic, the standalone database migration tool for SQLAlchemy, to manage schema changes over time. This project demonstrates creating initial migrations, adding columns, creating new tables, and rolling back changes - essential skills for maintaining database schemas in production.
Build a machine learning model to predict customer churn using Scikit-learn's Random Forest classifier. Learn data preprocessing, model training, evaluation metrics, cross-validation, and feature importance analysis - foundational ML skills every data engineer should master.
Build interactive, web-ready visualizations and dashboards using Plotly. Learn to create charts with hover tooltips, zoom capabilities, and dynamic filtering - essential for modern data storytelling and business intelligence applications.
Build a high-performance REST API using FastAPI with automatic OpenAPI documentation and built-in validation. Learn async endpoints, Pydantic models, and best practices for modern Python API development - a fast route to production-ready data APIs.
Build event-driven data pipelines using Apache Kafka with the confluent-kafka Python client. Learn to produce and consume messages, handle topics, and create the foundation for real-time data streaming architectures.
Build Pythonic stream processing applications using Faust, a library designed specifically for Python developers. Learn async stream processing, stateful operations, and how to create real-time data pipelines with familiar Python syntax.
// go deep
Production-scale challenges: distributed processing, complex architectures, performance tuning, and streaming systems.
Build a production-ready data pipeline using Apache Airflow DAGs to process daily orders. Learn to orchestrate complex workflows with multiple operators, implement branching logic and data quality checks, and handle task dependencies in one of the industry's most widely adopted orchestration tools.
Create a deep learning time series forecasting model using TensorFlow and LSTM networks. Learn to build windowed datasets, train neural networks with the Keras API, make future predictions, and deploy models - useful groundwork for time-series prediction tasks.
Implement an autoencoder neural network in PyTorch for unsupervised anomaly detection in network traffic. Learn PyTorch's nn.Module, custom datasets, DataLoaders, and how to identify outliers - critical for security and monitoring applications.
Build a robust, full-featured REST API using Django REST Framework. Learn serializers, viewsets, authentication, permissions, and pagination - ideal for complex, database-driven APIs that need enterprise-grade features out of the box.
Process unbounded data streams using Apache Flink and PyFlink. Learn stateful computations, event-time processing, and windowing operations - essential for building sophisticated real-time analytics and continuous ETL pipelines.
// common questions