Open Source Tools & Datasets for Python Data Engineering

Explore 98 tools and 5 datasets tagged with Open Source for Python data engineering.

Tools (98)

Featured

Pandas

Data Manipulation & Analysis Library

Powerful Python library for data manipulation and analysis, offering DataFrame structures for efficient data cleaning, transformation, and analysis. Often used in the transform phase of ETL processes.

Free

4.9

Petl

Python ETL Package

Python package specifically designed for ETL tasks, offering tools for data extraction, transformation, and loading. Suitable for simpler, script-based ETL processes.

Free

4.3

Featured

PySpark

Python API for Apache Spark

Python API for Apache Spark, enabling scalable and efficient data processing. Particularly useful for ETL processes involving large datasets that need parallel processing across a cluster.

Free

4.8

DLT (Data Load Tool)

Python Data Loading Library

Python library that facilitates the loading phase in ETL processes. Designed to simplify loading data into various data stores or processing systems.

Free

4.5

Featured

dbt (Data Build Tool)

Transform Data in Your Warehouse

Open-source transformation tool enabling data analysts and engineers to transform, test, and document data in the warehouse. Focuses on the transform part of ETL with SQL templating and Python scripting.

Freemium

4.9

Bonobo

Lightweight ETL Framework

Lightweight Extract-Transform-Load (ETL) framework for Python 3.6+. Allows writing ETL scripts in pure Python, particularly suited for simple and straightforward ETL tasks.

Free

4.2

Mage.AI

Data Pipeline Tool

Modern data pipeline tool focused on automating data preparation and feature engineering for machine learning. Streamlines the data transformation process in ETL workflows.

Freemium

4.6

Featured

Apache Airflow

Workflow Orchestration Platform

Platform to programmatically author, schedule, and monitor workflows. Allows for complex pipeline construction and efficient task management with robust dependency handling.

Free

4.8

Luigi

Batch Job Pipeline Builder

Developed by Spotify, Luigi helps build complex pipelines of batch jobs, handling dependency resolution, workflow management, and task visualization.

Free

4.4

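The dependency resolution that Luigi's description mentions can be illustrated without Luigi itself. The sketch below uses only the standard library's `graphlib` module (Python 3.9+). The task names are hypothetical and this is not Luigi's API; it only shows the ordering idea that batch pipeline builders rely on.

```python
from graphlib import TopologicalSorter

# Hypothetical batch jobs: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def run_order(deps):
    """Return an execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A real pipeline tool adds scheduling, retries, and state tracking on top of this ordering step.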

Apache NiFi

Data Flow Automation

Easy-to-use, powerful, and reliable system to process and distribute data, offering a web-based user interface for data flow management.

Free

4.5

Featured

Prefect

Modern Workflow Orchestration

Workflow management system designed for modern infrastructure, with a focus on simplicity, ease of use, and flexibility in defining and executing workflows.

Freemium

4.7

Featured

Dagster

Data Orchestrator for ML & Analytics

Open-source data orchestrator for machine learning, analytics, and ETL. Focuses on development, production, and observation of data assets with integrated pipeline views.

Freemium

4.7

Argo Workflows

Kubernetes-Native Workflow Engine

Open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Designed for large-scale computational tasks with powerful workflow features.

Free

4.6

Dask

Parallel Computing Library

Parallel computing library that scales Pandas workflows to larger-than-memory datasets. Enables parallel processing while maintaining a familiar Pandas-like interface for big data.

Free

4.6

Featured

NumPy

Numerical Computing Library

Fundamental library for numerical computing in Python. Supports large multi-dimensional arrays and matrices with a vast collection of mathematical functions for array operations.

Free

4.9

Beautiful Soup

Web Scraping & HTML Parsing

Library for web scraping and parsing HTML/XML documents. Extensively used in data wrangling to clean, parse, and extract data from web sources.

Free

4.5

Scrapy

Web Crawling Framework

Powerful web crawling and scraping framework for extracting, cleaning, and processing large volumes of web data. Essential for data wrangling from web sources.

Free

4.6

TextBlob

Text Processing Library

Simple library for processing textual data with APIs for common NLP tasks. Essential for data wrangling when dealing with text data and natural language processing.

Free

4.3

Featured

Pydantic

Data Validation using Type Hints

Data validation and settings management library using Python type annotations. Ensures data conforms to defined schemas with Python's typing module, perfect for FastAPI and modern Python apps.

Free

4.9

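To make "validation using type annotations" concrete, here is a minimal standard-library sketch of the idea. It is not Pydantic's API: the `User` class is hypothetical, the check only handles plain classes as annotations, and Pydantic itself goes much further (type coercion, nested models, detailed error reports).

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class User:
    id: int
    name: str
    active: bool = True

    def __post_init__(self):
        # Compare each field's runtime value with its declared annotation.
        for field, expected in get_type_hints(type(self)).items():
            value = getattr(self, field)
            if not isinstance(value, expected):
                raise TypeError(
                    f"{field}: expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )


user = User(id=1, name="Ada")

try:
    User(id="1", name="Ada")  # wrong type for `id`
    rejected = False
except TypeError as exc:
    rejected = True
    print(exc)
```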

Featured

Marshmallow

Object Serialization & Validation

ORM/ODM/framework-agnostic library for object serialization and deserialization. Converts complex data types to and from native Python datatypes with robust validation.

Free

4.7

Cerberus

Lightweight Data Validation

Lightweight and extensible data validation library supporting complex data structures with customizable validation rules. Highly flexible for various validation needs.

Free

4.5

Voluptuous

Python Data Structure Validation

Validates Python data structures with straightforward syntax and clear error messages. Ensures structure and content adhere to specified schemas.

Free

4.3

jsonschema

JSON Schema Validator

Library for validating JSON data against JSON Schema standards. Essential when working with JSON data formats to ensure schema compliance.

Free

4.6

Featured

Pandera

DataFrame Validation

Flexible API for validating dataframe structures. Validates dataframes at runtime and integrates with Pydantic and FastAPI. Essential for production data pipelines.

Free

4.7

Validr

Fast Validation Library

Fast, simple, and powerful validation library with declarative validation rules. Optimized for performance when validating data from various sources.

Free

4.2

Featured

SQLAlchemy

Python SQL Toolkit & ORM

Widely used ORM library providing a full suite of enterprise-level persistence patterns. Designed for efficient, high-performing database access with flexible SQL abstraction.

Free

4.9

Featured

Django ORM

Django's Built-in ORM

Part of the Django web framework; lets you define data models entirely in Python. Provides a powerful abstraction layer that translates Python code into SQL.

Free

4.8

Peewee

Small Expressive ORM

Small, expressive ORM with simple and intuitive interface. Lightweight and easy to use, perfect for small to medium-sized applications prioritizing simplicity.

Free

4.6

Pony ORM

Pythonic Query Language

Unique ORM using generator expressions for queries. Intuitive and user-friendly, allowing complex queries in pure Python that mirror human language.

Free

4.5

SQLObject

Object Interface to Database

Popular ORM providing an object-oriented interface, with tables as classes and rows as instances. Supports a variety of database backends with a focus on simplicity.

Free

4.2

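The "tables as classes, rows as instances" pattern can be sketched with the standard library's `sqlite3` module. This is an illustration of the general ORM idea, not SQLObject's API; the `Person` class, its methods, and the `person` table are hypothetical.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Person:
    """One instance corresponds to one row of the hypothetical `person` table."""

    id: int
    name: str

    @classmethod
    def create_table(cls, conn):
        conn.execute(
            "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )

    @classmethod
    def insert(cls, conn, name):
        # Insert a row and return it as an instance.
        cur = conn.execute("INSERT INTO person (name) VALUES (?)", (name,))
        return cls(id=cur.lastrowid, name=name)

    @classmethod
    def get(cls, conn, person_id):
        # Load one row by primary key, or return None if it does not exist.
        row = conn.execute(
            "SELECT id, name FROM person WHERE id = ?", (person_id,)
        ).fetchone()
        return cls(*row) if row else None


conn = sqlite3.connect(":memory:")
Person.create_table(conn)
created = Person.insert(conn, "Grace")
fetched = Person.get(conn, created.id)
print(fetched)
```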

Tortoise ORM

Async ORM for Python

Easy-to-use asyncio ORM inspired by Django. Designed for async/await syntax, making it perfect for asynchronous applications and modern Python development.

Free

4.6

Gino

Async SQLAlchemy ORM

Async ORM built on SQLAlchemy Core for asyncio programming. Provides a simple and intuitive API for asynchronous database interactions with high performance.

Free

4.4

Featured

Alembic

Database Migrations for SQLAlchemy

Lightweight database migration tool for use with SQLAlchemy. Alembic allows you to create, manage, and invoke change management scripts for your database, facilitating schema migrations as your application evolves.

Free

4.7

Flyway

Database Migration Tool

Robust version control tool for databases with support for SQL-based migrations. While not Python-specific, widely used in the community and easily integrated into Python projects for database schema management.

Free / Paid

4.6

Featured

Django Migrations

Built-in Django Migration Framework

Migration framework bundled with Django. Lets you change your database schema without losing data using a simple and intuitive API.

Free

4.8

Flask-Migrate

Database Migrations for Flask

Extension that handles SQLAlchemy database migrations for Flask applications using Alembic. Provides command-line tools to manage and automate database migrations in Flask projects.

Free

4.5

yoyo-migrations

Database Schema Migration Tool

Database schema migration tool that lets you manage your database schema by applying and rolling back migration scripts written in pure SQL or Python. Simple and flexible approach to database migrations.

Free

4.3

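Applying and rolling back migration scripts, as described above, boils down to tracking which steps have already run. The sketch below shows that mechanism with `sqlite3` only. It does not use yoyo-migrations' API; the `MIGRATIONS` list, the `_migrations` bookkeeping table, and the function names are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical migrations in order: (id, apply SQL, rollback SQL).
MIGRATIONS = [
    ("0001_create_users",
     "CREATE TABLE users (id INTEGER PRIMARY KEY)",
     "DROP TABLE users"),
    ("0002_create_orders",
     "CREATE TABLE orders (id INTEGER PRIMARY KEY)",
     "DROP TABLE orders"),
]


def applied_ids(conn):
    """Return the ids of migrations already recorded as applied."""
    conn.execute("CREATE TABLE IF NOT EXISTS _migrations (id TEXT PRIMARY KEY)")
    return {row[0] for row in conn.execute("SELECT id FROM _migrations")}


def apply_all(conn):
    """Apply every migration that has not been recorded yet, in order."""
    done = applied_ids(conn)
    for mid, apply_sql, _ in MIGRATIONS:
        if mid not in done:
            conn.execute(apply_sql)
            conn.execute("INSERT INTO _migrations (id) VALUES (?)", (mid,))
    conn.commit()


def rollback_last(conn):
    """Undo the most recently applied migration, if any."""
    done = applied_ids(conn)
    for mid, _, rollback_sql in reversed(MIGRATIONS):
        if mid in done:
            conn.execute(rollback_sql)
            conn.execute("DELETE FROM _migrations WHERE id = ?", (mid,))
            break
    conn.commit()


def table_names(conn):
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {row[0] for row in rows}


conn = sqlite3.connect(":memory:")
apply_all(conn)
tables_after_apply = table_names(conn)
rollback_last(conn)
tables_after_rollback = table_names(conn)
```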

SQLAlchemy-Migrate

Schema Versioning for SQLAlchemy

Provides a way to deal with database schema changes in SQLAlchemy projects. Extends SQLAlchemy to have database schema versioning and migration capabilities for managing database evolution.

Free

4.2

South

Legacy Django Migrations

The original migration tool for Django before built-in migrations were added in Django 1.7. Still relevant for maintaining or upgrading legacy Django applications running older versions.

Free

Featured

Apache Kafka

Distributed Event Streaming Platform

Distributed event streaming platform capable of handling trillions of events a day. Used for building real-time streaming data pipelines and applications with high-throughput, fault-tolerance, and scalability.

Free

4.8

Featured

Apache Flink

Stream Processing Framework

Framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Known for high performance in streaming data processing with exactly-once semantics.

Free

4.7

Apache Storm

Real-Time Computation System

Real-time computation system making it easy to process unbounded streams of data reliably. Fast and scalable distributed real-time computation framework for stream processing.

Free

4.4

Faust

Python Stream Processing

Stream processing library porting ideas from Kafka Streams to Python. Used for building high-performance and reliable real-time stream processing applications with Pythonic API.

Free

4.5

Apache Spark Streaming

Scalable Stream Processing

Extension of the core Apache Spark API enabling scalable, high-throughput, fault-tolerant processing of live data streams. Integrated within the Spark ecosystem for complex real-time data processing tasks.

Free

4.6

Redpanda

Modern Streaming Platform

Streaming data platform API-compatible with Apache Kafka but designed for better performance and easier operational management. Modern streaming platform for mission-critical workloads.

Free / Paid

4.6

Featured

Flask

Lightweight Web Framework

Lightweight WSGI web application framework that is easy to get started with yet versatile enough for complex applications. Popular for building web APIs thanks to its simplicity and extensibility.

Free

4.8

Featured

Django REST Framework

Powerful API Toolkit for Django

Powerful and flexible toolkit for building Web APIs in Django. Highly recommended for adding API capabilities to Django applications with comprehensive features and excellent documentation.

Free

4.9

Featured

FastAPI

Modern High-Performance Framework

Modern, fast (high-performance) web framework for building APIs with Python 3.7+ based on standard type hints. Features automatic API documentation, ease of use, and fast execution.

Free

4.9

Tornado

Asynchronous Networking Library

Python web framework and asynchronous networking library. Particularly useful for long-polling, WebSockets, and applications requiring long-lived connections to each user.

Free

4.5

Falcon

High-Performance Python Framework

Reliable, high-performance Python framework for building large-scale app backends and microservices. Encourages REST architectural style while remaining highly effective and minimalist.

Free

4.6

Featured

Great Expectations

Data Validation & Documentation

Comprehensive tool helping data teams validate, document, and profile their data. Define expectations for your data ensuring it meets quality standards before processing.

Free / Paid

4.7

Ydata Profiling

Automated Data Profiling

Generates profile reports from pandas DataFrames. Excellent tool for quickly understanding data with interactive HTML reports including statistics, distributions, and correlations.

Free

4.6

PyDeequ

Data Quality for Big Data

Python API for Deequ, an AWS library built on Apache Spark for defining and verifying data quality constraints. Useful for large-scale data processing and quality verification.

Free

4.5

Dedupe

ML-Powered Deduplication

Python library using machine learning to perform deduplication and entity resolution on structured data. Particularly useful for identifying and merging duplicate records.

Free

4.4

Soda Core

Data Quality Testing

Open-source data quality tool with CLI for defining, running, and monitoring data quality checks. Write tests to verify data meets conditions like missing values, ranges, or uniqueness.

Free / Paid

4.6

DataCleaner

Automated Data Cleaning

Automatic tool for cleaning and preprocessing data. Handles missing values, encodes categorical data, and scales features making data preparation efficient.

Free

4.2

Data Linter

Schema Validation Tool

Python package for automated data validation within Data Engineering pipelines. Engineered to ingest and validate tabular data against predefined schemas.

Free

4.1

Featured

Matplotlib

Comprehensive Visualization Library

Comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib is versatile and widely used for plotting graphs and charts with extensive customization options.

Free

4.8

Featured

Seaborn

Statistical Data Visualization

Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics, simplifying the creation of complex visualizations with beautiful default themes.

Free

4.7

Featured

Plotly

Interactive Visualization Library

Plotly offers a range of interactive plotting options and is known for its advanced graphics and interactivity, supporting complex visualizations with ease. Perfect for creating web-based dashboards.

Free / Paid

4.8

Bokeh

Interactive Web Visualizations

Bokeh focuses on building interactive, web-ready plots, which can be a powerful tool for creating dynamic visualizations that can easily be embedded in web applications.

Free

4.6

Altair

Declarative Visualization

Altair is a declarative statistical visualization library for Python, offering a simple and concise way to create a wide range of statistical plots based on a logical data mapping.

Free

4.5

Featured

Scikit-learn

Machine Learning in Python

Versatile library providing a range of supervised and unsupervised learning algorithms. Known for its ease of use and efficiency for data mining and data analysis with classical ML algorithms.

Free

4.9

Featured

TensorFlow

End-to-End ML Platform

End-to-end open-source platform for machine learning enabling complex computations with data flow graphs. Widely used for deep learning applications with robust production support.

Free

4.8

Featured

PyTorch

Deep Learning Framework

Open-source machine learning library known for its flexibility and ease of use, and a preferred tool for research in deep learning and artificial intelligence. Builds computation graphs dynamically.

Free

4.8

Keras

High-Level Neural Networks API

High-level neural networks API designed for fast experimentation with deep neural networks. Runs on top of TensorFlow (Keras 3 also supports JAX and PyTorch backends), offering a user-friendly interface for building models.

Free

4.7

Featured

XGBoost

Extreme Gradient Boosting

Highly efficient gradient boosting implementation designed for speed and performance. Widely used in machine learning competitions and practical applications for structured data.

Free

4.8

LightGBM

Light Gradient Boosting Machine

Gradient boosting framework using tree-based learning algorithms. Designed for speed and efficiency, supporting large datasets and distributed computing for various ML tasks.

Free

4.7

CatBoost

Gradient Boosting on Decision Trees

Algorithm for gradient boosting on decision trees developed by Yandex. Particularly effective for datasets with categorical features, known for robustness and handling overfitting well.

Free

4.6

Apache Hadoop

Distributed Storage and Processing Framework

Framework that allows for distributed processing of large datasets across clusters of computers using simple programming models. Designed to scale from single servers to thousands of machines, each offering local computation and storage. Uses HDFS for distributed storage and MapReduce for processing.

Free

4.2

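The MapReduce model named in the Hadoop description can be shown in miniature. The following single-process sketch runs the map, shuffle, and reduce phases of a word count in plain Python. It is a conceptual illustration only: it does not use Hadoop, and the function names and sample lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


lines = ["big data big clusters", "data pipelines"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the program structure stays the same.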

Featured

Apache Beam

Unified Batch and Stream Processing

Advanced unified programming model for defining and executing data processing workflows that can run on any execution engine. Provides portability across multiple execution environments including Apache Flink, Apache Spark, and Google Cloud Dataflow. Ideal for building flexible, scalable data pipelines.

Free

4.5

Featured

Boto3

AWS SDK for Python

The official Amazon Web Services (AWS) SDK for Python. Enables Python developers to write software that makes use of services like Amazon S3, EC2, Lambda, and more. Provides an easy-to-use, object-oriented API as well as low-level access to AWS services, making it simple to integrate Python applications with AWS infrastructure.

Free

4.8

Featured

Google Cloud Client Libraries

GCP SDK for Python

Google Cloud Platform's official client libraries for Python, enabling seamless integration with GCP services like Compute Engine, Cloud Storage, BigQuery, and Pub/Sub. Designed for a Pythonic, intuitive experience when interacting with Google Cloud services, with idiomatic code patterns and comprehensive documentation.

Free

4.7

Azure SDK for Python

Microsoft Azure SDK

Microsoft's comprehensive Azure SDK for Python, offering a complete set of packages to interact with Azure resources and services. Supports a wide range of Azure services including Virtual Machines, Storage, Databases, AI services, and more. Provides tools for effective resource management and service interaction within the Azure ecosystem.

Free

4.6

IBM Cloud Python SDK

IBM Cloud Services SDK

Official SDK for interacting with various IBM Cloud services programmatically. Provides comprehensive support for IBM Cloud services including CIS, DNS, IAM, VPC, Watson AI, and more. Enables management and automation of IBM Cloud resources with Python, compatible with Python 3.6 and above.

Free

4.3

Oracle Cloud Infrastructure SDK

OCI SDK for Python

Official SDK for writing code to manage Oracle Cloud Infrastructure resources. Supports a wide range of Oracle Cloud services with functionality for compute, storage, networking, databases, and more. Available across multiple operating systems and Python versions, providing a robust interface for OCI resource management.

Free

4.4

Featured

MySQL Workbench

MySQL Database Design Tool

Integrated tool provided by MySQL for database design, modeling, administration, and maintenance. Provides a visual interface for creating, managing, and analyzing MySQL databases. Includes data modeling, SQL development, and comprehensive administration tools for MySQL database systems.

Free

4.4

ERAlchemy

ER Diagrams from SQLAlchemy

Python library designed to create Entity Relationship diagrams by extracting data from databases or SQLAlchemy models. Particularly useful for database designers and developers who need to visualize and interpret complex relationships within database systems. Generates diagrams automatically from your Python code.

Free

4.2

Dia

Open Source Diagramming

Free and open-source diagramming tool that can be used to create Entity-Relationship diagrams. Versatile application suitable for simple modeling tasks, flowcharts, network diagrams, and database schemas. Lightweight alternative for developers who need basic ER diagram functionality.

Free

3.8

Featured

PostgreSQL

Advanced Open Source Database

Powerful, open-source object-relational database system known for reliability, feature robustness, and performance. Widely used in Python community with excellent support for advanced data types, JSON, full-text search, and performance optimization. ACID-compliant with strong community and enterprise adoption.

Free

4.8

Featured

MongoDB

Document NoSQL Database

Scalable, flexible document database with rich querying and indexing capabilities. Stores data as JSON-like (BSON) documents, making it well suited to rapid development and horizontal scaling. Supports aggregation pipelines and transactions, and has rich Python driver support with PyMongo.

Freemium

4.6

Featured

Redis

In-Memory Data Store

Open-source, in-memory data structure store used as a database, cache, and message broker. Supports various data structures including strings, hashes, lists, sets, sorted sets, and streams. Provides high performance and sub-millisecond latency, and is widely used for caching, session management, and real-time analytics.

Free

4.7

Apache Cassandra

Distributed Wide-Column Store

Highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers with no single point of failure. Provides high availability and linear scalability. Ideal for applications requiring continuous availability and massive write throughput.

Free

4.3

InfluxDB

Time Series Database

Open-source time series database designed to handle high write and query loads for time-stamped data. Optimized for monitoring, IoT, analytics, and real-time applications. Features include retention policies, continuous queries, and InfluxQL for time-series specific operations.

Freemium

4.4

Featured

Elasticsearch

Distributed Search & Analytics

Distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. Commonly used for log analytics, full-text search, security intelligence, business analytics, and operational intelligence. Built on Apache Lucene with powerful aggregations and near real-time search.

Freemium

4.6

Apache Atlas

Enterprise Data Governance

Scalable and extensible set of core foundational governance services for the Hadoop ecosystem and enterprise data. Enables organizations to effectively meet compliance requirements with metadata management, data classification, and lineage tracking. Integrates with Python through REST APIs for governance automation.

Free

4.2

Featured

Amundsen

Data Discovery & Metadata Engine

Data discovery and metadata engine for improving productivity of data analysts, scientists, and engineers when interacting with data. Provides powerful search, data previews, and column-level lineage. Integrates seamlessly with Python environments and modern data stacks for comprehensive metadata management.

Free

4.5

CKAN

Open Data Management System

Powerful data management system that makes data accessible by providing tools to streamline publishing, sharing, finding, and using data. Aimed at data publishers wanting to make their data open and available. Features data cataloging, API generation, and visualization capabilities.

Free

4.1

Marquez

Metadata Service for Data Lineage

Open-source metadata service for the collection, aggregation, and visualization of data ecosystem metadata. Provides a common interface to track data lineage across your entire data platform. Offers a Python client for integration and supports the OpenLineage standard for lineage collection.

Free

4.3

Featured

DataHub

Modern Metadata Platform

Open-source metadata platform for the modern data stack. Provides powerful and flexible metadata search, discovery, and lineage capabilities. Features real-time metadata updates, data quality monitoring, and governance workflows. Extensive Python SDK for automation and integration.

Free

4.6

Featured

Stack Overflow

Q&A for Data Engineers

Vast community of developers and IT professionals with extensive data engineering questions and answers. Rich resource for troubleshooting, learning from real-world problems, and discovering solutions. Active community providing quick responses to technical challenges in Python data engineering.

Free

4.7

Featured

Pandas

Data Analysis & Manipulation

Foundational library for data manipulation and analysis in Python. Provides fast, flexible, and expressive data structures (DataFrames) designed for working with structured, tabular, and time series data. Essential tool for data wrangling with comprehensive features for indexing, grouping, merging, and filtering.

Free

4.9

OpenRefine

Data Cleaning & Transformation

Powerful tool for working with messy data, cleaning it, transforming from one format to another, and extending it with web services or external data. Although not a Python library, it's valuable for advanced data wrangling alongside Python tools.

Free

4.5

ORM (encode/orm)

Lightweight Async ORM

Lightweight and async-ready ORM designed to work with FastAPI and Starlette. Particularly suited for applications requiring asynchronous database operations with minimal overhead and modern Python async/await patterns.

Free

4.3

Featured

Python

Programming Language

Python is a high-level, interpreted programming language that has become the dominant language for data engineering. Known for its clear syntax, extensive standard library, and rich ecosystem of data-focused packages. Essential foundation for all Python data engineering work.

Free

4.9

pip

Python Package Installer

The standard package installer for Python. Used to install and manage Python packages from the Python Package Index (PyPI) and other repositories. Essential tool for managing dependencies in any Python project; it comes bundled with most Python installations.

Free

4.7

virtualenv / venv

Virtual Environment Manager

Tools for creating isolated Python environments, allowing you to manage project-specific dependencies without conflicts. venv comes built into Python 3, while virtualenv offers additional features. Critical for professional Python development and maintaining clean, reproducible environments.

Free

4.6

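Because `venv` is part of the standard library, an isolated environment can also be created from Python code. The sketch below builds one in a temporary directory and checks that it exists. The `make_env` helper and the `.venv` directory name are assumptions for illustration; `with_pip=False` skips bootstrapping pip to keep the example quick.

```python
import tempfile
import venv
from pathlib import Path


def make_env(parent_dir):
    """Create an isolated environment under parent_dir and return its path."""
    env_dir = Path(parent_dir) / ".venv"
    # Pass with_pip=True instead to bootstrap pip into the new environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_dir = make_env(tmp)
    # Every environment created by venv contains a pyvenv.cfg marker file.
    has_config = (env_dir / "pyvenv.cfg").exists()
    print(has_config)
```

The usual command-line equivalent is `python -m venv .venv`, run from the project directory.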

Docker Compose

Multi-Container Orchestration

Tool for defining and running multi-container Docker applications using YAML configuration files. Perfect for data engineering workflows that require multiple services like databases, message queues, and processing engines running together. Simplifies complex container setups into simple, version-controlled configurations.

Free

4.7

Datasets (5)

MusicBrainz API

An open-source database that collects information about music artists, releases, and tracks.

#rest-api #json #entertainment+3

API

OpenAQ API

Retrieve real-time and historical air quality data from locations around the world.

#rest-api #json #entertainment+5

API

OpenStreetMap API

Access open-source map data and perform geolocation services using OpenStreetMap.

#rest-api #json #entertainment+4

API

Hugging Face Datasets

A library that provides access to a wide range of datasets for natural language processing (NLP) tasks.

#csv #batch-processing #machine-learning+2

Download

Natural Earth Data

Natural Earth provides public domain map datasets at various scales, covering physical and cultural features such as coastlines, rivers, cities and political boundaries.

#csv #batch-processing #maps+3

Download

Tools (98)

Featured

Pandas

Data Manipulation & Analysis Library

Free

4.9

Details Visit

Petl

Python ETL Package

Python package specifically designed for ETL tasks, offering tools for data extraction, transformation, and loading. Suitable for simpler, script-based ETL processes.

Free

4.3

Details Visit

Featured

PySpark

Python API for Apache Spark

Python API for Apache Spark, enabling scalable and efficient data processing. Particularly useful for ETL processes involving large datasets that need parallel processing across a cluster.

Free

4.8

Details Visit

DLT (Data Load Tool)

Python Data Loading Library

Python library that facilitates the loading phase in ETL processes. Designed to simplify loading data into various data stores or processing systems.

Free

4.5

Details Visit

Featured

dbt (Data Build Tool)

Transform Data in Your Warehouse

Freemium

4.9

Details Visit

Bonobo

Lightweight ETL Framework

Lightweight Extract-Transform-Load (ETL) framework for Python 3.6+. Allows writing ETL scripts in pure Python, particularly suited for simple and straightforward ETL tasks.

Free

4.2

Details Visit

Mage.AI

Data Pipeline Tool

Modern data pipeline tool focused on automating data preparation and feature engineering for machine learning. Streamlines the data transformation process in ETL workflows.

Freemium

4.6

Details Visit

Featured

Apache Airflow

Workflow Orchestration Platform

Platform to programmatically author, schedule, and monitor workflows. Allows for complex pipeline construction and efficient task management with robust dependency handling.

Free

4.8

Details Visit

Luigi

Batch Job Pipeline Builder

Developed by Spotify, Luigi helps build complex pipelines of batch jobs, handling dependency resolution, workflow management, and task visualization.

Free

4.4

Details Visit

Apache NiFi

Data Flow Automation

Easy-to-use, powerful, and reliable system to process and distribute data, offering a web-based user interface for data flow management.

Free

4.5

Details Visit

Featured

Prefect

Modern Workflow Orchestration

Workflow management system designed for modern infrastructure, with a focus on simplicity, ease of use, and flexibility in defining and executing workflows.

Freemium

4.7

Details Visit

Featured

Dagster

Data Orchestrator for ML & Analytics

Open-source data orchestrator for machine learning, analytics, and ETL. Focuses on development, production, and observation of data assets with integrated pipeline views.

Freemium

4.7

Details Visit

Argo Workflows

Kubernetes-Native Workflow Engine

Open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Designed for large-scale computational tasks with powerful workflow features.

Free

4.6

Details Visit

Dask

Parallel Computing Library

Parallel computing library that scales Pandas workflows to larger-than-memory datasets. Enables parallel processing while maintaining a familiar Pandas-like interface for big data.

Free

4.6

Details Visit

Featured

NumPy

Numerical Computing Library

Fundamental library for numerical computing in Python. Supports large multi-dimensional arrays and matrices with a vast collection of mathematical functions for array operations.

Free

4.9

Details Visit

Beautiful Soup

Web Scraping & HTML Parsing

Library for web scraping and parsing HTML/XML documents. Extensively used in data wrangling to clean, parse, and extract data from web sources.

Free

4.5

Details Visit

Scrapy

Web Crawling Framework

Powerful web crawling and scraping framework for extracting, cleaning, and processing large volumes of web data. Essential for data wrangling from web sources.

Free

4.6

Details Visit

TextBlob

Text Processing Library

Simple library for processing textual data with APIs for common NLP tasks. Essential for data wrangling when dealing with text data and natural language processing.

Free

4.3

Details Visit

Featured

Pydantic

Data Validation using Type Hints

Data validation and settings management library using Python type annotations. Ensures data conforms to defined schemas with Python's typing module, perfect for FastAPI and modern Python apps.

Free

4.9

Details Visit
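
The idea of validating data against type annotations can be sketched with the standard library alone. This is an illustration of the concept, not Pydantic's API: the `User` class is hypothetical, and the check only handles plain types such as `int` and `str`. Pydantic itself goes further, for example by coercing compatible values in its default mode.

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class User:
    # Hypothetical record type used only for this sketch.
    id: int
    name: str
    active: bool = True

    def __post_init__(self):
        # Compare each field's value with its annotated type.
        for field, expected in get_type_hints(type(self)).items():
            value = getattr(self, field)
            if not isinstance(value, expected):
                raise TypeError(
                    f"{field} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )


user = User(id=1, name="Ada")

try:
    User(id="1", name="Ada")  # wrong type for id
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```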

Featured

Marshmallow

Object Serialization & Validation

ORM/ODM/framework-agnostic library for object serialization and deserialization. Converts complex data types to and from native Python datatypes with robust validation.

Free

4.7

Details Visit

Cerberus

Lightweight Data Validation

Lightweight and extensible data validation library supporting complex data structures with customizable validation rules. Highly flexible for various validation needs.

Free

4.5

Details Visit

Voluptuous

Python Data Structure Validation

Validates Python data structures with straightforward syntax and clear error messages. Ensures structure and content adhere to specified schemas.

Free

4.3

Details Visit

jsonschema

JSON Schema Validator

Library for validating JSON data against JSON Schema standards. Essential when working with JSON data formats to ensure schema compliance.

Free

4.6

Details Visit

Featured

Pandera

DataFrame Validation

Flexible API for data validation on dataframe structures. Validates dataframes at runtime and integrates with Pydantic and FastAPI. Essential for production data pipelines.

Free

4.7

Details Visit

Validr

Fast Validation Library

Fast, simple, and powerful validation library with declarative validation rules. Optimized for performance when validating data from various sources.

Free

4.2

Details Visit

Featured

SQLAlchemy

Python SQL Toolkit & ORM

Widely used ORM library providing a full suite of enterprise-level persistence patterns. Designed for efficient, high-performing database access with flexible SQL abstraction.

Free

4.9

Details Visit

Featured

Django ORM

Django's Built-in ORM

Part of the Django web framework; lets you define data models entirely in Python. Provides a powerful abstraction layer that translates Python code to SQL.

Free

4.8

Details Visit

Peewee

Small Expressive ORM

Small, expressive ORM with a simple and intuitive interface. Lightweight and easy to use, perfect for small to medium-sized applications that prioritize simplicity.

Free

4.6

Details Visit

Pony ORM

Pythonic Query Language

Unique ORM that uses Python generator expressions for queries and translates them into SQL. Intuitive and user-friendly, allowing complex queries to be written in pure Python.

Free

4.5

Details Visit

SQLObject

Object Interface to Database

Popular ORM providing an object-oriented interface with tables as classes and rows as instances. Supports a variety of database backends with a focus on simplicity.

Free

4.2

Details Visit

Tortoise ORM

Async ORM for Python

Easy-to-use asyncio ORM inspired by Django. Designed for async/await syntax, making it perfect for asynchronous applications and modern Python development.

Free

4.6

Details Visit

Gino

Async SQLAlchemy ORM

Async ORM built on SQLAlchemy Core for asyncio programming. Provides a simple and intuitive API for asynchronous database interactions with high performance.

Free

4.4

Details Visit

Featured

Alembic

Database Migrations for SQLAlchemy

Free

4.7

Details Visit

Flyway

Database Migration Tool

Free / Paid

4.6

Details Visit

Featured

Django Migrations

Built-in Django Migration Framework

Django's powerful built-in migration framework that comes bundled with Django. Allows you to change your database schema without losing data using a simple and intuitive API.

Free

4.8

Details Visit

Flask-Migrate

Database Migrations for Flask

Extension that handles SQLAlchemy database migrations for Flask applications using Alembic. Provides command-line tools to manage and automate database migrations in Flask projects.

Free

4.5

Details Visit

yoyo-migrations

Database Schema Migration Tool

Free

4.3

Details Visit

SQLAlchemy-Migrate

Schema Versioning for SQLAlchemy

Provides a way to deal with database schema changes in SQLAlchemy projects. Extends SQLAlchemy to have database schema versioning and migration capabilities for managing database evolution.

Free

4.2

Details Visit

South

Legacy Django Migrations

The original migration tool for Django before built-in migrations were added in Django 1.7. Still relevant for maintaining or upgrading legacy Django applications running older versions.

Free

Details Visit

Featured

Apache Kafka

Distributed Event Streaming Platform

Free

4.8

Details Visit

Featured

Apache Flink

Stream Processing Framework

Framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Known for high performance in streaming data processing with exactly-once semantics.

Free

4.7

Details Visit

Apache Storm

Real-Time Computation System

Real-time computation system making it easy to process unbounded streams of data reliably. Fast and scalable distributed real-time computation framework for stream processing.

Free

4.4

Details Visit

Faust

Python Stream Processing

Stream processing library porting ideas from Kafka Streams to Python. Used for building high-performance and reliable real-time stream processing applications with Pythonic API.

Free

4.5

Details Visit

Apache Spark Streaming

Scalable Stream Processing

Extension of the core Apache Spark API enabling scalable, high-throughput, fault-tolerant processing of live data streams. Integrated within the Spark ecosystem for complex real-time data processing tasks.

Free

4.6

Details Visit

Redpanda

Modern Streaming Platform

Streaming data platform API-compatible with Apache Kafka but designed for better performance and easier operational management. Modern streaming platform for mission-critical workloads.

Free / Paid

4.6

Details Visit

Featured

Flask

Lightweight Web Framework

Lightweight WSGI web application framework that is easy to get started with yet versatile enough for complex applications. Popular for building web APIs thanks to its simplicity and extensibility.

Free

4.8

Details Visit

Featured

Django REST Framework

Powerful API Toolkit for Django

Powerful and flexible toolkit for building Web APIs in Django. Highly recommended for adding API capabilities to Django applications with comprehensive features and excellent documentation.

Free

4.9

Details Visit

Featured

FastAPI

Modern High-Performance Framework

Modern, fast (high-performance) web framework for building APIs with Python 3.7+ based on standard type hints. Features automatic API documentation, ease of use, and fast execution.

Free

4.9

Details Visit

Tornado

Asynchronous Networking Library

Python web framework and asynchronous networking library. Particularly useful for long-polling, WebSockets, and applications requiring long-lived connections to each user.

Free

4.5

Details Visit

Falcon

High-Performance Python Framework

Reliable, high-performance Python framework for building large-scale app backends and microservices. Encourages REST architectural style while remaining highly effective and minimalist.

Free

4.6

Details Visit

Featured

Great Expectations

Data Validation & Documentation

Comprehensive tool helping data teams validate, document, and profile their data. Define expectations for your data ensuring it meets quality standards before processing.

Free / Paid

4.7

Details Visit

Ydata Profiling

Automated Data Profiling

Generates profile reports from pandas DataFrames. Excellent tool for quickly understanding data with interactive HTML reports including statistics, distributions, and correlations.

Free

4.6

Details Visit

PyDeequ

Data Quality for Big Data

Python API for Deequ, an AWS Labs library built on Apache Spark for defining and verifying data quality constraints. Useful for large-scale data processing and quality verification.

Free

4.5

Details Visit

Dedupe

ML-Powered Deduplication

Python library using machine learning to perform deduplication and entity resolution on structured data. Particularly useful for identifying and merging duplicate records.

Free

4.4

Details Visit

Soda Core

Data Quality Testing

Open-source data quality tool with a CLI for defining, running, and monitoring data quality checks. Write checks for conditions such as missing values, value ranges, or uniqueness.

Free / Paid

4.6

Details Visit
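
The three kinds of checks named above (missing values, ranges, uniqueness) can be sketched in plain Python. This is not Soda Core's check syntax: the sample rows, column names, and helper functions are all hypothetical and serve only to show what each check measures.

```python
# Hypothetical sample data: one missing age, one implausible age, one repeated id.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 2, "age": 151},
]


def missing_count(rows, column):
    """Count rows where the column is absent or None."""
    return sum(1 for r in rows if r.get(column) is None)


def out_of_range_count(rows, column, low, high):
    """Count non-missing values that fall outside [low, high]."""
    return sum(
        1
        for r in rows
        if r.get(column) is not None and not (low <= r[column] <= high)
    )


def duplicate_count(rows, column):
    """Count non-missing values beyond the first occurrence of each value."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(values) - len(set(values))


results = {
    "missing_age": missing_count(rows, "age"),
    "age_out_of_range": out_of_range_count(rows, "age", 0, 120),
    "duplicate_id": duplicate_count(rows, "id"),
}
print(results)
```

A data quality tool would typically compare such counts against thresholds and report pass or fail per check.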

DataCleaner

Automated Data Cleaning

Automatic tool for cleaning and preprocessing data. Handles missing values, encodes categorical data, and scales features making data preparation efficient.

Free

4.2

Details Visit

Data Linter

Schema Validation Tool

Python package for automated data validation within Data Engineering pipelines. Engineered to ingest and validate tabular data against predefined schemas.

Free

4.1

Details Visit

Featured

Matplotlib

Comprehensive Visualization Library

Free

4.8

Details Visit

Featured

Seaborn

Statistical Data Visualization

Free

4.7

Details Visit

Featured

Plotly

Interactive Visualization Library

Plotly offers a range of interactive plotting options and is known for its advanced graphics and interactivity, supporting complex visualizations with ease. Perfect for creating web-based dashboards.

Free / Paid

4.8

Details Visit

Bokeh

Interactive Web Visualizations

Bokeh focuses on building interactive, web-ready plots, which can be a powerful tool for creating dynamic visualizations that can easily be embedded in web applications.

Free

4.6

Details Visit

Altair

Declarative Visualization

Altair is a declarative statistical visualization library for Python, offering a simple and concise way to create a wide range of statistical plots based on a logical data mapping.

Free

4.5

Details Visit

Featured

Scikit-learn

Machine Learning in Python

Versatile library providing a range of supervised and unsupervised learning algorithms. Known for its ease of use and efficiency for data mining and data analysis with classical ML algorithms.

Free

4.9

Details Visit

Featured

TensorFlow

End-to-End ML Platform

End-to-end open-source platform for machine learning enabling complex computations with data flow graphs. Widely used for deep learning applications with robust production support.

Free

4.8

Details Visit

Featured

PyTorch

Deep Learning Framework

Open-source machine learning library known for its flexibility and ease of use, and a preferred tool for research in deep learning and artificial intelligence. Builds computation graphs dynamically.

Free

4.8

Details Visit

Keras

High-Level Neural Networks API

High-level neural networks API designed for fast experimentation with deep neural networks. Runs on top of TensorFlow, and since Keras 3 also on JAX and PyTorch, offering a user-friendly interface for building models.

Free

4.7

Details Visit

Featured

XGBoost

Extreme Gradient Boosting

Highly efficient implementation of gradient boosting designed for speed and performance. Widely used in machine learning competitions and practical applications for structured data.

Free

4.8

Details Visit

LightGBM

Light Gradient Boosting Machine

Gradient boosting framework using tree-based learning algorithms. Designed for speed and efficiency, supporting large datasets and distributed computing for various ML tasks.

Free

4.7

Details Visit

CatBoost

Gradient Boosting on Decision Trees

Algorithm for gradient boosting on decision trees developed by Yandex. Particularly effective for datasets with categorical features, known for robustness and handling overfitting well.

Free

4.6

Details Visit

Apache Hadoop

Distributed Storage and Processing Framework

Free

4.2

Details Visit

Featured

Apache Beam

Unified Batch and Stream Processing

Free

4.5

Details Visit

Featured

Boto3

AWS SDK for Python

Free

4.8

Details Visit

Featured

Google Cloud Client Libraries

GCP SDK for Python

Free

4.7

Details Visit

Azure SDK for Python

Microsoft Azure SDK

Free

4.6

Details Visit

IBM Cloud Python SDK

IBM Cloud Services SDK

Free

4.3

Details Visit

Oracle Cloud Infrastructure SDK

OCI SDK for Python

Free

4.4

Details Visit

Featured

MySQL Workbench

MySQL Database Design Tool

Free

4.4

Details Visit

ERAlchemy

ER Diagrams from SQLAlchemy

Free

4.2

Details Visit

Dia

Open Source Diagramming

Free

3.8

Details Visit

Featured

PostgreSQL

Advanced Open Source Database

Free

4.8

Details Visit

Featured

MongoDB

Document NoSQL Database

Freemium

4.6

Details Visit

Featured

Redis

In-Memory Data Store

Free

4.7

Details Visit

Apache Cassandra

Distributed Wide-Column Store

Free

4.3

Details Visit

InfluxDB

Time Series Database

Freemium

4.4

Details Visit

Featured

Elasticsearch

Distributed Search & Analytics

Freemium

4.6

Details Visit

Apache Atlas

Enterprise Data Governance

Free

4.2

Details Visit

Featured

Amundsen

Data Discovery & Metadata Engine

Free

4.5

Details Visit

CKAN

Open Data Management System

Free

4.1

Details Visit

Marquez

Metadata Service for Data Lineage

Free

4.3

Details Visit

Featured

DataHub

Modern Metadata Platform

Free

4.6

Details Visit

Featured

Stack Overflow

Q&A for Data Engineers

Free

4.7

Details Visit

Featured

Pandas

Data Analysis & Manipulation

Free

4.9

Details Visit

OpenRefine

Data Cleaning & Transformation

Free

4.5

Details Visit

ORM (encode/orm)

Lightweight Async ORM

Free

4.3

Details Visit

Featured

Python

Programming Language

Free

4.9

Details Visit

pip

Python Package Installer

Free

4.7

Details Visit

virtualenv / venv

Virtual Environment Manager

Free

4.6

Details Visit

Docker Compose

Multi-Container Orchestration

Free

4.7

Details Visit

Datasets (5)

MusicBrainz API

An open-source database that collects information about music artists, releases, and tracks.

#rest-api #json #entertainment+3

API

OpenAQ API

Retrieve real-time and historical air quality data from locations around the world.

#rest-api #json #entertainment+5

API

OpenStreetMap API

Access open-source map data and perform geolocation services using OpenStreetMap.

#rest-api #json #entertainment+4

API

Hugging Face Datasets

A library that provides access to a wide range of datasets for natural language processing (NLP) tasks.

#csv #batch-processing #machine-learning+2

Download

Natural Earth Data

Natural Earth provides public domain map datasets at various scales, covering physical and cultural features such as coastlines, rivers, cities and political boundaries.

#csv #batch-processing #maps+3

Download