Explore 98 tools and 5 datasets tagged with Open Source for Python data engineering.
Python Data Loading Library
Python library that facilitates the loading phase in ETL processes. Designed to simplify loading data into various data stores or processing systems.
Transform Data in Your Warehouse
Open-source transformation tool enabling data analysts and engineers to transform, test, and document data in the warehouse. Focuses on the transform step of ELT, where data is transformed after it has been loaded, with SQL templating and Python scripting.
Workflow Orchestration Platform
Platform to programmatically author, schedule, and monitor workflows. Allows for complex pipeline construction and efficient task management with robust dependency handling.
Data Flow Automation
Easy-to-use, powerful, and reliable system to process and distribute data, offering a web-based user interface for data flow management.
Kubernetes-Native Workflow Engine
Open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Designed for large-scale computational tasks with powerful workflow features.
Web Scraping & HTML Parsing
Library for web scraping and parsing HTML/XML documents. Extensively used in data wrangling to clean, parse, and extract data from web sources.
Object Serialization & Validation
ORM/ODM/framework-agnostic library for object serialization and deserialization. Converts complex data types to and from native Python datatypes with robust validation.
Python Data Structure Validation
Validates Python data structures with straightforward syntax and clear error messages. Ensures structure and content adhere to specified schemas.
JSON Schema Validator
Library for validating JSON data against JSON Schema standards. Essential when working with JSON data formats to ensure schema compliance.
Python SQL Toolkit & ORM
Widely used ORM library providing a full suite of enterprise-level persistence patterns. Designed for efficient, high-performing database access with flexible SQL abstraction.
Django's Built-in ORM
Part of the Django web framework. Lets you define data models entirely in Python and provides an abstraction layer that translates Python code to SQL.
Async ORM for Python
Easy-to-use asyncio ORM inspired by Django. Designed for async/await syntax, making it perfect for asynchronous applications and modern Python development.
Built-in Django Migration Framework
Migration framework bundled with Django. Lets you change your database schema without losing data, using a simple and intuitive API.
Database Migrations for Flask
Extension that handles SQLAlchemy database migrations for Flask applications using Alembic. Provides command-line tools to manage and automate database migrations in Flask projects.
Database Schema Migration Tool
Lets you manage your database schema by applying and rolling back migration scripts written in plain SQL or Python. A simple and flexible approach to database migrations.
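The apply-and-roll-back cycle described in this entry can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration of the general pattern, not the API of the listed tool: the `MIGRATIONS` list, the `applied_migrations` tracking table, and the helper functions are all hypothetical names chosen for the example.

```python
import sqlite3

# Hypothetical migrations: (id, SQL to apply, SQL to roll back).
MIGRATIONS = [
    ("0001_create_users",
     "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
     "DROP TABLE users"),
    ("0002_create_orders",
     "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)",
     "DROP TABLE orders"),
]


def applied_ids(conn):
    """Return the ids of migrations already recorded as applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_migrations (id TEXT PRIMARY KEY)"
    )
    return {row[0] for row in conn.execute("SELECT id FROM applied_migrations")}


def apply_pending(conn):
    """Apply, in order, every migration that has not been recorded yet."""
    done = applied_ids(conn)
    for mig_id, apply_sql, _ in MIGRATIONS:
        if mig_id not in done:
            conn.execute(apply_sql)
            conn.execute(
                "INSERT INTO applied_migrations (id) VALUES (?)", (mig_id,)
            )
    conn.commit()


def rollback_last(conn):
    """Undo the most recent applied migration and forget its record."""
    done = applied_ids(conn)
    for mig_id, _, rollback_sql in reversed(MIGRATIONS):
        if mig_id in done:
            conn.execute(rollback_sql)
            conn.execute(
                "DELETE FROM applied_migrations WHERE id = ?", (mig_id,)
            )
            break
    conn.commit()


def table_names(conn):
    """List the tables currently present in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {row[0] for row in rows}


conn = sqlite3.connect(":memory:")
apply_pending(conn)
tables_after_apply = table_names(conn)      # users and orders both exist
rollback_last(conn)
tables_after_rollback = table_names(conn)   # orders has been dropped again
```

Recording applied ids in a table is what makes `apply_pending` safe to run repeatedly: migrations that are already recorded are skipped.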
Schema Versioning for SQLAlchemy
Handles database schema changes in SQLAlchemy projects. Extends SQLAlchemy with schema versioning and migration capabilities.
Distributed Event Streaming Platform
Distributed event streaming platform capable of handling trillions of events a day. Used for building real-time streaming data pipelines and applications with high throughput, fault tolerance, and scalability.
Stream Processing Framework
Framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Known for high performance in streaming data processing with exactly-once semantics.
Real-Time Computation System
Real-time computation system making it easy to process unbounded streams of data reliably. Fast and scalable distributed real-time computation framework for stream processing.
Scalable Stream Processing
Extension of the Apache Spark API enabling scalable, high-throughput, fault-tolerant processing of live data streams. Integrated within the Spark ecosystem for complex real-time data processing tasks.
Powerful API Toolkit for Django
Powerful and flexible toolkit for building Web APIs in Django. Highly recommended for adding API capabilities to Django applications with comprehensive features and excellent documentation.
Data Validation & Documentation
Comprehensive tool that helps data teams validate, document, and profile their data. Lets you define expectations for your data so that it meets quality standards before processing.
Automated Data Profiling
Generates profile reports from pandas DataFrames. Excellent tool for quickly understanding data with interactive HTML reports including statistics, distributions, and correlations.
Automated Data Cleaning
Automatic tool for cleaning and preprocessing data. Handles missing values, encodes categorical data, and scales features, making data preparation more efficient.
Schema Validation Tool
Python package for automated data validation within data engineering pipelines. Engineered to ingest and validate tabular data against predefined schemas.
Comprehensive Visualization Library
Comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib is versatile and widely used for plotting graphs and charts with extensive customization options.
Machine Learning in Python
Versatile library providing a range of supervised and unsupervised learning algorithms. Known for its ease of use and efficiency for data mining and data analysis with classical ML algorithms.
End-to-End ML Platform
End-to-end open-source platform for machine learning enabling complex computations with data flow graphs. Widely used for deep learning applications with robust production support.
Distributed Storage and Processing Framework
Framework that allows for distributed processing of large datasets across clusters of computers using simple programming models. Designed to scale from single servers to thousands of machines, each offering local computation and storage. Uses HDFS for distributed storage and MapReduce for processing.
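The MapReduce programming model mentioned in this entry can be illustrated in plain Python. The sketch below runs in a single process and only mirrors the three conceptual stages (map, shuffle, reduce). It does not use the framework's API, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big clusters", "data moves to compute"])
```

In a real cluster the map and reduce calls would run on different machines, close to the data blocks they read. Here everything runs locally to keep the idea visible.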
Unified Batch and Stream Processing
Unified programming model for defining and executing data processing workflows that can run on any supported execution engine. Provides portability across multiple execution environments including Apache Flink, Apache Spark, and Google Cloud Dataflow. Ideal for building flexible, scalable data pipelines.
AWS SDK for Python
The official Amazon Web Services (AWS) SDK for Python. Enables Python developers to write software that makes use of services like Amazon S3, EC2, Lambda, and more. Provides an easy-to-use, object-oriented API as well as low-level access to AWS services, making it simple to integrate Python applications with AWS infrastructure.
GCP SDK for Python
Google Cloud Platform's official client library for Python, enabling seamless integration with GCP services like Compute Engine, Cloud Storage, BigQuery, and Pub/Sub. Designed for a Pythonic, intuitive experience when interacting with Google Cloud services, with idiomatic code patterns and comprehensive documentation.
Microsoft Azure SDK
Microsoft's comprehensive Azure SDK for Python, offering a complete set of packages to interact with Azure resources and services. Supports a wide range of Azure services including Virtual Machines, Storage, Databases, AI services, and more. Provides tools for effective resource management and service interaction within the Azure ecosystem.
IBM Cloud Services SDK
Official SDK for interacting with various IBM Cloud services programmatically. Provides comprehensive support for IBM Cloud services including CIS, DNS, IAM, VPC, Watson AI, and more. Enables management and automation of IBM Cloud resources with Python, compatible with Python 3.6 and above.
OCI SDK for Python
Official SDK for writing code to manage Oracle Cloud Infrastructure resources. Supports a wide range of Oracle Cloud services with functionality for compute, storage, networking, databases, and more. Available across multiple operating systems and Python versions, providing a robust interface for OCI resource management.
MySQL Database Design Tool
Integrated tool provided by MySQL for database design, modeling, administration, and maintenance. Provides a visual interface for creating, managing, and analyzing MySQL databases. Includes data modeling, SQL development, and comprehensive administration tools for MySQL database systems.
ER Diagrams from SQLAlchemy
Python library designed to create Entity Relationship diagrams by extracting data from databases or SQLAlchemy models. Particularly useful for database designers and developers who need to visualize and interpret complex relationships within database systems. Generates diagrams automatically from your Python code.
Open Source Diagramming
Free and open-source diagramming tool that can be used to create Entity-Relationship diagrams. Versatile application suitable for simple modeling tasks, flowcharts, network diagrams, and database schemas. Lightweight alternative for developers who need basic ER diagram functionality.
Advanced Open Source Database
Powerful, open-source object-relational database system known for reliability, feature robustness, and performance. Widely used in the Python community, with excellent support for advanced data types, JSON, full-text search, and performance optimization. ACID-compliant with strong community and enterprise adoption.
Document NoSQL Database
Scalable, flexible document database with rich querying and indexing capabilities. Stores data as JSON-like documents, making it well suited to rapid development and horizontal scaling. Supports aggregation pipelines and transactions, and has rich Python driver support through PyMongo.
In-Memory Data Store
Open-source, in-memory data structure store used as a database, cache, and message broker. Supports various data structures including strings, hashes, lists, sets, sorted sets, and streams. Provides high performance and sub-millisecond latency, and is widely used for caching, session management, and real-time analytics.
Distributed Wide-Column Store
Highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers with no single point of failure. Provides high availability and linear scalability. Ideal for applications requiring continuous availability and massive write throughput.
Time Series Database
Open-source time series database designed to handle high write and query loads for time-stamped data. Optimized for monitoring, IoT, analytics, and real-time applications. Features include retention policies, continuous queries, and InfluxQL for time-series-specific operations.
Distributed Search & Analytics
Distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. Commonly used for log analytics, full-text search, security intelligence, business analytics, and operational intelligence. Built on Apache Lucene with powerful aggregations and near real-time search.
Enterprise Data Governance
Scalable and extensible set of core foundational governance services for the Hadoop ecosystem and enterprise data. Enables organizations to effectively meet compliance requirements with metadata management, data classification, and lineage tracking. Integrates with Python through REST APIs for governance automation.
Data Discovery & Metadata Engine
Data discovery and metadata engine for improving the productivity of data analysts, scientists, and engineers when interacting with data. Provides powerful search, data previews, and column-level lineage. Integrates with Python environments and modern data stacks for comprehensive metadata management.
Open Data Management System
Powerful data management system that makes data accessible by providing tools to streamline publishing, sharing, finding, and using data. Aimed at data publishers wanting to make their data open and available. Features data cataloging, API generation, and visualization capabilities.
Metadata Service for Data Lineage
Open-source metadata service for collection, aggregation, and visualization of data ecosystem metadata. Provides a common interface to track data lineage across your entire data platform. Offers a Python client for integration and supports the OpenLineage standard for lineage collection.
Modern Metadata Platform
Open-source metadata platform for the modern data stack. Provides powerful and flexible metadata search, discovery, and lineage capabilities. Features real-time metadata updates, data quality monitoring, and governance workflows. Offers an extensive Python SDK for automation and integration.
Q&A for Data Engineers
Vast community of developers and IT professionals with extensive data engineering questions and answers. Rich resource for troubleshooting, learning from real-world problems, and discovering solutions. Active community providing quick responses to technical challenges in Python data engineering.
Data Analysis & Manipulation
Foundational library for data manipulation and analysis in Python. Provides fast, flexible, and expressive data structures (DataFrames) designed for working with structured, tabular, and time series data. Essential tool for data wrangling with comprehensive features for indexing, grouping, merging, and filtering.
Data Cleaning & Transformation
Powerful tool for working with messy data: cleaning it, transforming it from one format to another, and extending it with web services or external data. Although not a Python library, it's valuable for advanced data wrangling alongside Python tools.
Lightweight Async ORM
Lightweight and async-ready ORM designed to work with FastAPI and Starlette. Particularly suited for applications requiring asynchronous database operations with minimal overhead and modern Python async/await patterns.
Programming Language
Python is a high-level, interpreted programming language that has become the dominant language for data engineering. Known for its clear syntax, extensive standard library, and rich ecosystem of data-focused packages. Essential foundation for all Python data engineering work.
Virtual Environment Manager
Tools for creating isolated Python environments, allowing you to manage project-specific dependencies without conflicts. venv comes built into Python 3, while virtualenv offers additional features. Critical for professional Python development and maintaining clean, reproducible environments.
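As a minimal sketch of the built-in option, the standard-library `venv` module can create an environment programmatically. The `create_env` helper and the `.venv` directory name are assumptions made for this example. `with_pip=False` is passed only to keep the sketch fast, whereas the usual command-line form, `python -m venv .venv`, installs pip by default.

```python
import tempfile
import venv
from pathlib import Path


def create_env(parent_dir):
    """Create an isolated environment in <parent_dir>/.venv and return its path."""
    env_dir = Path(parent_dir) / ".venv"
    # with_pip=False skips bootstrapping pip, which keeps this example quick.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Build the environment in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_env(tmp)
    # Every environment created by venv contains a pyvenv.cfg marker file.
    has_config = (env_dir / "pyvenv.cfg").is_file()
```

In a real project the environment would live next to the code and be activated before installing dependencies, so that each project keeps its own package versions.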
Multi-Container Orchestration
Tool for defining and running multi-container Docker applications using YAML configuration files. Perfect for data engineering workflows that require multiple services like databases, message queues, and processing engines running together. Simplifies complex container setups into simple, version-controlled configurations.
An open-source database that collects information about music artists, releases, and tracks.
Retrieve real-time and historical air quality data from locations around the world.
Access open-source map data and perform geolocation services using OpenStreetMap.
A library that provides access to a wide range of datasets for natural language processing (NLP) tasks.
Natural Earth provides public domain map datasets at various scales, covering physical and cultural features such as coastlines, rivers, cities and political boundaries.