Discover 102 tools tagged with Python for data engineering.
Python Data Loading Library
Python library that facilitates the loading phase in ETL processes. Designed to simplify loading data into various data stores or processing systems.
Workflow Orchestration Platform
Platform to programmatically author, schedule, and monitor workflows. Allows for complex pipeline construction and efficient task management with robust dependency handling.
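Dependency handling of this kind comes down to running tasks in an order that respects a dependency graph. The following is a minimal sketch of that idea using only the standard library's `graphlib` module (Python 3.9+). The task names are hypothetical, and this is not the platform's own API.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"transform", "validate"},
}


def execution_order(graph):
    """Return the tasks in an order where every dependency runs first."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dependencies)
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering step.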
Web Scraping & HTML Parsing
Library for web scraping and parsing HTML/XML documents. Extensively used in data wrangling to clean, parse, and extract data from web sources.
Object Serialization & Validation
ORM/ODM/framework-agnostic library for object serialization and deserialization. Converts complex data types to and from native Python datatypes with robust validation.
Python Data Structure Validation
Validates Python data structures with straightforward syntax and clear error messages. Ensures structure and content adhere to specified schemas.
JSON Schema Validator
Library for validating JSON data against JSON Schema standards. Essential when working with JSON data formats to ensure schema compliance.
Python SQL Toolkit & ORM
Widely used ORM library providing a full suite of enterprise-level persistence patterns. Designed for efficient, high-performing database access with flexible SQL abstraction.
Django's Built-in ORM
Part of the Django web framework; lets you define data models entirely in Python. Provides an abstraction layer that translates Python code into SQL.
Async ORM for Python
Easy-to-use asyncio ORM inspired by Django. Designed around async/await syntax, making it a good fit for asynchronous applications.
Built-in Django Migration Framework
Migration framework bundled with Django. Lets you change your database schema without losing data, using a simple API.
Database Migrations for Flask
Extension that handles SQLAlchemy database migrations for Flask applications using Alembic. Provides command-line tools to manage and automate database migrations in Flask projects.
Database Schema Migration Tool
Tool for managing your database schema by applying and rolling back migration scripts written in plain SQL or Python. A simple, flexible approach to database migrations.
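The apply-and-roll-back cycle can be illustrated with the standard library's `sqlite3` module. This is only a sketch of the general technique, not the tool's own API: the `MIGRATIONS` list, the `schema_version` table, and the function names are assumptions made for the example.

```python
import sqlite3

# Hypothetical migrations: (version, SQL to apply, SQL to roll back).
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
        "DROP TABLE users"),
    (2, "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)",
        "DROP TABLE orders"),
]


def current_version(conn):
    """Return the highest applied migration version, or 0 if none."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def apply_all(conn):
    """Apply every migration newer than the recorded version."""
    done = current_version(conn)
    for version, up_sql, _ in MIGRATIONS:
        if version > done:
            conn.execute(up_sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()


def rollback_last(conn):
    """Undo only the most recently applied migration."""
    done = current_version(conn)
    for version, _, down_sql in MIGRATIONS:
        if version == done:
            conn.execute(down_sql)
            conn.execute("DELETE FROM schema_version WHERE version = ?", (version,))
    conn.commit()


def table_names(conn):
    """List the tables in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
apply_all(conn)
after_apply = table_names(conn)
rollback_last(conn)
after_rollback = table_names(conn)
print(after_apply, after_rollback)
```

Recording the applied version in the database itself is what makes re-running `apply_all` safe: migrations that are already applied are skipped.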
Schema Versioning for SQLAlchemy
Adds schema versioning and migration capabilities to SQLAlchemy projects, so that schema changes can be tracked and applied as the database evolves.
Powerful API Toolkit for Django
Powerful and flexible toolkit for building Web APIs in Django. Adds API capabilities to Django applications, with comprehensive features and thorough documentation.
Data Validation & Documentation
Comprehensive tool that helps data teams validate, document, and profile their data. Lets you define expectations for your data so that it meets quality standards before processing.
Automated Data Profiling
Generates profile reports from pandas DataFrames. Excellent tool for quickly understanding data with interactive HTML reports including statistics, distributions, and correlations.
Automated Data Cleaning
Automated tool for cleaning and preprocessing data. Handles missing values, encodes categorical data, and scales features, making data preparation more efficient.
Schema Validation Tool
Python package for automated data validation within data engineering pipelines. Built to ingest tabular data and validate it against predefined schemas.
Comprehensive Visualization Library
Comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib is versatile and widely used for plotting graphs and charts with extensive customization options.
Machine Learning in Python
Versatile library providing a range of supervised and unsupervised learning algorithms. Known for its ease of use and efficiency for data mining and data analysis with classical ML algorithms.
End-to-End ML Platform
End-to-end open-source platform for machine learning enabling complex computations with data flow graphs. Widely used for deep learning applications with robust production support.
AWS SDK for Python
The official Amazon Web Services (AWS) SDK for Python. Enables Python developers to write software that uses services such as Amazon S3, EC2, Lambda, and more. Provides an easy-to-use, object-oriented API as well as low-level access to AWS services, making it simple to integrate Python applications with AWS infrastructure.
GCP SDK for Python
Google Cloud Platform's official client library for Python, enabling seamless integration with GCP services like Compute Engine, Cloud Storage, BigQuery, and Pub/Sub. Designed for a Pythonic, intuitive experience when interacting with Google Cloud services, with idiomatic code patterns and comprehensive documentation.
Microsoft Azure SDK
Microsoft's comprehensive Azure SDK for Python, offering a complete set of packages for interacting with Azure resources and services. Supports a wide range of Azure services, including Virtual Machines, Storage, Databases, AI services, and more. Provides tools for resource management and service interaction within the Azure ecosystem.
IBM Cloud Services SDK
Official SDK for interacting with various IBM Cloud services programmatically. Provides comprehensive support for IBM Cloud services including CIS, DNS, IAM, VPC, Watson AI, and more. Enables management and automation of IBM Cloud resources with Python, compatible with Python 3.6 and above.
OCI SDK for Python
Official SDK for writing code to manage Oracle Cloud Infrastructure resources. Supports a wide range of Oracle Cloud services, with functionality for compute, storage, networking, databases, and more. Available across multiple operating systems and Python versions, providing a robust interface for OCI resource management.
ER Diagrams from SQLAlchemy
Python library designed to create Entity Relationship diagrams by extracting data from databases or SQLAlchemy models. Particularly useful for database designers and developers who need to visualize and interpret complex relationships within database systems. Generates diagrams automatically from your Python code.
Data Discovery & Metadata Engine
Data discovery and metadata engine for improving the productivity of data analysts, scientists, and engineers when they work with data. Provides powerful search, data previews, and column-level lineage. Integrates with Python environments and modern data stacks for comprehensive metadata management.
Open Data Management System
Powerful data management system that makes data accessible by providing tools to streamline publishing, sharing, finding, and using data. Aimed at data publishers wanting to make their data open and available. Features data cataloging, API generation, and visualization capabilities.
Metadata Service for Data Lineage
Open-source metadata service for the collection, aggregation, and visualization of data ecosystem metadata. Provides a common interface for tracking data lineage across your entire data platform. Offers a Python client for integration and supports the OpenLineage standard for lineage collection.
Modern Metadata Platform
Open-source metadata platform for the modern data stack. Provides powerful and flexible metadata search, discovery, and lineage capabilities. Features real-time metadata updates, data quality monitoring, and governance workflows. Includes an extensive Python SDK for automation and integration.
Data Analysis & Manipulation
Foundational library for data manipulation and analysis in Python. Provides fast, flexible, and expressive data structures (DataFrames) designed for working with structured, tabular, and time series data. Essential tool for data wrangling with comprehensive features for indexing, grouping, merging, and filtering.
Lightweight Async ORM
Lightweight and async-ready ORM designed to work with FastAPI and Starlette. Particularly suited for applications requiring asynchronous database operations with minimal overhead and modern Python async/await patterns.
Programming Language
Python is a high-level, interpreted programming language that has become the dominant language for data engineering. Known for its clear syntax, extensive standard library, and rich ecosystem of data-focused packages. Essential foundation for all Python data engineering work.
Virtual Environment Manager
Tools for creating isolated Python environments, allowing you to manage project-specific dependencies without conflicts. venv comes built into Python 3, while virtualenv offers additional features. Critical for professional Python development and maintaining clean, reproducible environments.
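Because `venv` ships with Python 3, an isolated environment can also be created from code. The following is a minimal sketch using the standard library's `venv.EnvBuilder`. The directory name `demo-env` and the helper `create_env` are hypothetical, and `with_pip=False` is chosen here only to keep the example fast and offline.

```python
import tempfile
import venv
from pathlib import Path


def create_env(target):
    """Create a virtual environment at `target` and return its config file path."""
    # with_pip=False skips bootstrapping pip into the new environment.
    builder = venv.EnvBuilder(with_pip=False)
    builder.create(target)
    return Path(target) / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / "demo-env")
    cfg_exists = cfg.exists()  # the pyvenv.cfg file marks a virtual environment

print(cfg_exists)
```

In everyday use the equivalent command-line form is `python -m venv <directory>`.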
Scalable Machine Learning Platform
A fast, scalable, open-source machine learning and artificial intelligence platform. H2O supports widely used statistical and machine learning algorithms including gradient boosted machines, random forests, deep learning, and more with Python and R APIs.
Spark's Machine Learning Library
Apache Spark's scalable machine learning library consisting of common learning algorithms and utilities including classification, regression, clustering, collaborative filtering, and dimensionality reduction. MLlib integrates seamlessly with Spark's data processing pipelines.
Python Data Pipeline Framework
An open-source Python framework for creating reproducible, maintainable, and modular data science code. Kedro applies software engineering best practices to data pipelines with built-in data catalog, pipeline visualization, and experiment tracking.
Data Transformation Framework
An open-source data transformation framework for managing, testing, and deploying SQL and Python-based data pipelines. SQLMesh provides virtual data environments, automatic change detection, and incremental processing for efficient data warehouse management.
Modern BI Web Application
A modern, enterprise-ready business intelligence web application. Superset provides an intuitive interface for creating interactive dashboards, exploring data through SQL, and building rich visualizations without writing code.
AWS Data Utility Belt for Python
A utility belt for handling data on AWS using Python. AWS Data Wrangler extends Pandas with connectors to AWS services like S3, Glue, Athena, Redshift, and more, simplifying data ingestion and extraction in AWS-based pipelines.
Delimited Data Preboarding
A delimited data preboarding framework that fills the gap between managed file transfer and the data lake. CsvPath provides a domain-specific language for validating, transforming, and routing CSV and other delimited files before ingestion.
Columnar Storage Format
A columnar storage format available to any project in the Hadoop ecosystem. Parquet provides efficient compression and encoding schemes, making it the de facto standard for analytical workloads in data lakes and warehouses.
Google's Data Interchange Format
Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. Protocol Buffers provide a compact binary format with strong typing and schema evolution, widely used in gRPC and high-performance data systems.
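Part of what keeps the binary format compact is that integers are written as base-128 varints: each byte carries seven bits of the value, and the high bit signals whether more bytes follow. The sketch below implements that encoding in plain Python for non-negative integers. It illustrates the wire-format idea only and is not the Protocol Buffers library API; the function names are hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the lowest seven bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set the continuation bit: more bytes follow
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(data):
    """Decode a base-128 varint from the start of `data`."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


encoded = encode_varint(300)
print(encoded.hex())  # → ac02
```

Small values therefore take a single byte, whereas a fixed-width 64-bit field would always take eight.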
Sensitive Data Detection & Profiling
A Python library by Capital One designed to make data analysis, monitoring, and sensitive data detection easy. Data Profiler automatically identifies data types, statistical patterns, and PII across structured and unstructured datasets.
Advanced Data Pattern Discovery
An open-source data profiler focused on discovery and validation of complex patterns in data. Desbordante finds functional dependencies, association rules, and other data constraints that go beyond basic statistical profiling.
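A functional dependency X → Y holds when any two rows that agree on the columns X also agree on the columns Y. The following plain-Python sketch checks a single candidate dependency over a small table. It is not Desbordante's API, and the sample rows and the function name `holds` are made up for illustration.

```python
def holds(rows, lhs, rhs):
    """Return True if the columns in `lhs` functionally determine those in `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # The first value seen for a key is stored; any later mismatch breaks the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample data.
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "94105", "city": "San Francisco", "name": "Ann"},
]

zip_determines_city = holds(rows, ["zip"], ["city"])
name_determines_city = holds(rows, ["name"], ["city"])
print(zip_determines_city, name_determines_city)  # → True False
```

Checking one candidate is cheap; the hard part that discovery tools address is searching the very large space of possible column combinations.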