Discover 50 tools tagged "Distributed" for Python data engineering.
Workflow Orchestration Platform
Platform to programmatically author, schedule, and monitor workflows. Apache Airflow lets you define pipelines as code, supporting complex pipeline construction and efficient task management with robust dependency handling.
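The dependency handling described above boils down to executing tasks in topological order over a DAG. The following is a minimal stdlib sketch of that idea using Python's `graphlib` (the pipeline tasks are hypothetical, and this is not Airflow's API):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: extract feeds validate and transform,
# and load runs only after both complete.
# Keys are tasks; values are the tasks they depend on.
dag = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"extract"},
    "load": {"transform", "validate"},
}

ts = TopologicalSorter(dag)
order = list(ts.static_order())  # one valid execution order respecting dependencies
print(order)
```

A real orchestrator adds scheduling, retries, and parallel execution of independent tasks on top of this ordering.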
Kubernetes-Native Workflow Engine
Open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Implemented as a Kubernetes CRD, Argo Workflows is designed for large-scale computational tasks such as machine learning and data processing pipelines.
Distributed Column-Family Store
A distributed, scalable big data store modeled after Google's Bigtable, running on top of HDFS. HBase provides random, real-time read/write access to large datasets and is commonly used for storing sparse data in the Hadoop ecosystem.
High-Performance Cassandra Alternative
A NoSQL database compatible with Apache Cassandra but built in C++ for significantly higher throughput and lower latency. ScyllaDB is designed for data-intensive applications requiring consistent single-digit millisecond performance at scale.
Fast Columnar OLAP Database
An open-source columnar database management system designed for online analytical processing (OLAP). ClickHouse delivers exceptional query performance on large datasets, making it ideal for real-time analytics, log analysis, and time-series data.
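The columnar layout behind ClickHouse's query speed can be illustrated with a toy sketch: each column is stored contiguously, so an aggregate reads only the columns it touches. This is a stdlib illustration of the storage idea, not ClickHouse's engine, and the table data is invented:

```python
# Column-oriented table: each column stored as its own list.
events = {
    "user_id": [1, 2, 1, 3, 2],
    "latency_ms": [120, 85, 95, 240, 60],
    "region": ["eu", "us", "eu", "ap", "us"],
}

# Equivalent of: SELECT avg(latency_ms) WHERE region = 'eu'
# Only the region and latency_ms columns are scanned; user_id is never read.
mask = [r == "eu" for r in events["region"]]
selected = [v for v, m in zip(events["latency_ms"], mask) if m]
avg_eu = sum(selected) / len(selected)
print(avg_eu)
```

Real columnar stores add per-column compression and vectorized execution on top of this layout, which is why OLAP scans over a few columns of a wide table are so cheap.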
Distributed Columnar Streaming Database
A distributed, columnar, versioned, and streaming database designed for real-time and batch analytics. FiloDB combines the benefits of columnar storage with streaming ingestion, making it suitable for time-series and event data workloads.
Distributed NoSQL Cloud Database
A distributed NoSQL cloud database built for performance, scalability, and availability in modern applications. Couchbase supports key-value, document, and SQL-like (SQL++) queries with built-in full-text search and analytics.
Distributed In-Memory Database
An open-source, distributed, in-memory database providing reliable asynchronous event notifications and guaranteed message delivery. Apache Geode pools memory, CPU, network resources, and local disk storage across multiple processes for high-performance data management.
Real-Time Analytics Database
A column-oriented, distributed data store designed for sub-second OLAP queries on event data. Druid is used for powering interactive analytical applications, real-time dashboards, and exploratory analytics on high-cardinality data.
Distributed Stream Processing Framework
A distributed stream processing framework that uses Apache Kafka for messaging and Apache Hadoop YARN for fault tolerance, processor isolation, security, and resource management. Samza provides a simple API for building stateful stream processing applications.
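Stateful stream processing, as the description mentions, means each task keeps local state keyed by the messages it sees. A minimal stdlib sketch of a keyed counter (this is the general pattern, not Samza's actual API, and the event stream is invented):

```python
from collections import defaultdict

def process(stream):
    """Toy stateful stream processor: maintains a per-key count and
    emits the updated count downstream after each event."""
    state = defaultdict(int)      # local state, one counter per key
    for key, _value in stream:
        state[key] += 1
        yield key, state[key]

events = [("page_view", 1), ("click", 1), ("page_view", 1)]
out = list(process(events))
print(out)
```

In Samza this local state is durable (backed by a changelog in Kafka) so a restarted task can rebuild it, which is what makes the pattern fault-tolerant.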
Incremental Data Processing Framework
An open-source framework for managing storage for real-time data processing on top of data lakes. Hudi provides record-level insert, update, and delete capabilities along with change streams, enabling incremental data pipelines on large-scale datasets.
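The record-level insert/update/delete capability described above is essentially a keyed merge of an incoming batch into an existing table. A stdlib sketch of that merge semantics (a toy in the spirit of Hudi's upserts, not Hudi's API; the records are invented):

```python
def upsert(table, batch):
    """Merge a batch of (key, value) records into a keyed table:
    new keys insert, existing keys update, and a None value acts
    as a tombstone that deletes the record."""
    merged = dict(table)
    for key, value in batch:
        if value is None:          # tombstone -> delete
            merged.pop(key, None)
        else:
            merged[key] = value    # insert or update in place
    return merged

table = {"u1": {"name": "Ada"}, "u2": {"name": "Lin"}}
batch = [("u2", {"name": "Linh"}), ("u3", {"name": "Bo"}), ("u1", None)]
table = upsert(table, batch)
print(table)
```

Hudi applies this merge at file-group granularity on the data lake and additionally exposes the applied changes as an incremental stream, which is what enables incremental pipelines.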
DAG-Based Processing Framework
An application framework for complex directed-acyclic-graph (DAG) based data processing tasks, built on top of Apache Hadoop YARN. Tez generalizes MapReduce to enable more efficient data processing pipelines with fewer read/write cycles.
Data Warehouse on Hadoop
A data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive provides a SQL-like interface (HiveQL) for querying data stored in Hadoop's HDFS and other compatible systems.
Schema-Free SQL Query Engine
A schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. Drill enables analysts and data scientists to query self-describing data like JSON, Parquet, and CSV without requiring predefined schemas or ETL transformations.
Scalable Machine Learning Platform
A fast, scalable, open-source machine learning and artificial intelligence platform. H2O supports widely used statistical and machine learning algorithms including gradient boosted machines, random forests, deep learning, and more with Python and R APIs.
Distributed Machine Learning
An environment for quickly creating scalable, performant machine learning applications. Mahout provides a mathematically expressive Scala DSL and supports Apache Spark and Apache Flink backends for distributed linear algebra operations.
Spark's Machine Learning Library
Apache Spark's scalable machine learning library consisting of common learning algorithms and utilities including classification, regression, clustering, collaborative filtering, and dimensionality reduction. MLlib integrates seamlessly with Spark's data processing pipelines.
Spark's Graph Processing API
Apache Spark's API for graphs and graph-parallel computation. GraphX extends the Spark RDD with a graph abstraction, providing a set of fundamental operators and optimized algorithms for graph analytics like PageRank and connected components.
Large-Scale Graph Processing
An iterative graph processing system built for high scalability, used at Facebook to analyze the social graph. Giraph processes billions of vertices and edges efficiently on Hadoop infrastructure using a vertex-centric programming model.
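In the vertex-centric model mentioned above, each vertex runs the same small program every superstep, looking only at values from its neighbours. A stdlib sketch of connected components in that style (a single-machine toy, not Giraph's API; the graph is invented):

```python
def connected_components(edges, vertices):
    """Each superstep, every vertex adopts the largest component id
    among itself and its neighbours; the loop stops when a superstep
    changes nothing. Vertices in the same component converge to one id."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    label = {v: v for v in vertices}   # initial component id = own id
    while True:                        # one loop iteration = one superstep
        new_label = {
            v: max([label[v]] + [label[n] for n in neighbours[v]])
            for v in vertices
        }
        if new_label == label:
            break
        label = new_label
    return label

labels = connected_components([(1, 2), (2, 3), (4, 5)], [1, 2, 3, 4, 5])
print(labels)
```

Giraph distributes exactly this pattern: vertices are partitioned across workers, neighbour values travel as messages, and supersteps are separated by a global barrier.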
Hadoop Workflow Scheduler
A workflow scheduler system for managing Apache Hadoop jobs. Oozie supports MapReduce, Pig, Hive, and Sqoop jobs through a coordinator and workflow engine, enabling complex multi-stage data processing pipelines on Hadoop clusters.
Unified Metadata Management
An open-source, unified metadata management platform for data lakes, data warehouses, and external catalogs. Gravitino provides a single point of access for managing metadata across diverse data sources, simplifying governance and discovery.
Open Source Message Broker
A robust, open-source message broker that supports multiple messaging protocols including AMQP, MQTT, and STOMP. RabbitMQ provides reliable message delivery with flexible routing, clustering, and federation for distributed data ingestion pipelines.
Distributed Pub-Sub Messaging
An open-source distributed pub-sub messaging system originally created by Yahoo. Pulsar provides multi-tenancy, geo-replication, and unified messaging and streaming with a serverless compute framework for lightweight processing.
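The pub-sub model behind systems like Pulsar decouples producers from consumers through named topics. A minimal in-memory stdlib sketch of that model (a toy illustration only, not Pulsar's client API):

```python
from collections import defaultdict

class Broker:
    """Toy topic-based pub-sub broker: publishers and subscribers
    know only topic names, never each other."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of inboxes

    def subscribe(self, topic):
        inbox = []                             # a trivial in-memory "queue"
        self.subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        for inbox in self.subscribers[topic]:  # fan out to every subscriber
            inbox.append(message)

broker = Broker()
a = broker.subscribe("metrics")
b = broker.subscribe("metrics")
broker.publish("metrics", {"cpu": 0.7})
print(a, b)
```

Pulsar layers durability, partitioning, subscription modes, and geo-replication on top of this basic fan-out contract.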
Hadoop-RDBMS Data Transfer
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop uses MapReduce for parallel data transfer with support for incremental imports and direct connector APIs.
Universal Data Ingestion Framework
A universal data ingestion framework for Hadoop, originally developed at LinkedIn and now an Apache project. Gobblin handles the complete data ingestion lifecycle including extraction, transformation, quality checks, and publishing for both batch and streaming data sources.
Cross-Language Services Framework
A software framework for scalable cross-language services development. Thrift combines a serialization format with an RPC framework, enabling efficient communication between services written in different programming languages.
Transactional Data Lake Catalog
A transactional catalog for data lakes with git-like semantics. Nessie works with Apache Iceberg tables to provide multi-table transactions, branching, tagging, and time-travel queries across your data lake.