Discover 50 tools tagged "Distributed" for Python data engineering.
Workflow Orchestration Platform
Platform to programmatically author, schedule, and monitor workflows. Apache Airflow lets you define pipelines as code, supporting complex pipeline construction and efficient task management with robust dependency handling.
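The dependency handling described above boils down to executing tasks in topological order over a DAG. The following is a minimal stdlib sketch of that idea using Python's `graphlib` (the pipeline tasks are hypothetical, and this is not Airflow's API):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: extract feeds validate and transform,
# and load runs only after both complete.
# Keys are tasks; values are the tasks they depend on.
dag = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"extract"},
    "load": {"transform", "validate"},
}

ts = TopologicalSorter(dag)
order = list(ts.static_order())  # one valid execution order respecting dependencies
print(order)
```

A real orchestrator adds scheduling, retries, and parallel execution of independent tasks on top of this ordering.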
Kubernetes-Native Workflow Engine
Open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Implemented as a Kubernetes CRD, Argo Workflows is designed for large-scale computational tasks such as machine learning and data processing pipelines.
Distributed Column-Family Store
A distributed, scalable big data store modeled after Google's Bigtable, running on top of HDFS. HBase provides random, real-time read/write access to large datasets and is commonly used for storing sparse data in the Hadoop ecosystem.
High-Performance Cassandra Alternative
A NoSQL database compatible with Apache Cassandra but built in C++ for significantly higher throughput and lower latency. ScyllaDB is designed for data-intensive applications requiring consistent single-digit millisecond performance at scale.
Fast Columnar OLAP Database
An open-source columnar database management system designed for online analytical processing (OLAP). ClickHouse delivers exceptional query performance on large datasets, making it ideal for real-time analytics, log analysis, and time-series data.
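The columnar layout behind ClickHouse's query speed can be illustrated with a toy sketch: each column is stored contiguously, so an aggregate reads only the columns it touches. This is a stdlib illustration of the storage idea, not ClickHouse's engine, and the table data is invented:

```python
# Column-oriented table: each column stored as its own list.
events = {
    "user_id": [1, 2, 1, 3, 2],
    "latency_ms": [120, 85, 95, 240, 60],
    "region": ["eu", "us", "eu", "ap", "us"],
}

# Equivalent of: SELECT avg(latency_ms) WHERE region = 'eu'
# Only the region and latency_ms columns are scanned; user_id is never read.
mask = [r == "eu" for r in events["region"]]
selected = [v for v, m in zip(events["latency_ms"], mask) if m]
avg_eu = sum(selected) / len(selected)
print(avg_eu)
```

Real columnar stores add per-column compression and vectorized execution on top of this layout, which is why OLAP scans over a few columns of a wide table are so cheap.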
Distributed Columnar Streaming Database
A distributed, columnar, versioned, and streaming database designed for real-time and batch analytics. FiloDB combines the benefits of columnar storage with streaming ingestion, making it suitable for time-series and event data workloads.
Distributed NoSQL Cloud Database
A distributed NoSQL cloud database built for performance, scalability, and availability in modern applications. Couchbase supports key-value, document, and SQL-like (SQL++) queries with built-in full-text search and analytics.
Distributed In-Memory Database
An open-source, distributed, in-memory database providing reliable asynchronous event notifications and guaranteed message delivery. Apache Geode pools memory, CPU, network resources, and local disk storage across multiple processes for high-performance data management.
Real-Time Analytics Database
A column-oriented, distributed data store designed for sub-second OLAP queries on event data. Druid is used for powering interactive analytical applications, real-time dashboards, and exploratory analytics on high-cardinality data.
Distributed Stream Processing Framework
A distributed stream processing framework that uses Apache Kafka for messaging and Apache Hadoop YARN for fault tolerance, processor isolation, security, and resource management. Samza provides a simple API for building stateful stream processing applications.
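Stateful stream processing, as the description mentions, means each task keeps local state keyed by the messages it sees. A minimal stdlib sketch of a keyed counter (this is the general pattern, not Samza's actual API, and the event stream is invented):

```python
from collections import defaultdict

def process(stream):
    """Toy stateful stream processor: maintains a per-key count and
    emits the updated count downstream after each event."""
    state = defaultdict(int)      # local state, one counter per key
    for key, _value in stream:
        state[key] += 1
        yield key, state[key]

events = [("page_view", 1), ("click", 1), ("page_view", 1)]
out = list(process(events))
print(out)
```

In Samza this local state is durable (backed by a changelog in Kafka) so a restarted task can rebuild it, which is what makes the pattern fault-tolerant.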
Incremental Data Processing Framework
An open-source framework for managing storage for real-time data processing on top of data lakes. Hudi provides record-level insert, update, and delete capabilities along with change streams, enabling incremental data pipelines on large-scale datasets.
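The record-level insert/update/delete capability described above is essentially a keyed merge of an incoming batch into an existing table. A stdlib sketch of that merge semantics (a toy in the spirit of Hudi's upserts, not Hudi's API; the records are invented):

```python
def upsert(table, batch):
    """Merge a batch of (key, value) records into a keyed table:
    new keys insert, existing keys update, and a None value acts
    as a tombstone that deletes the record."""
    merged = dict(table)
    for key, value in batch:
        if value is None:          # tombstone -> delete
            merged.pop(key, None)
        else:
            merged[key] = value    # insert or update in place
    return merged

table = {"u1": {"name": "Ada"}, "u2": {"name": "Lin"}}
batch = [("u2", {"name": "Linh"}), ("u3", {"name": "Bo"}), ("u1", None)]
table = upsert(table, batch)
print(table)
```

Hudi applies this merge at file-group granularity on the data lake and additionally exposes the applied changes as an incremental stream, which is what enables incremental pipelines.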
DAG-Based Processing Framework
An application framework for complex directed-acyclic-graph (DAG) based data processing tasks, built on top of Apache Hadoop YARN. Tez generalizes MapReduce to enable more efficient data processing pipelines with fewer read/write cycles.
Data Warehouse on Hadoop
A data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive provides a SQL-like interface (HiveQL) for querying data stored in Hadoop's HDFS and other compatible systems.
Schema-Free SQL Query Engine
A schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. Drill enables analysts and data scientists to query self-describing data like JSON, Parquet, and CSV without requiring predefined schemas or ETL transformations.
Scalable Machine Learning Platform
A fast, scalable, open-source machine learning and artificial intelligence platform. H2O supports widely used statistical and machine learning algorithms including gradient boosted machines, random forests, deep learning, and more with Python and R APIs.
Distributed Machine Learning
An environment for quickly creating scalable, performant machine learning applications. Mahout provides a mathematically expressive Scala DSL and supports Apache Spark and Apache Flink backends for distributed linear algebra operations.
Spark's Machine Learning Library
Apache Spark's scalable machine learning library consisting of common learning algorithms and utilities including classification, regression, clustering, collaborative filtering, and dimensionality reduction. MLlib integrates seamlessly with Spark's data processing pipelines.
Spark's Graph Processing API
Apache Spark's API for graphs and graph-parallel computation. GraphX extends the Spark RDD with a graph abstraction, providing a set of fundamental operators and optimized algorithms for graph analytics like PageRank and connected components.
Large-Scale Graph Processing
An iterative graph processing system built for high scalability, used at Facebook to analyze the social graph. Giraph processes billions of vertices and edges efficiently on Hadoop infrastructure using a vertex-centric programming model.
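In the vertex-centric model mentioned above, each vertex runs the same small program every superstep, looking only at values from its neighbours. A stdlib sketch of connected components in that style (a single-machine toy, not Giraph's API; the graph is invented):

```python
def connected_components(edges, vertices):
    """Each superstep, every vertex adopts the largest component id
    among itself and its neighbours; the loop stops when a superstep
    changes nothing. Vertices in the same component converge to one id."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    label = {v: v for v in vertices}   # initial component id = own id
    while True:                        # one loop iteration = one superstep
        new_label = {
            v: max([label[v]] + [label[n] for n in neighbours[v]])
            for v in vertices
        }
        if new_label == label:
            break
        label = new_label
    return label

labels = connected_components([(1, 2), (2, 3), (4, 5)], [1, 2, 3, 4, 5])
print(labels)
```

Giraph distributes exactly this pattern: vertices are partitioned across workers, neighbour values travel as messages, and supersteps are separated by a global barrier.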
Hadoop Workflow Scheduler
A workflow scheduler system for managing Apache Hadoop jobs. Oozie supports MapReduce, Pig, Hive, and Sqoop jobs through a coordinator and workflow engine, enabling complex multi-stage data processing pipelines on Hadoop clusters.
Unified Metadata Management
An open-source, unified metadata management platform for data lakes, data warehouses, and external catalogs. Gravitino provides a single point of access for managing metadata across diverse data sources, simplifying governance and discovery.
Open Source Message Broker
A robust, open-source message broker that supports multiple messaging protocols including AMQP, MQTT, and STOMP. RabbitMQ provides reliable message delivery with flexible routing, clustering, and federation for distributed data ingestion pipelines.
Distributed Pub-Sub Messaging
An open-source distributed pub-sub messaging system originally created by Yahoo. Pulsar provides multi-tenancy, geo-replication, and unified messaging and streaming with a serverless compute framework for lightweight processing.
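The pub-sub model behind systems like Pulsar decouples producers from consumers through named topics. A minimal in-memory stdlib sketch of that model (a toy illustration only, not Pulsar's client API):

```python
from collections import defaultdict

class Broker:
    """Toy topic-based pub-sub broker: publishers and subscribers
    know only topic names, never each other."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of inboxes

    def subscribe(self, topic):
        inbox = []                             # a trivial in-memory "queue"
        self.subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        for inbox in self.subscribers[topic]:  # fan out to every subscriber
            inbox.append(message)

broker = Broker()
a = broker.subscribe("metrics")
b = broker.subscribe("metrics")
broker.publish("metrics", {"cpu": 0.7})
print(a, b)
```

Pulsar layers durability, partitioning, subscription modes, and geo-replication on top of this basic fan-out contract.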
Hadoop-RDBMS Data Transfer
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop uses MapReduce for parallel data transfer with support for incremental imports and direct connector APIs.
Universal Data Ingestion Framework
A universal data ingestion framework for Hadoop, originally developed at LinkedIn and now an Apache project. Gobblin handles the complete data ingestion lifecycle including extraction, transformation, quality checks, and publishing for both batch and streaming data sources.
Cross-Language Services Framework
A software framework for scalable cross-language services development. Thrift combines a serialization format with an RPC framework, enabling efficient communication between services written in different programming languages.
Transactional Data Lake Catalog
A transactional catalog for data lakes with git-like semantics. Nessie works with Apache Iceberg tables to provide multi-table transactions, branching, tagging, and time-travel queries across your data lake.