JVM Tools & Datasets for Python Data Engineering

Discover 28 tools tagged with JVM for Python data engineering.

Tools (28)

Apache HBase

Distributed Column-Family Store

A distributed, scalable big data store modeled after Google's Bigtable, running on top of HDFS. HBase provides random, real-time read/write access to large datasets and is commonly used for storing sparse data in the Hadoop ecosystem.

Free

4.2

Details Visit

OrientDB

Multi-Model Graph & Document Database

A multi-model open-source NoSQL database that supports graph, document, key-value, and object models. OrientDB offers SQL-like queries on graph data and is well suited for applications with complex relationships and connected data.

Free

3.9

Details Visit

Titan

Scalable Graph Database

A scalable graph database optimized for storing and querying large graphs with billions of vertices and edges across a multi-machine cluster. Titan supports various storage backends including Cassandra, HBase, and BerkeleyDB.

Free

3.6

Details Visit

Apache Geode

Distributed In-Memory Database

An open-source, distributed, in-memory database providing reliable asynchronous event notifications and guaranteed message delivery. Apache Geode pools memory, CPU, network resources, and local disk storage across multiple processes for high-performance data management.

Free

3.8

Details Visit

Datomic

Immutable Transactional Database

A fully transactional, cloud-ready, distributed database built on immutable data and designed for value-centric programming. Datomic stores facts with full history, enabling time-travel queries and simplified application architecture.

Details Visit

OpenTSDB

Scalable Time Series Database

A scalable, distributed time series database built on HBase. OpenTSDB stores, indexes, and serves metrics collected from various sources at massive scale, making it ideal for monitoring infrastructure and IoT data collection.

Free

3.9

Details Visit

Apache Druid

Real-Time Analytics Database

A column-oriented, distributed data store designed for sub-second OLAP queries on event data. Druid is used for powering interactive analytical applications, real-time dashboards, and exploratory analytics on high-cardinality data.

Free

4.3

Details Visit

Apache Samza

Distributed Stream Processing Framework

A distributed stream processing framework that uses Apache Kafka for messaging and Apache Hadoop YARN for fault tolerance, processor isolation, security, and resource management. Samza provides a simple API for building stateful stream processing applications.

Free

Details Visit

Apache Tez

DAG-Based Processing Framework

An application framework for complex directed-acyclic-graph (DAG) based data processing tasks, built on top of Apache Hadoop YARN. Tez generalizes MapReduce to enable more efficient data processing pipelines with fewer read/write cycles.

Free

Details Visit

Featured

Presto

Distributed SQL Query Engine

A distributed SQL query engine designed to query large datasets distributed over one or more heterogeneous data sources. Presto enables interactive analytics on petabytes of data across data lakes, warehouses, and databases using standard SQL.

Free

4.5

Details Visit

Apache Hive

Data Warehouse on Hadoop

A data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive provides a SQL-like interface (HiveQL) for querying data stored in Hadoop's HDFS and other compatible systems.

Free

4.3

Details Visit

Apache Drill

Schema-Free SQL Query Engine

A schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. Drill enables analysts and data scientists to query self-describing data like JSON, Parquet, and CSV without requiring predefined schemas or ETL transformations.

Free

Details Visit

Apache Mahout

Distributed Machine Learning

An environment for quickly creating scalable, performant machine learning applications. Mahout provides mathematically expressive Scala DSL and supports Apache Spark and Apache Flink backends for distributed linear algebra operations.

Free

3.6

Details Visit

Apache Giraph

Large-Scale Graph Processing

An iterative graph processing system built for high scalability, used at Facebook to analyze the social graph. Giraph processes billions of vertices and edges efficiently on Hadoop infrastructure using a vertex-centric programming model.

Free

3.7

Details Visit

Azkaban

Hadoop Workflow Scheduler

A batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban provides a web UI for managing workflows, handles job dependencies, and supports pluggable job types for running various data processing tasks.

Free

3.8

Details Visit

Apache Oozie

Hadoop Workflow Scheduler

A workflow scheduler system for managing Apache Hadoop jobs. Oozie supports MapReduce, Pig, Hive, and Sqoop jobs through a coordinator and workflow engine, enabling complex multi-stage data processing pipelines on Hadoop clusters.

Free

3.6

Details Visit

Embulk

Bulk Data Loader

An open-source bulk data loader that helps transfer data between various databases, storages, file formats, and cloud services. Embulk supports parallel processing and plugin-based architecture for extensible data transfer pipelines.

Free

3.9

Details Visit

Apache Gravitino

Unified Metadata Management

An open-source, unified metadata management platform for data lakes, data warehouses, and external catalogs. Gravitino provides a single point of access for managing metadata across diverse data sources, simplifying governance and discovery.

Free

Details Visit

Featured

Apache Pulsar

Distributed Pub-Sub Messaging

An open-source distributed pub-sub messaging system originally created by Yahoo. Pulsar provides multi-tenancy, geo-replication, and unified messaging and streaming with a serverless compute framework for lightweight processing.

Free

4.5

Details Visit

Apache Sqoop

Hadoop-RDBMS Data Transfer

A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop uses MapReduce for parallel data transfer with support for incremental imports and direct connector APIs.

Free

3.8

Details Visit

Apache Gobblin

Universal Data Ingestion Framework

A universal data ingestion framework for Hadoop from LinkedIn. Gobblin handles the complete data ingestion lifecycle including extraction, transformation, quality checks, and publishing for both batch and streaming data sources.

Free

3.9

Details Visit

Pravega

Stream Storage System

An open-source storage system that provides a new abstraction — a stream — for continuous and unbounded data. Pravega offers auto-scaling, exactly-once semantics, and durable storage for building reliable streaming data ingestion pipelines.

Free

3.7

Details Visit

Featured

Apache Avro

Schema-Based Data Serialization

A data serialization system that provides rich data structures, a compact binary format, and schema evolution support. Avro is widely used in Apache Kafka ecosystems for encoding messages with schema registry integration.

Free

4.5

Details Visit

Apache ORC

Optimized Row Columnar Format

The smallest, fastest columnar storage format for Hadoop workloads. ORC provides highly efficient compression, predicate pushdown, and ACID transaction support, making it ideal for Hive-based data warehousing.

Free

4.3

Details Visit

Kryo

Fast JVM Serialization Framework

A fast and efficient object graph serialization framework for Java. Kryo is commonly used as the serialization backend for Apache Spark and other JVM-based data processing frameworks for high-performance data exchange.

Free

4.1

Details Visit

Featured

HDFS

Hadoop Distributed File System

A distributed file system designed to run on commodity hardware as part of the Apache Hadoop ecosystem. HDFS provides high-throughput access to application data and is the foundation for storing massive datasets in Hadoop-based data platforms.

Free

4.4

Details Visit

Alluxio

Memory-Centric Storage System

A memory-centric distributed storage system that acts as a caching layer between compute frameworks and storage systems. Alluxio accelerates data access by serving hot data from memory, bridging the gap between compute and storage.

Freemium

4.2

Details Visit

Project Nessie

Transactional Data Lake Catalog

A transactional catalog for data lakes with git-like semantics. Nessie works with Apache Iceberg tables to provide multi-table transactions, branching, tagging, and time-travel queries across your data lake.

Free

4.3

Details Visit

Tools (28)