Tools for collecting and ingesting data from various sources into storage and processing systems.
Data ingestion tools are specialized systems designed to collect, import, and transfer data from diverse sources into storage or processing systems. These tools handle the critical first step in any data pipeline — getting raw data from databases, APIs, message queues, files, and streaming sources into a centralized location for further processing. They support both batch and real-time ingestion patterns, ensuring data is reliably captured and delivered to downstream systems like data warehouses, data lakes, or stream processors.
Open Source Message Broker
A robust, open-source message broker that supports multiple messaging protocols including AMQP, MQTT, and STOMP. RabbitMQ provides reliable message delivery with flexible routing, clustering, and federation for distributed data ingestion pipelines.
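As a minimal sketch of AMQP-based ingestion with the pika client: the helper below serializes an event, and the publish function sends it to a durable queue. The host, queue name, and event fields are placeholders, not RabbitMQ defaults.

```python
import json

def build_event(source: str, payload: dict) -> bytes:
    """Serialize an ingestion event as a JSON message body."""
    return json.dumps({"source": source, "payload": payload}).encode("utf-8")

def publish(host: str, queue: str, body: bytes) -> None:
    """Publish one persistent message to a durable queue over AMQP."""
    import pika  # third-party client, assumed installed (pip install pika)
    conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = conn.channel()
    channel.queue_declare(queue=queue, durable=True)  # queue survives broker restarts
    channel.basic_publish(
        exchange="",      # default exchange routes directly by queue name
        routing_key=queue,
        body=body,
        properties=pika.BasicProperties(delivery_mode=2),  # mark message persistent
    )
    conn.close()
```

Durable queues plus persistent messages give at-least-once delivery across broker restarts, which is usually the baseline guarantee an ingestion pipeline needs.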
Distributed Pub-Sub Messaging
An open-source distributed pub-sub messaging system originally created by Yahoo. Pulsar provides multi-tenancy, geo-replication, and unified messaging and streaming with a serverless compute framework for lightweight processing.
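Pulsar's multi-tenancy shows up directly in topic naming, where every topic lives under a tenant and namespace. A small sketch using the pulsar-client package; the service URL, tenant, and namespace below are placeholders.

```python
def topic_path(tenant: str, namespace: str, topic: str) -> str:
    """Build a fully qualified Pulsar topic name; tenant and namespace
    are the units of multi-tenant isolation and policy."""
    return f"persistent://{tenant}/{namespace}/{topic}"

def send_one(service_url: str, topic: str, payload: bytes) -> None:
    """Produce a single message (sketch; requires a running Pulsar cluster)."""
    import pulsar  # pulsar-client package, assumed installed
    client = pulsar.Client(service_url)       # e.g. "pulsar://localhost:6650"
    producer = client.create_producer(topic)
    producer.send(payload)
    client.close()
```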
Hadoop-RDBMS Data Transfer
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop uses MapReduce for parallel data transfer, with support for incremental imports and direct connector APIs; the project was retired to the Apache Attic in 2021 but remains in use in existing Hadoop deployments.
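Sqoop is driven from the command line, so an incremental import is easiest to show as the argument list it expects. The JDBC URL, table, and column names below are hypothetical; the helper just assembles the command for something like subprocess.run to execute on a host with Sqoop and Hadoop configured.

```python
def sqoop_incremental_import(jdbc_url: str, table: str, target_dir: str,
                             check_column: str, last_value: int,
                             mappers: int = 4) -> list:
    """Assemble a sqoop CLI invocation for an append-mode incremental import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,           # e.g. "jdbc:mysql://db.example.com/sales"
        "--table", table,
        "--target-dir", target_dir,      # HDFS output directory
        "--incremental", "append",       # only rows with check_column > last_value
        "--check-column", check_column,
        "--last-value", str(last_value),
        "--num-mappers", str(mappers),   # parallel map tasks for the transfer
    ]
```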
Universal Data Ingestion Framework
A universal data ingestion framework originally developed at LinkedIn and later donated to the Apache Software Foundation. Gobblin handles the complete data ingestion lifecycle including extraction, transformation, quality checks, and publishing for both batch and streaming data sources.
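Gobblin jobs are declared in properties-style job files that wire a source, converters, a writer, and a publisher into one pipeline. An illustrative fragment follows; the key names reflect Gobblin's job configuration convention, but the com.example class names are placeholders for real source and converter implementations.

```properties
# Illustrative Gobblin job file; com.example classes are placeholders
job.name=OrdersIngest
job.group=Ingestion
source.class=com.example.OrdersSource
converter.classes=com.example.OrdersJsonConverter
extract.namespace=com.example.orders
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```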
Event Messaging Platform
An open-source event messaging platform, developed at Zalando, that provides a REST API on top of Kafka-like queues. Nakadi simplifies event streaming by offering schema registration, data governance, and subscription-based consumption without direct Kafka client management.
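Because Nakadi is REST-based, registering an event type is an HTTP POST with a JSON Schema attached. A stdlib-only sketch under the assumption of Nakadi's /event-types endpoint and OAuth bearer tokens; the event-type name and schema are made up for illustration.

```python
import json

def event_type_definition(name: str, owner: str, schema: dict) -> dict:
    """Build the JSON body for registering an event type via POST /event-types."""
    return {
        "name": name,
        "owning_application": owner,
        "category": "business",
        "partition_strategy": "random",
        # Nakadi expects the JSON Schema embedded as a string
        "schema": {"type": "json_schema", "schema": json.dumps(schema)},
    }

def register_event_type(base_url: str, token: str, definition: dict) -> None:
    """POST the definition to a Nakadi instance (sketch; not executed here)."""
    from urllib import request
    req = request.Request(
        f"{base_url}/event-types",
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    request.urlopen(req)
```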
Managed Real-Time Streaming
A fully managed, cloud-based service from AWS for real-time data streaming and processing. Kinesis enables collecting, processing, and analyzing streaming data at any scale, with integrations across the AWS ecosystem.
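Writing to Kinesis comes down to a PutRecord call: a bytes payload plus a partition key that determines which shard receives the record. A short boto3 sketch; the stream name is a placeholder and the call assumes AWS credentials are configured.

```python
import json

def make_record(data: dict, partition_key: str) -> dict:
    """Shape one record for Kinesis PutRecord: bytes payload plus a
    partition key that controls shard assignment (and thus ordering)."""
    return {"Data": json.dumps(data).encode("utf-8"), "PartitionKey": partition_key}

def put_record(stream: str, record: dict) -> None:
    """Send a single record (sketch; requires AWS credentials and a stream)."""
    import boto3  # AWS SDK for Python, assumed installed
    client = boto3.client("kinesis")
    client.put_record(StreamName=stream, **record)
```

Records sharing a partition key land on the same shard in order, so keying by an entity ID (e.g. a customer ID) preserves per-entity ordering.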
AWS Data Utility Belt for Python
A utility belt for handling data on AWS using Python, since renamed AWS SDK for pandas. AWS Data Wrangler extends pandas with connectors to AWS services such as S3, Glue, Athena, and Redshift, simplifying data ingestion and extraction in AWS-based pipelines.
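A typical ingestion step with the awswrangler package is writing a DataFrame to S3 as Parquet while registering it in the Glue catalog. A sketch under the assumption of configured AWS credentials; the bucket, database, and table names are placeholders.

```python
def dataset_path(bucket: str, prefix: str, table: str) -> str:
    """S3 prefix where the partitioned Parquet dataset will land."""
    return f"s3://{bucket}/{prefix}/{table}/"

def ingest_to_glue_table(rows: list, path: str, database: str, table: str) -> None:
    """Write rows as a Glue-catalogued Parquet dataset (sketch; needs AWS access)."""
    import pandas as pd
    import awswrangler as wr  # published on PyPI as awswrangler
    df = pd.DataFrame(rows)
    # dataset=True enables partition-aware writes and catalog registration
    wr.s3.to_parquet(df=df, path=path, dataset=True, database=database, table=table)
```

Once catalogued, the same data is immediately queryable through Athena without a separate crawler run.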
Delimited Data Preboarding
A delimited data preboarding framework that fills the gap between managed file transfer and the data lake. CsvPath provides a domain-specific language for validating, transforming, and routing CSV and other delimited files before ingestion.
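To illustrate what "preboarding" means in practice, here is a plain-Python stand-in (stdlib csv only, not CsvPath's DSL) that applies the same kind of rule a preboarding framework would: check the header, then route each row to a valid or rejected set before anything reaches the lake.

```python
import csv
import io

def preboard(csv_text: str, required_columns: list) -> tuple:
    """Split rows into (valid, rejected). A row is valid when every required
    column is present and non-empty. Plain-Python stand-in for DSL-based rules."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in required_columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")  # reject the whole file
    valid, rejected = [], []
    for row in reader:
        if all(row.get(c) for c in required_columns):
            valid.append(row)
        else:
            rejected.append(row)  # quarantined for inspection, not ingested
    return valid, rejected
```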