Tools for collecting and ingesting data from various sources into storage and processing systems.
Data ingestion tools are specialized systems designed to collect, import, and transfer data from diverse sources into storage or processing systems. These tools handle the critical first step in any data pipeline — getting raw data from databases, APIs, message queues, files, and streaming sources into a centralized location for further processing. They support both batch and real-time ingestion patterns, ensuring data is reliably captured and delivered to downstream systems like data warehouses, data lakes, or stream processors.
Open Source Message Broker
A robust, open-source message broker that supports multiple messaging protocols including AMQP, MQTT, and STOMP. RabbitMQ provides reliable message delivery with flexible routing, clustering, and federation for distributed data ingestion pipelines.
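As a minimal sketch of AMQP-based ingestion with the pika client: the helper below serializes an event, and the publish function sends it to a durable queue. The host, queue name, and event fields are placeholders, not RabbitMQ defaults.

```python
import json

def build_event(source: str, payload: dict) -> bytes:
    """Serialize an ingestion event as a JSON message body."""
    return json.dumps({"source": source, "payload": payload}).encode("utf-8")

def publish(host: str, queue: str, body: bytes) -> None:
    """Publish one persistent message to a durable queue over AMQP."""
    import pika  # third-party client, assumed installed (pip install pika)
    conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = conn.channel()
    channel.queue_declare(queue=queue, durable=True)  # queue survives broker restarts
    channel.basic_publish(
        exchange="",      # default exchange routes directly by queue name
        routing_key=queue,
        body=body,
        properties=pika.BasicProperties(delivery_mode=2),  # mark message persistent
    )
    conn.close()
```

Durable queues plus persistent messages give at-least-once delivery across broker restarts, which is usually the baseline guarantee an ingestion pipeline needs.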
Distributed Pub-Sub Messaging
An open-source distributed pub-sub messaging system originally created by Yahoo. Pulsar provides multi-tenancy, geo-replication, and unified messaging and streaming with a serverless compute framework for lightweight processing.
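Pulsar's multi-tenancy shows up directly in topic naming, where every topic lives under a tenant and namespace. A small sketch using the pulsar-client package; the service URL, tenant, and namespace below are placeholders.

```python
def topic_path(tenant: str, namespace: str, topic: str) -> str:
    """Build a fully qualified Pulsar topic name; tenant and namespace
    are the units of multi-tenant isolation and policy."""
    return f"persistent://{tenant}/{namespace}/{topic}"

def send_one(service_url: str, topic: str, payload: bytes) -> None:
    """Produce a single message (sketch; requires a running Pulsar cluster)."""
    import pulsar  # pulsar-client package, assumed installed
    client = pulsar.Client(service_url)       # e.g. "pulsar://localhost:6650"
    producer = client.create_producer(topic)
    producer.send(payload)
    client.close()
```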
Hadoop-RDBMS Data Transfer
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop uses MapReduce for parallel data transfer, with support for incremental imports and direct connector APIs; the project was retired to the Apache Attic in 2021 but remains in use in existing Hadoop deployments.
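Sqoop is driven from the command line, so an incremental import is easiest to show as the argument list it expects. The JDBC URL, table, and column names below are hypothetical; the helper just assembles the command for something like subprocess.run to execute on a host with Sqoop and Hadoop configured.

```python
def sqoop_incremental_import(jdbc_url: str, table: str, target_dir: str,
                             check_column: str, last_value: int,
                             mappers: int = 4) -> list:
    """Assemble a sqoop CLI invocation for an append-mode incremental import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,           # e.g. "jdbc:mysql://db.example.com/sales"
        "--table", table,
        "--target-dir", target_dir,      # HDFS output directory
        "--incremental", "append",       # only rows with check_column > last_value
        "--check-column", check_column,
        "--last-value", str(last_value),
        "--num-mappers", str(mappers),   # parallel map tasks for the transfer
    ]
```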
Universal Data Ingestion Framework
A universal data ingestion framework originally developed at LinkedIn and later donated to the Apache Software Foundation. Gobblin handles the complete data ingestion lifecycle including extraction, transformation, quality checks, and publishing for both batch and streaming data sources.
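Gobblin jobs are declared in properties-style job files that wire a source, converters, a writer, and a publisher into one pipeline. An illustrative fragment follows; the key names reflect Gobblin's job configuration convention, but the com.example class names are placeholders for real source and converter implementations.

```properties
# Illustrative Gobblin job file; com.example classes are placeholders
job.name=OrdersIngest
job.group=Ingestion
source.class=com.example.OrdersSource
converter.classes=com.example.OrdersJsonConverter
extract.namespace=com.example.orders
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```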
Event Messaging Platform
An open-source event messaging platform, developed at Zalando, that provides a REST API on top of Kafka-like queues. Nakadi simplifies event streaming by offering schema registration, data governance, and subscription-based consumption without direct Kafka client management.
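Because Nakadi is REST-based, registering an event type is an HTTP POST with a JSON Schema attached. A stdlib-only sketch under the assumption of Nakadi's /event-types endpoint and OAuth bearer tokens; the event-type name and schema are made up for illustration.

```python
import json

def event_type_definition(name: str, owner: str, schema: dict) -> dict:
    """Build the JSON body for registering an event type via POST /event-types."""
    return {
        "name": name,
        "owning_application": owner,
        "category": "business",
        "partition_strategy": "random",
        # Nakadi expects the JSON Schema embedded as a string
        "schema": {"type": "json_schema", "schema": json.dumps(schema)},
    }

def register_event_type(base_url: str, token: str, definition: dict) -> None:
    """POST the definition to a Nakadi instance (sketch; not executed here)."""
    from urllib import request
    req = request.Request(
        f"{base_url}/event-types",
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    request.urlopen(req)
```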
Managed Real-Time Streaming
A fully managed, cloud-based service from AWS for real-time data streaming and processing. Kinesis enables collecting, processing, and analyzing streaming data at any scale, with integrations across the AWS ecosystem.
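Writing to Kinesis comes down to a PutRecord call: a bytes payload plus a partition key that determines which shard receives the record. A short boto3 sketch; the stream name is a placeholder and the call assumes AWS credentials are configured.

```python
import json

def make_record(data: dict, partition_key: str) -> dict:
    """Shape one record for Kinesis PutRecord: bytes payload plus a
    partition key that controls shard assignment (and thus ordering)."""
    return {"Data": json.dumps(data).encode("utf-8"), "PartitionKey": partition_key}

def put_record(stream: str, record: dict) -> None:
    """Send a single record (sketch; requires AWS credentials and a stream)."""
    import boto3  # AWS SDK for Python, assumed installed
    client = boto3.client("kinesis")
    client.put_record(StreamName=stream, **record)
```

Records sharing a partition key land on the same shard in order, so keying by an entity ID (e.g. a customer ID) preserves per-entity ordering.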
AWS Data Utility Belt for Python
A utility belt for handling data on AWS using Python, since renamed AWS SDK for pandas. AWS Data Wrangler extends pandas with connectors to AWS services such as S3, Glue, Athena, and Redshift, simplifying data ingestion and extraction in AWS-based pipelines.
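A typical ingestion step with the awswrangler package is writing a DataFrame to S3 as Parquet while registering it in the Glue catalog. A sketch under the assumption of configured AWS credentials; the bucket, database, and table names are placeholders.

```python
def dataset_path(bucket: str, prefix: str, table: str) -> str:
    """S3 prefix where the partitioned Parquet dataset will land."""
    return f"s3://{bucket}/{prefix}/{table}/"

def ingest_to_glue_table(rows: list, path: str, database: str, table: str) -> None:
    """Write rows as a Glue-catalogued Parquet dataset (sketch; needs AWS access)."""
    import pandas as pd
    import awswrangler as wr  # published on PyPI as awswrangler
    df = pd.DataFrame(rows)
    # dataset=True enables partition-aware writes and catalog registration
    wr.s3.to_parquet(df=df, path=path, dataset=True, database=database, table=table)
```

Once catalogued, the same data is immediately queryable through Athena without a separate crawler run.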
Delimited Data Preboarding
A delimited data preboarding framework that fills the gap between managed file transfer and the data lake. CsvPath provides a domain-specific language for validating, transforming, and routing CSV and other delimited files before ingestion.
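To illustrate what "preboarding" means in practice, here is a plain-Python stand-in (stdlib csv only, not CsvPath's DSL) that applies the same kind of rule a preboarding framework would: check the header, then route each row to a valid or rejected set before anything reaches the lake.

```python
import csv
import io

def preboard(csv_text: str, required_columns: list) -> tuple:
    """Split rows into (valid, rejected). A row is valid when every required
    column is present and non-empty. Plain-Python stand-in for DSL-based rules."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in required_columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")  # reject the whole file
    valid, rejected = [], []
    for row in reader:
        if all(row.get(c) for c in required_columns):
            valid.append(row)
        else:
            rejected.append(row)  # quarantined for inspection, not ingested
    return valid, rejected
```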