Free Tools & Datasets for Python Data Engineering

Petl

Python ETL Package

Python package specifically designed for ETL tasks, offering tools for data extraction, transformation, and loading. Suitable for simpler, script-based ETL processes.

Free

4.3

Featured

PySpark

Python API for Apache Spark

Python API for Apache Spark, enabling scalable and efficient data processing. Particularly useful for ETL processes involving large datasets that need parallel processing across a cluster.

Free

4.8

DLT (Data Load Tool)

Python Data Loading Library

Python library that facilitates the loading phase in ETL processes. Designed to simplify loading data into various data stores or processing systems.

Free

4.5

Featured

dbt (Data Build Tool)

Transform Data in Your Warehouse

Open-source transformation tool enabling data analysts and engineers to transform, test, and document data in the warehouse. Focuses on the transform part of ETL with SQL templating and Python scripting.

Freemium

4.9

Bonobo

Lightweight ETL Framework

Lightweight Extract-Transform-Load (ETL) framework for Python 3.6+. Allows writing ETL scripts in pure Python, particularly suited for simple and straightforward ETL tasks.

Free

4.2

Mage.AI

Data Pipeline Tool

Modern data pipeline tool focused on automating data preparation and feature engineering for machine learning. Streamlines the data transformation process in ETL workflows.

Freemium

4.6

Featured

Apache Airflow

Workflow Orchestration Platform

Platform to programmatically author, schedule, and monitor workflows. Allows for complex pipeline construction and efficient task management with robust dependency handling.

Free

4.8
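The dependency handling described for Apache Airflow can be illustrated without Airflow itself. The following is a minimal sketch that uses only the standard library's `graphlib` module (Python 3.9+) to order tasks so that each one runs after its prerequisites. The task names, the `DEPENDENCIES` mapping, and the `run_pipeline` helper are hypothetical, and none of this is Airflow's own API.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a task, each value is the set of
# tasks that must finish before it can start.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"transform", "validate"},
}


def run_pipeline(dependencies):
    """Return the task names in an order that respects every dependency."""
    order = []
    for task in TopologicalSorter(dependencies).static_order():
        # A real orchestrator would execute the task here; this sketch only records it.
        order.append(task)
    return order


execution_order = run_pipeline(DEPENDENCIES)
print(execution_order)
```

An orchestrator adds scheduling, retries, and monitoring on top of this ordering step, which is why the sketch should be read as the core idea only.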

Luigi

Batch Job Pipeline Builder

Developed by Spotify, Luigi helps build complex pipelines of batch jobs, handling dependency resolution, workflow management, and task visualization.

Free

4.4

Apache NiFi

Data Flow Automation

Easy-to-use, powerful, and reliable system to process and distribute data, offering a web-based user interface for data flow management.

Free

4.5

Featured

Prefect

Modern Workflow Orchestration

Workflow management system designed for modern infrastructure, with a focus on simplicity, ease of use, and flexibility in defining and executing workflows.

Freemium

4.7

Featured

Dagster

Data Orchestrator for ML & Analytics

Open-source data orchestrator for machine learning, analytics, and ETL. Focuses on development, production, and observation of data assets with integrated pipeline views.

Freemium

4.7

Argo Workflows

Kubernetes-Native Workflow Engine

Open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Designed for large-scale computational tasks with powerful workflow features.

Free

4.6

Dask

Parallel Computing Library

Parallel computing library that scales Pandas workflows to larger-than-memory datasets. Enables parallel processing while maintaining a familiar Pandas-like interface for big data.

Free

4.6

Featured

NumPy

Numerical Computing Library

Fundamental library for numerical computing in Python. Supports large multi-dimensional arrays and matrices with a vast collection of mathematical functions for array operations.

Free

4.9
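As a rough illustration of the array operations mentioned for NumPy, the sketch below centers each row of a small two-dimensional array by subtracting its mean. The `readings` data is made up for the example.

```python
import numpy as np

# Hypothetical sample data: 3 sensors (rows) x 4 readings (columns).
readings = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [0.0, 1.0, 0.0, 1.0],
])

# Per-row mean, kept as a column so it broadcasts against each row.
row_means = readings.mean(axis=1, keepdims=True)

# Vectorized subtraction: no explicit Python loop over rows or columns.
centered = readings - row_means

print(row_means.ravel())
print(centered.shape)
```

Keeping the mean as a column (`keepdims=True`) lets broadcasting line it up with every row, which is the usual way to avoid hand-written loops.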

Beautiful Soup

Web Scraping & HTML Parsing

Library for web scraping and parsing HTML/XML documents. Extensively used in data wrangling to clean, parse, and extract data from web sources.

Free

4.5

Scrapy

Web Crawling Framework

Powerful web crawling and scraping framework for extracting, cleaning, and processing large volumes of web data. Essential for data wrangling from web sources.

Free

4.6

TextBlob

Text Processing Library

Simple library for processing textual data with APIs for common NLP tasks. Essential for data wrangling when dealing with text data and natural language processing.

Free

4.3

Featured

Pydantic

Data Validation using Type Hints

Data validation and settings management library using Python type annotations. Ensures data conforms to defined schemas with Python's typing module, perfect for FastAPI and modern Python apps.

Free

4.9
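To make the idea of validation through type annotations concrete, here is a minimal sketch using Pydantic's `BaseModel` and `ValidationError`. The `Order` schema and the `parse_order` helper are hypothetical names chosen for illustration.

```python
from pydantic import BaseModel, ValidationError


class Order(BaseModel):
    # Hypothetical record schema, declared with ordinary type annotations.
    order_id: int
    customer: str
    amount: float


def parse_order(raw):
    """Return a validated Order, or None if the record does not fit the schema."""
    try:
        return Order(**raw)
    except ValidationError:
        return None


good = parse_order({"order_id": 1, "customer": "acme", "amount": 19.5})
bad = parse_order({"order_id": "not-a-number", "customer": "acme", "amount": 19.5})
print(good, bad)
```

Returning `None` for bad records is only one possible policy; a pipeline might instead log the error details or route the record to a quarantine table.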

Featured

Marshmallow

Object Serialization & Validation

ORM/ODM/framework-agnostic library for object serialization and deserialization. Converts complex data types to and from native Python datatypes with robust validation.

Free

4.7

Cerberus

Lightweight Data Validation

Lightweight and extensible data validation library supporting complex data structures with customizable validation rules. Highly flexible for various validation needs.

Free

4.5

Voluptuous

Python Data Structure Validation

Validates Python data structures with straightforward syntax and clear error messages. Ensures structure and content adhere to specified schemas.

Free

4.3

jsonschema

JSON Schema Validator

Library for validating JSON data against JSON Schema standards. Essential when working with JSON data formats to ensure schema compliance.

Free

4.6

Featured

Pandera

DataFrame Validation

Flexible API for data validation on dataframe structures. Validates DataFrames at runtime and integrates with Pydantic and FastAPI. Well suited to production data pipelines.

Free

4.7

Validr

Fast Validation Library

Fast, simple, and powerful validation library with declarative validation rules. Optimized for performance when validating data from various sources.

Free

4.2

Featured

SQLAlchemy

Python SQL Toolkit & ORM

Widely used ORM library providing a full suite of enterprise-level persistence patterns. Designed for efficient, high-performing database access with flexible SQL abstraction.

Free

4.9

Featured

Django ORM

Django's Built-in ORM

Part of the Django web framework; lets you define data models entirely in Python. Provides a powerful abstraction layer that translates Python code into SQL.

Free

4.8

Peewee

Small Expressive ORM

Small, expressive ORM with simple and intuitive interface. Lightweight and easy to use, perfect for small to medium-sized applications prioritizing simplicity.

Free

4.6

Pony ORM

Pythonic Query Language

Unique ORM using generator expressions for queries. Intuitive and user-friendly, allowing complex queries to be written in pure Python and translated into SQL.

Free

4.5

SQLObject

Object Interface to Database

Popular ORM providing an object-oriented interface with tables as classes and rows as instances. Supports a variety of database backends with a focus on simplicity.

Free

4.2

Tortoise ORM

Async ORM for Python

Easy-to-use asyncio ORM inspired by Django. Designed for async/await syntax, making it perfect for asynchronous applications and modern Python development.

Free

4.6

Gino

Async SQLAlchemy ORM

Async ORM built on SQLAlchemy core for asyncio programming. Provides simple and intuitive API for asynchronous database interactions with high performance.

Free

4.4

Featured

Alembic

Database Migrations for SQLAlchemy

Lightweight database migration tool for use with SQLAlchemy. Alembic allows you to create, manage, and invoke change management scripts for your database, facilitating schema migrations as your application evolves.

Free

4.7

Flyway

Database Migration Tool

Robust version control tool for databases with support for SQL-based migrations. While not Python-specific, widely used in the community and easily integrated into Python projects for database schema management.

Free / Paid

4.6

Featured

Django Migrations

Built-in Django Migration Framework

Django's powerful built-in migration framework that comes bundled with Django. Allows you to change your database schema without losing data using a simple and intuitive API.

Free

4.8

Flask-Migrate

Database Migrations for Flask

Extension that handles SQLAlchemy database migrations for Flask applications using Alembic. Provides command-line tools to manage and automate database migrations in Flask projects.

Free

4.5

yoyo-migrations

Database Schema Migration Tool

Database schema migration tool that lets you manage your database schema by applying and rolling back migration scripts written in pure SQL or Python. Simple and flexible approach to database migrations.

Free

4.3
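The apply-and-roll-back idea behind migration tools such as yoyo-migrations can be sketched with the standard library's `sqlite3` module. This is not yoyo's API: the `MIGRATIONS` list, the `applied_migrations` bookkeeping table, and the helper functions are hypothetical, and a real tool would also handle transactions, ordering rules, and failures more carefully.

```python
import sqlite3

# Hypothetical migrations: (identifier, SQL to apply, SQL to roll back).
MIGRATIONS = [
    ("0001_create_users",
     "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
     "DROP TABLE users"),
    ("0002_create_orders",
     "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)",
     "DROP TABLE orders"),
]


def applied_ids(conn):
    """Return the identifiers of already-applied migrations, oldest first."""
    conn.execute("CREATE TABLE IF NOT EXISTS applied_migrations (id TEXT PRIMARY KEY)")
    rows = conn.execute("SELECT id FROM applied_migrations ORDER BY id")
    return [row[0] for row in rows]


def apply_all(conn):
    """Apply every migration that has not been recorded yet."""
    done = set(applied_ids(conn))
    for migration_id, apply_sql, _ in MIGRATIONS:
        if migration_id not in done:
            conn.execute(apply_sql)
            conn.execute("INSERT INTO applied_migrations (id) VALUES (?)", (migration_id,))
    conn.commit()


def rollback_last(conn):
    """Undo the most recently applied migration, if any."""
    done = applied_ids(conn)
    if not done:
        return
    last = done[-1]
    rollback_sql = next(sql for mid, _, sql in MIGRATIONS if mid == last)
    conn.execute(rollback_sql)
    conn.execute("DELETE FROM applied_migrations WHERE id = ?", (last,))
    conn.commit()


conn = sqlite3.connect(":memory:")
apply_all(conn)
after_apply = applied_ids(conn)
rollback_last(conn)
after_rollback = applied_ids(conn)
print(after_apply, after_rollback)
```

Recording applied migrations in the database itself is what lets the same script run safely more than once.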

SQLAlchemy-Migrate

Schema Versioning for SQLAlchemy

Provides a way to deal with database schema changes in SQLAlchemy projects. Extends SQLAlchemy to have database schema versioning and migration capabilities for managing database evolution.

Free

4.2

South

Legacy Django Migrations

The original migration tool for Django before built-in migrations were added in Django 1.7. Still relevant for maintaining or upgrading legacy Django applications running older versions.

Free

Featured

Apache Kafka

Distributed Event Streaming Platform

Distributed event streaming platform capable of handling trillions of events a day. Used for building real-time streaming data pipelines and applications with high-throughput, fault-tolerance, and scalability.

Free

4.8

Featured

Apache Flink

Stream Processing Framework

Framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Known for high performance in streaming data processing with exactly-once semantics.

Free

4.7

Apache Storm

Real-Time Computation System

Real-time computation system making it easy to process unbounded streams of data reliably. Fast and scalable distributed real-time computation framework for stream processing.

Free

4.4

Faust

Python Stream Processing

Stream processing library porting ideas from Kafka Streams to Python. Used for building high-performance and reliable real-time stream processing applications with Pythonic API.

Free

4.5

Apache Spark Streaming

Scalable Stream Processing

Extension of Apache Spark API enabling scalable, high-throughput, fault-tolerant processing of live data streams. Integrated within Spark ecosystem for complex real-time data processing tasks.

Free

4.6

Redpanda

Modern Streaming Platform

Streaming data platform API-compatible with Apache Kafka but designed for better performance and easier operational management. Modern streaming platform for mission-critical workloads.

Free / Paid

4.6

Featured

Flask

Lightweight Web Framework

Lightweight WSGI web application framework easy to get started with and versatile for complex applications. Popular for building web APIs thanks to simplicity and extensibility.

Free

4.8

Featured

Django REST Framework

Powerful API Toolkit for Django

Powerful and flexible toolkit for building Web APIs in Django. Highly recommended for adding API capabilities to Django applications with comprehensive features and excellent documentation.

Free

4.9

Featured

FastAPI

Modern High-Performance Framework

Modern, fast (high-performance) web framework for building APIs with Python 3.7+ based on standard type hints. Features automatic API documentation, a gentle learning curve, and fast execution.

Free

4.9

Tornado

Asynchronous Networking Library

Python web framework and asynchronous networking library. Particularly useful for long-polling, WebSockets, and applications requiring long-lived connections to each user.

Free

4.5

Falcon

High-Performance Python Framework

Reliable, high-performance Python framework for building large-scale app backends and microservices. Encourages REST architectural style while remaining highly effective and minimalist.

Free

4.6

Featured

Great Expectations

Data Validation & Documentation

Comprehensive tool helping data teams validate, document, and profile their data. Define expectations for your data ensuring it meets quality standards before processing.

Free / Paid

4.7

Ydata Profiling

Automated Data Profiling

Generates profile reports from pandas DataFrames. Excellent tool for quickly understanding data with interactive HTML reports including statistics, distributions, and correlations.

Free

4.6

PyDeequ

Data Quality for Big Data

Python API for Deequ, an AWS library built on Apache Spark for defining and verifying data quality constraints. Useful for large-scale data processing and quality verification.

Free

4.5

Dedupe

ML-Powered Deduplication

Python library using machine learning to perform deduplication and entity resolution on structured data. Particularly useful for identifying and merging duplicate records.

Free

4.4

Soda Core

Data Quality Testing

Open-source data quality tool with CLI for defining, running, and monitoring data quality checks. Write tests to verify data meets conditions like missing values, ranges, or uniqueness.

Free / Paid

4.6

DataCleaner

Automated Data Cleaning

Automatic tool for cleaning and preprocessing data. Handles missing values, encodes categorical data, and scales features making data preparation efficient.

Free

4.2

Data Linter

Schema Validation Tool

Python package for automated data validation within Data Engineering pipelines. Engineered to ingest and validate tabular data against predefined schemas.

Free

4.1

Featured

Matplotlib

Comprehensive Visualization Library

Comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib is versatile and widely used for plotting graphs and charts with extensive customization options.

Free

4.8

Featured

Seaborn

Statistical Data Visualization

Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics, simplifying the creation of complex visualizations with beautiful default themes.

Free

4.7

Featured

Plotly

Interactive Visualization Library

Plotly offers a range of interactive plotting options and is known for its advanced graphics and interactivity, supporting complex visualizations with ease. Perfect for creating web-based dashboards.

Free / Paid

4.8

Bokeh

Interactive Web Visualizations

Bokeh focuses on building interactive, web-ready plots, which can be a powerful tool for creating dynamic visualizations that can easily be embedded in web applications.

Free

4.6

Altair

Declarative Visualization

Altair is a declarative statistical visualization library for Python, offering a simple and concise way to create a wide range of statistical plots based on a logical data mapping.

Free

4.5

Featured

Scikit-learn

Machine Learning in Python

Versatile library providing a range of supervised and unsupervised learning algorithms. Known for its ease of use and efficiency for data mining and data analysis with classical ML algorithms.

Free

4.9

Featured

TensorFlow

End-to-End ML Platform

End-to-end open-source platform for machine learning enabling complex computations with data flow graphs. Widely used for deep learning applications with robust production support.

Free

4.8

Featured

PyTorch

Deep Learning Framework

Open-source machine learning library known for its flexibility, ease of use, and as a preferred tool for research in deep learning and artificial intelligence. Dynamic computation graphs.

Free

4.8

Keras

High-Level Neural Networks API

High-level neural networks API designed for fast experimentation with deep neural networks. Long associated with TensorFlow and, since Keras 3, also able to run on JAX and PyTorch backends, offering a user-friendly interface for building models.

Free

4.7

Featured

XGBoost

Extreme Gradient Boosting

Highly efficient implementation of gradient boosting frameworks designed for speed and performance. Widely used in machine learning competitions and practical applications for structured data.

Free

4.8

LightGBM

Light Gradient Boosting Machine

Gradient boosting framework using tree-based learning algorithms. Designed for speed and efficiency, supporting large datasets and distributed computing for various ML tasks.

Free

4.7

CatBoost

Gradient Boosting on Decision Trees

Algorithm for gradient boosting on decision trees developed by Yandex. Particularly effective for datasets with categorical features, known for robustness and handling overfitting well.

Free

4.6

Apache Hadoop

Distributed Storage and Processing Framework

Framework that allows for distributed processing of large datasets across clusters of computers using simple programming models. Designed to scale from single servers to thousands of machines, each offering local computation and storage. Uses HDFS for distributed storage and MapReduce for processing.

Free

4.2

Featured

Apache Beam

Unified Batch and Stream Processing

Advanced unified programming model for defining and executing data processing workflows that can run on any execution engine. Provides portability across multiple execution environments including Apache Flink, Apache Spark, and Google Cloud Dataflow. Ideal for building flexible, scalable data pipelines.

Free

4.5

Featured

Boto3

AWS SDK for Python

The official Amazon Web Services (AWS) SDK for Python. Enables Python developers to write software that makes use of services like Amazon S3, EC2, Lambda, and more. Provides easy-to-use, object-oriented API as well as low-level access to AWS services, making it simple to integrate Python applications with AWS infrastructure.

Free

4.8

Featured

Google Cloud Client Libraries

GCP SDK for Python

Google Cloud Platform's official client library for Python, enabling seamless integration with GCP services like Compute Engine, Cloud Storage, BigQuery, and Pub/Sub. Designed for a Pythonic, intuitive experience when interacting with Google Cloud services, with idiomatic code patterns and comprehensive documentation.

Free

4.7

Azure SDK for Python

Microsoft Azure SDK

Microsoft's comprehensive Azure SDK for Python, offering a complete set of packages to interact with Azure resources and services. Supports a wide range of Azure services including Virtual Machines, Storage, Databases, AI services, and more. Provides tools for effective resource management and service interaction within the Azure ecosystem.

Free

4.6

IBM Cloud Python SDK

IBM Cloud Services SDK

Official SDK for interacting with various IBM Cloud services programmatically. Provides comprehensive support for IBM Cloud services including CIS, DNS, IAM, VPC, Watson AI, and more. Enables management and automation of IBM Cloud resources with Python, compatible with Python 3.6 and above.

Free

4.3

Oracle Cloud Infrastructure SDK

OCI SDK for Python

Official SDK for writing code to manage Oracle Cloud Infrastructure resources. Supports a wide range of Oracle Cloud services with functionality for compute, storage, networking, databases, and more. Available across multiple operating systems and Python versions, providing a robust interface for OCI resource management.

Free

4.4

Featured

dbdiagram.io

Database Design as Code

Free, simple tool to draw Entity-Relationship diagrams by just writing code. Designed to help developers design and visualize database structures in a straightforward and intuitive way. Perfect for quickly sketching database schemas and sharing them with your team through simple DSL syntax.

Free

4.6

Featured

MySQL Workbench

MySQL Database Design Tool

Integrated tool provided by MySQL for database design, modeling, administration, and maintenance. Provides visual interface for creating, managing, and analyzing MySQL databases. Includes data modeling, SQL development, and comprehensive administration tools for MySQL database systems.

Free

4.4

ERAlchemy

ER Diagrams from SQLAlchemy

Python library designed to create Entity Relationship diagrams by extracting data from databases or SQLAlchemy models. Particularly useful for database designers and developers who need to visualize and interpret complex relationships within database systems. Generates diagrams automatically from your Python code.

Free

4.2

Dia

Open Source Diagramming

Free and open-source diagramming tool that can be used to create Entity-Relationship diagrams. Versatile application suitable for simple modeling tasks, flowcharts, network diagrams, and database schemas. Lightweight alternative for developers who need basic ER diagram functionality.

Free

3.8

Featured

PostgreSQL

Advanced Open Source Database

Powerful, open-source object-relational database system known for reliability, feature robustness, and performance. Widely used in Python community with excellent support for advanced data types, JSON, full-text search, and performance optimization. ACID-compliant with strong community and enterprise adoption.

Free

4.8

Featured

Redis

In-Memory Data Store

Open-source, in-memory data structure store used as database, cache, and message broker. Supports various data structures including strings, hashes, lists, sets, sorted sets, and streams. Provides high performance, sub-millisecond latency, and is widely used for caching, session management, and real-time analytics.

Free

4.7

Apache Cassandra

Distributed Wide-Column Store

Highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers with no single point of failure. Provides high availability and linear scalability. Ideal for applications requiring continuous availability and massive write throughput.

Free

4.3

Apache Atlas

Enterprise Data Governance

Scalable and extensible set of core foundational governance services for Hadoop ecosystem and enterprise data. Enables organizations to effectively meet compliance requirements with metadata management, data classification, and lineage tracking. Integrates with Python through REST APIs for governance automation.

Free

4.2

Featured

Amundsen

Data Discovery & Metadata Engine

Data discovery and metadata engine for improving productivity of data analysts, scientists, and engineers when interacting with data. Provides powerful search, data previews, and column-level lineage. Integrates seamlessly with Python environments and modern data stacks for comprehensive metadata management.

Free

4.5

CKAN

Open Data Management System

Powerful data management system that makes data accessible by providing tools to streamline publishing, sharing, finding, and using data. Aimed at data publishers wanting to make their data open and available. Features data cataloging, API generation, and visualization capabilities.

Free

4.1

Marquez

Metadata Service for Data Lineage

Open-source metadata service for collection, aggregation, and visualization of data ecosystem metadata. Provides common interface to track data lineage across your entire data platform. Offers Python client for integration and supports OpenLineage standard for lineage collection.

Free

4.3

Featured

DataHub

Modern Metadata Platform

Open-source metadata platform for the modern data stack. Provides powerful and flexible metadata search, discovery, and lineage capabilities. Features real-time metadata updates, data quality monitoring, and governance workflows. Extensive Python SDK for automation and integration.

Free

4.6

Featured

Stack Overflow

Q&A for Data Engineers

Vast community of developers and IT professionals with extensive data engineering questions and answers. Rich resource for troubleshooting, learning from real-world problems, and discovering solutions. Active community providing quick responses to technical challenges in Python data engineering.

Free

4.7

Featured

r/dataengineering

Data Engineering Subreddit

Dedicated subreddit for data engineering professionals and enthusiasts. Active discussions on trends, articles, questions, and insights related to data engineering including Python-specific topics. Great for staying updated with industry news and community wisdom.

Free

4.6

Featured

dbt Community

Analytics Engineering Hub

Vibrant Slack community focused on dbt and modern data practices. Fantastic place for data engineers to discuss analytics engineering, share experiences, and find support on various data topics including Python integrations. Active community with thousands of members.

Free

4.8

Featured

Kaggle

Data Science Competition Platform

World's largest data science community featuring competitions, datasets, and collaborative notebooks. Members share code, discuss methodologies, and collaborate on projects. Excellent for finding practical Python examples, innovative solutions, and learning from top data scientists.

Free

4.7

Data Engineering Social Club

LinkedIn Professional Network

LinkedIn group where data engineering professionals share articles, discuss industry trends, and network. Members engage in discussions, share insights, and connect for career opportunities. Great for professional networking and staying informed about industry developments.

Free

4.3

Operational Analytics Club

Analytics Community

Community dedicated to operational analytics, offering resources and discussions to help professionals leverage data for operational decision-making. Focuses on practical applications of analytics in business operations and real-time data systems.

Free

4.2

Data-Centric AI Community

AI Data Quality Focus

Community focusing on data-centric aspects of AI development. Provides resources and discussions on improving data quality and processes in AI projects. Emphasizes the importance of high-quality data over just algorithms in machine learning success.

Free

4.3

MLOps Discord

MLOps Community Chat

Active Discord community focused on Machine Learning Operations (MLOps). Members discuss best practices, tools, and strategies for deploying and maintaining ML systems. Great for real-time discussions on ML engineering, monitoring, and production systems.

Free

4.4

Featured

DataTalks.Club

Data Community & Events

Community of data enthusiasts sharing knowledge through talks, discussions, and events. Offers free courses, weekly events, and active Slack community. Covers data engineering, machine learning, and analytics with hands-on learning opportunities.

Free

4.7

MLOps Community

MLOps Learning Hub

Open and inclusive community for individuals interested in Machine Learning Operations. Provides resources, discussions, podcasts, and events for implementing MLOps practices. Focuses on bridging gap between ML development and production deployment.

Free

4.5

Locally Optimistic

Data Leaders Community

Community for data professionals offering insightful content, discussions, and resources. Focuses on data analytics leadership, career growth, and collaboration. Features newsletter, Slack community, and blog posts from experienced data leaders.

Free

4.4

Featured

Data Engineering Discord

DE Community Chat

Discord server dedicated to data engineering with active discussions on data infrastructure, architectures, pipelines, and best practices. Real-time community support for troubleshooting, career advice, and learning. Great for connecting with other data engineers globally.

Free

4.6

Featured

Pandas

Data Analysis & Manipulation

Foundational library for data manipulation and analysis in Python. Provides fast, flexible, and expressive data structures (DataFrames) designed for working with structured, tabular, and time series data. Essential tool for data wrangling with comprehensive features for indexing, grouping, merging, and filtering.

Free

4.9
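The filtering, merging, and grouping features listed for Pandas fit together roughly as in the sketch below. The `orders` and `customers` tables are made-up sample data.

```python
import pandas as pd

# Hypothetical sample tables.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "amount": [25.0, 75.0, 40.0, 5.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 30],
    "region": ["EU", "US", "EU"],
})

# Filtering: keep orders of at least 10.0.
large_orders = orders[orders["amount"] >= 10.0]

# Merging: attach each order's region via the shared key.
enriched = large_orders.merge(customers, on="customer_id", how="left")

# Grouping: total amount per region.
totals = enriched.groupby("region")["amount"].sum()
print(totals)
```

A left merge keeps every filtered order even if a customer record is missing, which is often the safer default when the lookup table may be incomplete.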

OpenRefine

Data Cleaning & Transformation

Powerful tool for working with messy data, cleaning it, transforming from one format to another, and extending it with web services or external data. Although not a Python library, it's valuable for advanced data wrangling alongside Python tools.

Free

4.5

ORM (encode/orm)

Lightweight Async ORM

Lightweight and async-ready ORM designed to work with FastAPI and Starlette. Particularly suited for applications requiring asynchronous database operations with minimal overhead and modern Python async/await patterns.

Free

4.3

Featured

Python

Programming Language

Python is a high-level, interpreted programming language that has become the dominant language for data engineering. Known for its clear syntax, extensive standard library, and rich ecosystem of data-focused packages. Essential foundation for all Python data engineering work.

Free

4.9

Featured

Visual Studio Code

Code Editor & IDE

Powerful, free code editor with excellent Python support through extensions. Features IntelliSense, debugging, Git integration, and a vast marketplace of extensions. The most popular IDE for Python data engineering with powerful features for managing virtual environments and running code.

Free

4.8

pip

Python Package Installer

The standard package installer for Python. Used to install and manage Python packages from the Python Package Index (PyPI) and other repositories. Essential tool for managing dependencies in any Python project, comes bundled with Python installations.

Free

4.7

virtualenv / venv

Virtual Environment Manager

Tools for creating isolated Python environments, allowing you to manage project-specific dependencies without conflicts. venv comes built into Python 3, while virtualenv offers additional features. Critical for professional Python development and maintaining clean, reproducible environments.

Free

4.6
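Because `venv` ships with Python 3, an isolated environment can also be created from Python code rather than the command line. The sketch below is a minimal example under that assumption; the `demo-env` directory name is hypothetical, and the environment is built in a temporary directory so nothing is left behind.

```python
import tempfile
import venv
from pathlib import Path

# Create the environment in a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "demo-env"  # hypothetical environment name

    # with_pip=False keeps creation fast and offline; pass True to bootstrap pip.
    venv.create(env_dir, with_pip=False)

    # Every environment records its settings in pyvenv.cfg at its root.
    config_exists = (env_dir / "pyvenv.cfg").is_file()

print(config_exists)
```

In day-to-day work the equivalent command-line form, `python -m venv <directory>`, is more common; the programmatic form is mainly useful in setup scripts and tests.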

Featured

Docker

Containerization Platform

Industry-standard platform for developing, shipping, and running applications in containers. Essential for data engineering to run databases, Kafka, and other services in isolated, reproducible environments. Docker Desktop provides an easy-to-use interface for managing containers across all operating systems.

Free

4.8

Docker Compose

Multi-Container Orchestration

Tool for defining and running multi-container Docker applications using YAML configuration files. Perfect for data engineering workflows that require multiple services like databases, message queues, and processing engines running together. Simplifies complex container setups into simple, version-controlled configurations.

Free

4.7

MySQL

Popular Open Source Relational Database

The world's most popular open source relational database, widely used for web applications and data-driven projects. MySQL offers robust SQL support, replication, and a mature ecosystem of tools and connectors for Python integration.

Free

4.7

MariaDB

Enhanced MySQL Fork

An enhanced fork of MySQL created by the original MySQL developers and intended as a drop-in replacement. MariaDB offers additional storage engines and a stronger commitment to open source development while remaining largely compatible with MySQL.

Free

4.5

CrateDB

Distributed SQL Database

A scalable SQL database that combines the familiarity of SQL with the scalability of NoSQL. CrateDB is optimized for machine data and IoT workloads, offering real-time analytics on large volumes of data with a distributed architecture.

Freemium

4.2

RQLite

Distributed SQLite Database

A lightweight, distributed relational database built on SQLite and using the Raft consensus protocol. RQLite provides fault-tolerant, replicated SQLite with an easy-to-use HTTP API, ideal for edge computing and embedded applications.

Free

4.1

Riak

Distributed Key-Value Store

A distributed NoSQL database designed to deliver maximum data availability by distributing data across multiple servers. Riak offers automatic data replication, fault tolerance, and near-linear scalability for read and write operations.

Free

3.8

Apache HBase

Distributed Column-Family Store

A distributed, scalable big data store modeled after Google's Bigtable, running on top of HDFS. HBase provides random, real-time read/write access to large datasets and is commonly used for storing sparse data in the Hadoop ecosystem.

Free

4.2

Featured

ClickHouse

Fast Columnar OLAP Database

An open-source columnar database management system designed for online analytical processing (OLAP). ClickHouse delivers exceptional query performance on large datasets, making it ideal for real-time analytics, log analysis, and time-series data.

Freemium

4.7

FiloDB

Distributed Columnar Streaming Database

A distributed, columnar, versioned, and streaming database designed for real-time and batch analytics. FiloDB combines the benefits of columnar storage with streaming ingestion, making it suitable for time-series and event data workloads.

Free

3.7

RethinkDB

Realtime Document Database

An open-source document database designed for building real-time web applications. RethinkDB pushes query results to applications in real time, eliminating the need for polling and making it ideal for collaborative apps and live dashboards.

Free

OrientDB

Multi-Model Graph & Document Database

A multi-model open-source NoSQL database that supports graph, document, key-value, and object models. OrientDB offers SQL-like queries on graph data and is well suited for applications with complex relationships and connected data.

Free

3.9

Titan

Scalable Graph Database

A scalable graph database optimized for storing and querying large graphs with billions of vertices and edges across a multi-machine cluster. Titan supports various storage backends including Cassandra, HBase, and BerkeleyDB.

Free

3.6

Apache Geode

Distributed In-Memory Database

An open-source, distributed, in-memory database providing reliable asynchronous event notifications and guaranteed message delivery. Apache Geode pools memory, CPU, network resources, and local disk storage across multiple processes for high-performance data management.

Free

3.8

OpenTSDB

Scalable Time Series Database

A scalable, distributed time series database built on HBase. OpenTSDB stores, indexes, and serves metrics collected from various sources at massive scale, making it ideal for monitoring infrastructure and IoT data collection.

Free

3.9

QuestDB

Fast SQL Time Series Database

A relational column-oriented database designed for real-time analytics on time series and event data. QuestDB uses SQL with time-series extensions and delivers exceptional ingestion performance, ideal for financial data, IoT, and application metrics.

Freemium

4.5

Featured

DuckDB

In-Process Analytical Database

A fast, in-process analytical database with zero external dependencies. DuckDB is designed for analytical query workloads and integrates seamlessly with Python and Pandas, making it ideal for local data analysis and embedded analytics.

Free

4.8

Apache Druid

Real-Time Analytics Database

A column-oriented, distributed data store designed for sub-second OLAP queries on event data. Druid is used for powering interactive analytical applications, real-time dashboards, and exploratory analytics on high-cardinality data.

Free

4.3

Greenplum

Open Source Data Warehouse

An advanced, fully featured, open source data warehouse based on PostgreSQL. Greenplum provides powerful and rapid analytics on petabyte-scale data volumes with massively parallel processing (MPP) architecture.

Free

4.2

Tarantool

In-Memory Database & App Server

An in-memory computing platform combining a database with an application server. Tarantool delivers high performance for both OLTP and data-intensive applications with built-in Lua scripting and persistent storage capabilities.

Free

Apache Samza

Distributed Stream Processing Framework

A distributed stream processing framework that uses Apache Kafka for messaging and Apache Hadoop YARN for fault tolerance, processor isolation, security, and resource management. Samza provides a simple API for building stateful stream processing applications.

Free

Featured

Apache Hudi

Incremental Data Processing Framework

An open-source framework for managing storage for real-time data processing on top of data lakes. Hudi provides record-level insert, update, and delete capabilities along with change streams, enabling incremental data pipelines on large-scale datasets.

Free

4.4

PipelineDB

Streaming SQL Database

A streaming SQL database that runs SQL queries continuously on incoming data streams. PipelineDB is built as a PostgreSQL extension, allowing you to use standard SQL to define continuous views over streaming data for real-time analytics.

Free

3.8

Pathway

Python ETL for Real-Time Data

A performant open-source Python ETL framework with a Rust runtime, supporting over 300 data sources. Pathway enables building real-time data processing pipelines with a simple Python API, handling both batch and streaming workloads seamlessly.

Free

4.3

HStreamDB

Streaming Database for IoT

A streaming database built for IoT data storage and real-time processing. HStreamDB provides stream ingestion, storage, and processing in a unified platform, enabling end-to-end streaming data management for IoT and event-driven applications.

Free

3.6

Zilla

Event-Driven API Gateway

An API gateway built for event-driven architectures and streaming. Zilla natively supports Kafka, SSE, WebSocket, MQTT, and HTTP protocols, enabling seamless integration between REST/GraphQL clients and event-driven backends.

Free

3.7

SwimOS

Real-Time Streaming Data Platform

A framework for building real-time streaming data processing applications. SwimOS combines streaming data processing with a built-in state store and UI capabilities, enabling continuous intelligence applications that process and visualize data in real time.

Free

3.5

Apache Tez

DAG-Based Processing Framework

An application framework for complex directed-acyclic-graph (DAG) based data processing tasks, built on top of Apache Hadoop YARN. Tez generalizes MapReduce to enable more efficient data processing pipelines with fewer read/write cycles.

Free

Featured

Presto

Distributed SQL Query Engine

A distributed SQL query engine designed to query large datasets distributed over one or more heterogeneous data sources. Presto enables interactive analytics on petabytes of data across data lakes, warehouses, and databases using standard SQL.

Free

4.5

Apache Hive

Data Warehouse on Hadoop

Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive provides a SQL-like interface (HiveQL) for querying data stored in Hadoop's HDFS and other compatible systems.

Free

4.3

Apache Drill

Schema-Free SQL Query Engine

A schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. Drill enables analysts and data scientists to query self-describing data like JSON, Parquet, and CSV without requiring predefined schemas or ETL transformations.

Free

Apache Mahout

Distributed Machine Learning

An environment for quickly creating scalable, performant machine learning applications. Mahout provides a mathematically expressive Scala DSL and supports Apache Spark and Apache Flink backends for distributed linear algebra operations.

Free

3.6

Spark MLlib

Spark's Machine Learning Library

Apache Spark's scalable machine learning library consisting of common learning algorithms and utilities including classification, regression, clustering, collaborative filtering, and dimensionality reduction. MLlib integrates seamlessly with Spark's data processing pipelines.

Free

4.5

Spark GraphX

Spark's Graph Processing API

Apache Spark's API for graphs and graph-parallel computation. GraphX extends the Spark RDD with a graph abstraction, providing a set of fundamental operators and optimized algorithms for graph analytics like PageRank and connected components.

Free

4.1

Apache Giraph

Large-Scale Graph Processing

An iterative graph processing system built for high scalability, used at Facebook to analyze the social graph. Giraph processes billions of vertices and edges efficiently on Hadoop infrastructure using a vertex-centric programming model.

Free

3.7

Kedro

Python Data Pipeline Framework

An open-source Python framework for creating reproducible, maintainable, and modular data science code. Kedro applies software engineering best practices to data pipelines with built-in data catalog, pipeline visualization, and experiment tracking.

Free

4.4

Hamilton

DAG-Based Data Transformation Library

A lightweight Python library for defining data transformations as a directed acyclic graph (DAG). Hamilton uses Python function signatures to define dataflow, making pipelines self-documenting, testable, and easy to maintain.

Free

4.2
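
The idea of deriving a dataflow from function signatures can be illustrated with a small standard-library sketch. This is not Hamilton's API: the `resolve` helper and the three example functions are hypothetical, and the sketch only assumes that each parameter name refers to another function of the same name.

```python
import inspect


def raw_prices() -> list:
    """Source node: no parameters, so no upstream dependencies."""
    return [10.0, 12.0, 9.0]


def total(raw_prices: list) -> float:
    """Depends on `raw_prices` because a parameter carries that name."""
    return sum(raw_prices)


def average(total: float, raw_prices: list) -> float:
    """Depends on both `total` and `raw_prices`."""
    return total / len(raw_prices)


def resolve(name, functions, cache=None):
    """Hypothetical helper: compute node `name`, resolving its parameters first.

    Each result is cached so a shared upstream node runs only once.
    There is no cycle detection in this sketch.
    """
    cache = {} if cache is None else cache
    if name not in cache:
        func = functions[name]
        params = inspect.signature(func).parameters
        kwargs = {p: resolve(p, functions, cache) for p in params}
        cache[name] = func(**kwargs)
    return cache[name]


# The "graph" is just a name-to-function mapping; edges come from signatures.
FUNCTIONS = {f.__name__: f for f in (raw_prices, total, average)}
result = resolve("average", FUNCTIONS)
```

Because every node is a plain function, each one can be unit-tested in isolation by passing its inputs directly.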

Kestra

Event-Driven Orchestration Platform

A scalable, event-driven, language-agnostic orchestration and scheduling platform. Kestra provides a declarative YAML-based workflow definition with a rich UI, supporting hundreds of plugins for data engineering, DevOps, and microservice orchestration.

Freemium

4.4

SQLMesh

Data Transformation Framework

An open-source data transformation framework for managing, testing, and deploying SQL and Python-based data pipelines. SQLMesh provides virtual data environments, automatic change detection, and incremental processing for efficient data warehouse management.

Free

4.3

Dataform

SQL-Based Data Transformation

An open-source framework and web-based IDE for managing datasets and their dependencies in data warehouses. Dataform uses SQL with added features like dependency management, testing, and documentation, now integrated into Google Cloud.

Free

4.1

Azkaban

Hadoop Workflow Scheduler

A batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban provides a web UI for managing workflows, handles job dependencies, and supports pluggable job types for running various data processing tasks.

Free

3.8

Apache Oozie

Hadoop Workflow Scheduler

A workflow scheduler system for managing Apache Hadoop jobs. Oozie supports MapReduce, Pig, Hive, and Sqoop jobs through a coordinator and workflow engine, enabling complex multi-stage data processing pipelines on Hadoop clusters.

Free

3.6

Bruin

End-to-End Data Pipeline Tool

An end-to-end data pipeline tool that combines ingestion, transformation using SQL and Python, and data quality checks in a single CLI. Bruin simplifies building and managing data pipelines with a unified developer experience.

Free

Meltano

CLI-First ELT Platform

A CLI and code-first ELT platform originally built at GitLab. Meltano uses the Singer protocol for data extraction and loading, integrating with dbt for transformations. It provides a declarative, version-controlled approach to managing data pipelines.

Free

4.3

Embulk

Bulk Data Loader

An open-source bulk data loader that helps transfer data between various databases, storages, file formats, and cloud services. Embulk supports parallel processing and plugin-based architecture for extensible data transfer pipelines.

Free

3.9

Sling

CLI Data Integration Tool

A CLI data integration tool specialized in moving data between databases and storage systems. Sling provides a simple command-line interface for extracting and loading data with support for incremental syncs, transformations, and multiple output formats.

Free

4.2

ingestr

Database-to-Database CLI Tool

A CLI tool to copy data between databases with a single command. ingestr supports 50+ sources and destinations, making it easy to move data between systems without writing custom integration code or managing complex configurations.

Free

4.1

Google Sheets ETL

Sheets to Data Warehouse Loader

An open-source tool for live importing all your Google Sheets to your data warehouse. Google Sheets ETL automates the extraction of spreadsheet data into structured tables, bridging the gap between business users and data infrastructure.

Free

3.7

DQOps

Open-Source Data Quality Platform

An open-source data quality platform for the whole data platform lifecycle. DQOps provides over 150 built-in data quality checks, anomaly detection, and data quality dashboards for monitoring data pipelines across multiple data sources.

Freemium

4.2

Grai

Data Catalog for CI/CD

A data catalog tool that integrates into your CI system to prevent data quality issues before they reach production. Grai maps data lineage across your stack and automatically tests the impact of schema changes on downstream consumers.

Free

daffy

DataFrame Contract Validation

A decorator-first DataFrame contracts and validation library for Python. daffy lets you define data contracts as decorators on functions, supporting Pandas, Polars, PyArrow, and Modin DataFrames for lightweight pipeline validation.

Free

3.8
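
A decorator-based contract can be sketched with the standard library alone. The `requires_columns` decorator below is hypothetical and is not daffy's API; it works on lists of dictionaries rather than DataFrames, purely to show the pattern of validating inputs before the function body runs.

```python
import functools


def requires_columns(*columns):
    """Hypothetical contract: every input row must contain the named columns."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(rows, *args, **kwargs):
            for index, row in enumerate(rows):
                missing = [c for c in columns if c not in row]
                if missing:
                    raise ValueError(f"row {index} is missing columns: {missing}")
            return func(rows, *args, **kwargs)
        return wrapper
    return decorator


@requires_columns("price", "quantity")
def add_revenue(rows):
    """Return new rows with a derived `revenue` column."""
    return [{**row, "revenue": row["price"] * row["quantity"]} for row in rows]


good = add_revenue([{"price": 2.0, "quantity": 3}])
```

Calling `add_revenue` with a row that lacks `quantity` raises `ValueError` before any computation happens, which is the point of putting the contract at the function boundary.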

Featured

Apache Superset

Modern BI Web Application

A modern, enterprise-ready business intelligence web application. Superset provides an intuitive interface for creating interactive dashboards, exploring data through SQL, and building rich visualizations without writing code.

Free

4.6

Redash

Data Visualization & Dashboards

An open-source tool for connecting to any data source, easily visualizing data, and sharing insights across your organization. Redash supports SQL queries, automated refreshes, and collaborative dashboard building.

Freemium

4.4

D3.js

Data-Driven Document Visualization

A JavaScript library for manipulating documents based on data. D3 enables powerful, flexible, and fully customizable data visualizations in the browser using HTML, SVG, and CSS, widely used for interactive dashboards and data storytelling.

Free

4.7

PyQtGraph

Scientific Graphics Library

A pure-Python graphics and GUI library built on PyQt and NumPy. PyQtGraph provides fast, interactive scientific graphics for displaying real-time data, including 2D/3D plots, image display, and data analysis widgets.

Free

4.1

QueryGPT

Natural Language Database Queries

A natural language database query interface with automatic chart generation. QueryGPT allows users to ask questions about their data in plain English and receive SQL queries, results, and visualizations automatically.

Free

3.8

Apache Gravitino

Unified Metadata Management

An open-source, unified metadata management platform for data lakes, data warehouses, and external catalogs. Gravitino provides a single point of access for managing metadata across diverse data sources, simplifying governance and discovery.

Free

PACE

Data Policy Enforcement Framework

An open-source framework that allows you to enforce agreements on how data should be accessed, used, and transformed. PACE provides policy-as-code capabilities for data governance, ensuring compliance across your data platform.

Free

3.8

Data Engineering Podcast

Podcast on Modern Data Infrastructure

A weekly podcast about modern data infrastructure, covering tools, techniques, and best practices in data engineering. The show features interviews with practitioners and creators of popular data tools and frameworks.

Free

4.5

The Data Stack Show

Data Engineering & Analytics Podcast

A podcast where hosts talk to data engineers, analysts, and data scientists about the tools and technologies shaping the modern data stack. Covers topics from data warehousing to analytics engineering and data governance.

Free

4.3

Data Council

Technical Data Conference

The first technical conference that bridges the gap between data scientists, data engineers, and data analysts. Data Council features hands-on talks and networking opportunities focused on practical data infrastructure challenges.

Paid

4.4

/r/ETL

Reddit ETL Community

A Reddit subreddit focused specifically on ETL (Extract, Transform, Load) topics. The community discusses ETL tools, best practices, pipeline design patterns, and troubleshooting data integration challenges.

Free

3.9

Featured

RabbitMQ

Open Source Message Broker

A robust, open-source message broker that supports multiple messaging protocols including AMQP, MQTT, and STOMP. RabbitMQ provides reliable message delivery with flexible routing, clustering, and federation for distributed data ingestion pipelines.

Free

4.6

Featured

Apache Pulsar

Distributed Pub-Sub Messaging

An open-source distributed pub-sub messaging system originally created by Yahoo. Pulsar provides multi-tenancy, geo-replication, and unified messaging and streaming with a serverless compute framework for lightweight processing.

Free

4.5

Fluentd

Unified Logging Layer

An open-source data collector for building a unified logging layer. Fluentd structures data as JSON and provides 500+ community-contributed plugins for connecting various data sources and outputs, widely used for log aggregation and forwarding.

Free

4.4

Apache Sqoop

Hadoop-RDBMS Data Transfer

A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop uses MapReduce for parallel data transfer with support for incremental imports and direct connector APIs.

Free

3.8

Apache Gobblin

Universal Data Ingestion Framework

A universal data ingestion framework for Hadoop from LinkedIn. Gobblin handles the complete data ingestion lifecycle including extraction, transformation, quality checks, and publishing for both batch and streaming data sources.

Free

3.9

Nakadi

Event Messaging Platform

An open-source event messaging platform that provides a REST API on top of Kafka-like queues. Nakadi simplifies event streaming by offering schema registration, data governance, and subscription-based consumption without direct Kafka client management.

Free

3.8

Pravega

Stream Storage System

An open-source storage system that provides a new abstraction, the stream, for continuous and unbounded data. Pravega offers auto-scaling, exactly-once semantics, and durable storage for building reliable streaming data ingestion pipelines.

Free

3.7

AWS Data Wrangler

AWS Data Utility Belt for Python

A utility belt for handling data on AWS using Python. AWS Data Wrangler extends Pandas with connectors to AWS services like S3, Glue, Athena, Redshift, and more, simplifying data ingestion and extraction in AWS-based pipelines.

Free

4.3

CsvPath Framework

Delimited Data Preboarding

A delimited data preboarding framework that fills the gap between managed file transfer and the data lake. CsvPath provides a domain-specific language for validating, transforming, and routing CSV and other delimited files before ingestion.

Free

3.7

Kreuzberg

Polyglot Document Intelligence

A polyglot document intelligence library with a Rust core and bindings for Python, TypeScript, Go, and more. Kreuzberg extracts text and structured data from documents like PDFs, images, and office files for data ingestion pipelines.

Free

3.8

db2lake

Database to Data Lake ETL

A lightweight Node.js ETL framework for moving data from databases to data lakes and data warehouses. db2lake provides simple configuration-driven extraction with support for incremental loads and multiple output formats.

Free

3.5

Featured

Apache Avro

Schema-Based Data Serialization

A data serialization system that provides rich data structures, a compact binary format, and schema evolution support. Avro is widely used in Apache Kafka ecosystems for encoding messages with schema registry integration.

Free

4.5

Featured

Apache Parquet

Columnar Storage Format

A columnar storage format available to any project in the Hadoop ecosystem. Parquet provides efficient compression and encoding schemes, making it the de facto standard for analytical workloads in data lakes and warehouses.

Free

4.8

Apache ORC

Optimized Row Columnar Format

A columnar storage format for Hadoop workloads, described by its project as the smallest and fastest of its kind. ORC provides highly efficient compression, predicate pushdown, and ACID transaction support, making it ideal for Hive-based data warehousing.

Free

4.3

Apache Thrift

Cross-Language Services Framework

A software framework for scalable cross-language services development. Thrift combines a serialization format with an RPC framework, enabling efficient communication between services written in different programming languages.

Free

Featured

Protocol Buffers

Google's Data Interchange Format

Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. Protocol Buffers provide a compact binary format with strong typing and schema evolution, widely used in gRPC and high-performance data systems.

Free

4.7

Kryo

Fast JVM Serialization Framework

A fast and efficient object graph serialization framework for Java. Kryo is commonly used as the serialization backend for Apache Spark and other JVM-based data processing frameworks for high-performance data exchange.

Free

4.1

Featured

HDFS

Hadoop Distributed File System

A distributed file system designed to run on commodity hardware as part of the Apache Hadoop ecosystem. HDFS provides high-throughput access to application data and is the foundation for storing massive datasets in Hadoop-based data platforms.

Free

4.4

Ceph

Unified Distributed Storage

A unified, distributed storage system providing object, block, and file storage in a single platform. Ceph is designed for excellent performance, reliability, and scalability, and is widely used in cloud infrastructure and data center environments.

Free

4.4

JuiceFS

Cloud-Native File System

A high-performance, cloud-native file system driven by object storage. JuiceFS provides a POSIX-compatible interface backed by cloud storage like S3, making it easy to mount cloud storage as a local file system for data processing workloads.

Freemium

4.3

GlusterFS

Scalable Network File System

A scalable, distributed network file system suitable for data-intensive tasks such as cloud storage and media streaming. GlusterFS aggregates disk storage from multiple servers into a single global namespace for large-scale data access.

Free

SeaweedFS

Simple Distributed File System

A simple and highly scalable distributed file system designed for fast, efficient storage and retrieval of billions of files. SeaweedFS supports S3 API compatibility, erasure coding, and FUSE mounting for flexible data access.

Free

4.2

S3QL

Cloud-Backed File System

A file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack. S3QL provides a standard POSIX file system interface with features like deduplication, compression, and encryption.

Free

3.8

LizardFS

Fault-Tolerant Distributed File System

A software-defined storage solution that is distributed, parallel, scalable, fault-tolerant, and geo-redundant. LizardFS provides a highly available file system with automatic data replication and self-healing capabilities.

Free

3.7

Featured

lakeFS

Git-Like Data Lake Versioning

An open-source platform that delivers resilience and manageability to object-storage-based data lakes. lakeFS provides git-like branching, merging, and versioning for data, enabling safe experimentation and CI/CD workflows for data pipelines.

Freemium

4.5

Project Nessie

Transactional Data Lake Catalog

A transactional catalog for data lakes with git-like semantics. Nessie works with Apache Iceberg tables to provide multi-table transactions, branching, tagging, and time-travel queries across your data lake.

Free

4.3

Featured

Data Profiler

Sensitive Data Detection & Profiling

A Python library by Capital One designed to make data analysis, monitoring, and sensitive data detection easy. Data Profiler automatically identifies data types, statistical patterns, and PII across structured and unstructured datasets.

Free

4.3

Desbordante

Advanced Data Pattern Discovery

An open-source data profiler focused on discovery and validation of complex patterns in data. Desbordante finds functional dependencies, association rules, and other data constraints that go beyond basic statistical profiling.

Free

Featured

Prometheus

Open-Source Monitoring System

An open-source systems monitoring and alerting toolkit with a powerful multi-dimensional data model and flexible query language (PromQL). Prometheus is the standard for monitoring cloud-native and Kubernetes-based data infrastructure.

Free

4.7

Featured

datacompy

DataFrame Comparison Library

A Python library by Capital One that facilitates the comparison of two DataFrames across Pandas, Polars, Spark, and more. datacompy provides detailed match reports with configurable tolerance levels, ideal for validating data pipeline outputs.

Free

4.2
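
Tolerance-based comparison of two pipeline outputs can be sketched without any third-party library. The `compare_rows` function below is hypothetical and is not datacompy's API; it assumes each table is a list of dictionaries with a unique key column, and the sample rows are made up.

```python
import math


def compare_rows(base, candidate, key, abs_tol=0.0):
    """Report keys found on one side only and value mismatches on shared keys."""
    base_by_key = {row[key]: row for row in base}
    cand_by_key = {row[key]: row for row in candidate}
    report = {
        "only_in_base": sorted(base_by_key.keys() - cand_by_key.keys()),
        "only_in_candidate": sorted(cand_by_key.keys() - base_by_key.keys()),
        "mismatches": [],
    }
    for k in sorted(base_by_key.keys() & cand_by_key.keys()):
        left, right = base_by_key[k], cand_by_key[k]
        for column in sorted(left.keys() & right.keys()):
            a, b = left[column], right[column]
            numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
            # Numbers match within abs_tol; everything else must be equal.
            if numeric:
                equal = math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
            else:
                equal = a == b
            if not equal:
                report["mismatches"].append((k, column, a, b))
    return report


# Made-up sample data for illustration only.
expected = [{"id": 1, "amount": 10.00}, {"id": 2, "amount": 5.50}, {"id": 3, "amount": 7.25}]
actual = [{"id": 1, "amount": 10.004}, {"id": 2, "amount": 6.00}, {"id": 4, "amount": 1.00}]
report = compare_rows(expected, actual, key="id", abs_tol=0.01)
```

With a tolerance of 0.01, the small difference on `id` 1 is accepted, while `id` 2 is reported as a mismatch and `id` 3 and 4 are listed as present on one side only.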

Datasets (43)

NASA API

Access NASA's vast collection of data, including imagery, satellite data and information about space missions.

#rest-api #json #space+3

API

United Nations API

Statistical data and information published by the United Nations Statistics Division (UNSD).

#rest-api #json #public-domain+1

API

National Weather Service API

Weather data and forecasts issued by the National Weather Service (NWS) of the United States.

#rest-api #json #weather+3

API

US Geological Survey (USGS) API

Earthquake data and information collected by the USGS Earthquake Hazards Program.

#rest-api #json #government+2

API
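
A minimal sketch of querying the earthquake catalog, using only the standard library. The endpoint and the `format`, `starttime`, `endtime`, and `minmagnitude` parameters reflect the USGS FDSN event service as commonly documented, so check them against the current USGS documentation before relying on them. The helper names are hypothetical, and the sample payload is made up in the GeoJSON shape so the sketch runs offline.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the USGS FDSN event web service.
BASE_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"


def build_query_url(start, end, min_magnitude):
    """Build a GeoJSON query URL for events in a date range."""
    params = {
        "format": "geojson",
        "starttime": start,
        "endtime": end,
        "minmagnitude": min_magnitude,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def summarize(geojson_text):
    """Return (magnitude, place) pairs, strongest first, skipping null magnitudes."""
    features = json.loads(geojson_text)["features"]
    events = [
        (f["properties"]["mag"], f["properties"]["place"])
        for f in features
        if f["properties"].get("mag") is not None
    ]
    return sorted(events, reverse=True)


url = build_query_url("2024-01-01", "2024-01-02", 5)

# Made-up payload for illustration; a real response carries many more fields.
SAMPLE = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"mag": 5.1, "place": "example location A"}},
        {"type": "Feature", "properties": {"mag": 6.2, "place": "example location B"}},
        {"type": "Feature", "properties": {"mag": None, "place": "example location C"}},
    ],
})
events = summarize(SAMPLE)
```

To fetch live data, the URL could be passed to `urllib.request.urlopen` and the response body handed to `summarize`.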

FoodData Central API

Provides REST access to FoodData Central (FDC), intended primarily for application developers who want to incorporate nutrient data into their applications or websites.

#rest-api #json #entertainment+3

API

Systembolaget

Get information about government-owned liquor stores in Sweden.

#rest-api #json #government+2

API

MusicBrainz API

An open-source database that collects information about music artists, releases, and tracks.

#rest-api #json #entertainment+3

API

OpenAQ API

Retrieve real-time and historical air quality data from locations around the world.

#rest-api #json #entertainment+5

API

OpenStreetMap API

Access open map data and perform geolocation services using OpenStreetMap.

#rest-api #json #entertainment+4

API

Bureau of Labor Statistics (BLS) API

Access labor market data and statistics from the US Department of Labor.

#rest-api #json #finance+3

API

Bureau of Economic Analysis (BEA) API

Access economic data and statistics for the United States from the BEA.

#rest-api #json #finance+3

API

Companies House API

The Companies House streaming API gives you access to real-time changes to the information held at Companies House.

#rest-api #json #government+2

API

World Bank Data

The World Bank provides free access to a wide range of economic, social and environmental data.

#csv #batch-processing #finance+4

Download

Open Data Portal (USA)

The United States open data portal offers access to government statistics, geospatial data and more.

Download

Open Data Portal (UK)

The United Kingdom open data portal offers access to government statistics, geospatial data and more.

#csv #batch-processing #machine-learning+3

Download

UNICEF Data

Access data related to children's well-being, education, health and more from UNICEF.

Download

Federal Reserve Economic Data (FRED)

Access economic data from the Federal Reserve Bank of St. Louis.

Download

US Census Bureau Data

The US Census Bureau offers a wide range of demographic, economic and social datasets through its data portal.

#csv #batch-processing #finance+5

Download

World Health Organization (WHO) Data

The WHO offers a variety of datasets covering global health indicators, disease surveillance, health systems performance and epidemiological data.

#csv #batch-processing #health+3

Download

National Centers for Environmental Information (NCEI)

The NCEI, part of NOAA, provides access to a wide range of environmental datasets, including climate data, weather observations, oceanographic data and geophysical data.

#csv #batch-processing #weather+4

Download

Federal Election Commission (FEC) Data

The FEC provides access to campaign finance data, including information on political contributions, campaign expenditures, fundraising activities and financial disclosures filed by political candidates, parties and committees in the United States.

#csv #batch-processing #finance+4

Download

National Renewable Energy Laboratory (NREL) Data

The NREL offers datasets related to renewable energy resources, including solar, wind, biomass, geothermal and hydropower.

#csv #batch-processing #science+3

Download

New York City Open Data

New York City Open Data provides access to a diverse range of datasets covering demographics, transportation, public safety, housing, health, education and more.

#csv #batch-processing #health+5

Download

Bureau of Economic Analysis (BEA) Data

The BEA provides economic data and statistics for the United States, including measures of GDP, national income, consumer spending and trade balances.

#csv #batch-processing #machine-learning+1

Download

Amazon Customer Reviews Dataset

Hosted on the AWS Open Data Registry, it contains millions of product reviews submitted by Amazon customers.

Download

Federal Aviation Administration (FAA) Data

The FAA provides various datasets related to aviation, air traffic, airports and safety regulations in the United States.

#csv #batch-processing #transportation+4

Download

Bureau of Justice Statistics (BJS) Data

The BJS collects data on crime, criminal offenders, victims of crime and the operation of justice systems in the United States.

#csv #batch-processing #government+2

Download

US National Library of Medicine (NLM) Databases

The NLM hosts a variety of biomedical and health-related databases, including PubMed, MedlinePlus and GenBank.

#csv #batch-processing #free

Download

United Nations Development Programme (UNDP) Data

The UNDP offers a variety of datasets related to global development indicators, human development indices and sustainable development goals (SDGs).

#csv #batch-processing #entertainment+3

Download

US Department of Agriculture (USDA) Data

The USDA provides access to a wide range of datasets related to agriculture, food, nutrition and rural development.

#csv #batch-processing #government+2

Download

European Union Open Data Portal

The Portal offers access to a wide range of datasets from EU institutions and agencies.

Download

Bureau of Transportation Statistics (BTS) Data

The BTS offers datasets related to transportation and mobility in the United States.

#csv #batch-processing #transportation+3

Download

Bureau of Labor Statistics (BLS) Data

The BLS, part of the United States Department of Labor, provides a wide range of datasets on labor market conditions, employment trends, wages, prices, productivity and workplace safety.

#csv #batch-processing #machine-learning+2

Download

Hugging Face Datasets

A library that provides access to a wide range of datasets for natural language processing (NLP) tasks.

Download

CDC Wonder

CDC Wonder provides access to a wide range of public health-related datasets, including mortality data, disease surveillance, birth data and more.

Download

UNESCO Institute for Statistics (UIS) Data

The UIS offers datasets covering education, literacy, science, technology and innovation, culture and communication statistics worldwide.

#csv #batch-processing #education+4

Download

Global Land Data Assimilation System (GLDAS)

GLDAS provides datasets on land surface conditions, including soil moisture, temperature, precipitation and other hydrological variables, derived from satellite and ground-based observations.

#csv #batch-processing #weather+4

Download

Natural Earth Data

Natural Earth provides public domain map datasets at various scales, covering physical and cultural features such as coastlines, rivers, cities and political boundaries.

#csv #batch-processing #maps+3

Download

Global Health Observatory Data Repository

The WHO Global Health Observatory offers datasets on a wide range of health-related indicators, including disease prevalence, mortality rates, healthcare access and more.

Download

UNICEF Childinfo

UNICEF Childinfo provides datasets on child well-being, education, health, nutrition, child protection and other child-related indicators worldwide.

Download

World Bank World Development Indicators (WDI)

The World Bank WDI offers datasets on global development indicators, including GDP, population, poverty, education, health and infrastructure.

#csv #batch-processing #health+5

Download

United Nations Conference on Trade and Development (UNCTAD) Data

UNCTAD provides datasets on trade, investment, development, globalization, economic indicators and other aspects of international trade and development.