Explore 199 tools and 43 datasets tagged with Free for Python data engineering.
Python Data Loading Library
Python library that facilitates the loading phase in ETL processes. Designed to simplify loading data into various data stores or processing systems.
Transform Data in Your Warehouse
Open-source transformation tool enabling data analysts and engineers to transform, test, and document data in the warehouse. Focuses on the transform part of ETL with SQL templating and Python scripting.
Workflow Orchestration Platform
Platform to programmatically author, schedule, and monitor workflows. Allows for complex pipeline construction and efficient task management with robust dependency handling.
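The dependency handling mentioned above comes down to running tasks in an order that respects a directed acyclic graph. The following is a minimal sketch of that idea using only Python's standard-library `graphlib`; it is not this platform's API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
    "report": {"load"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering step.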
Data Flow Automation
Easy-to-use, powerful, and reliable system to process and distribute data, offering a web-based user interface for data flow management.
Kubernetes-Native Workflow Engine
Open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Designed for large-scale computational tasks with powerful workflow features.
Web Scraping & HTML Parsing
Library for web scraping and parsing HTML/XML documents. Extensively used in data wrangling to clean, parse, and extract data from web sources.
Object Serialization & Validation
ORM/ODM/framework-agnostic library for object serialization and deserialization. Converts complex data types to and from native Python datatypes with robust validation.
Python Data Structure Validation
Validates Python data structures with straightforward syntax and clear error messages. Ensures structure and content adhere to specified schemas.
JSON Schema Validator
Library for validating JSON data against JSON Schema standards. Essential when working with JSON data formats to ensure schema compliance.
Python SQL Toolkit & ORM
Widely used ORM library providing a full suite of enterprise-level persistence patterns. Designed for efficient, high-performing database access with flexible SQL abstraction.
Django's Built-in ORM
Part of the Django web framework; lets you define data models entirely in Python. Provides an abstraction layer that translates Python code to SQL.
Async ORM for Python
Easy-to-use asyncio ORM inspired by Django. Designed for async/await syntax, making it perfect for asynchronous applications and modern Python development.
Built-in Django Migration Framework
Migration framework bundled with Django. Lets you change your database schema without losing data through a simple, intuitive API.
Database Migrations for Flask
Extension that handles SQLAlchemy database migrations for Flask applications using Alembic. Provides command-line tools to manage and automate database migrations in Flask projects.
Database Schema Migration Tool
Manages your database schema by applying and rolling back migration scripts written in pure SQL or Python. A simple, flexible approach to database migrations.
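The apply-and-roll-back cycle described above can be sketched with the standard-library `sqlite3` module. This is an illustration of the general technique, not this tool's actual interface: the `MIGRATIONS` list, the `_migrations` bookkeeping table, and the function names are all assumptions made for the example.

```python
import sqlite3

# Hypothetical migrations: (id, SQL to apply, SQL to roll back).
MIGRATIONS = [
    ("0001_create_users",
     "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
     "DROP TABLE users"),
    ("0002_create_orders",
     "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)",
     "DROP TABLE orders"),
]


def applied_ids(conn):
    """Return the ids of applied migrations, oldest first."""
    conn.execute("CREATE TABLE IF NOT EXISTS _migrations (id TEXT PRIMARY KEY)")
    return [row[0] for row in conn.execute("SELECT id FROM _migrations ORDER BY id")]


def apply_all(conn):
    """Apply every migration that has not been recorded yet."""
    done = set(applied_ids(conn))
    for mig_id, apply_sql, _ in MIGRATIONS:
        if mig_id not in done:
            conn.execute(apply_sql)
            conn.execute("INSERT INTO _migrations (id) VALUES (?)", (mig_id,))
    conn.commit()


def rollback_last(conn):
    """Undo the most recently applied migration and return its id (or None)."""
    done = applied_ids(conn)
    if not done:
        return None
    last = done[-1]
    rollback_sql = next(r for m, _, r in MIGRATIONS if m == last)
    conn.execute(rollback_sql)
    conn.execute("DELETE FROM _migrations WHERE id = ?", (last,))
    conn.commit()
    return last


conn = sqlite3.connect(":memory:")
apply_all(conn)
rolled_back = rollback_last(conn)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(rolled_back, tables)
```

Recording applied ids in the database itself is what makes repeated runs idempotent: a second `apply_all` call only applies migrations that are missing from the bookkeeping table.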
Schema Versioning for SQLAlchemy
Handles database schema changes in SQLAlchemy projects. Extends SQLAlchemy with schema versioning and migration capabilities for managing database evolution.
Distributed Event Streaming Platform
Distributed event streaming platform capable of handling trillions of events a day. Used for building real-time streaming data pipelines and applications with high-throughput, fault-tolerance, and scalability.
Stream Processing Framework
Framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Known for high performance in streaming data processing with exactly-once semantics.
Real-Time Computation System
Real-time computation system making it easy to process unbounded streams of data reliably. Fast and scalable distributed real-time computation framework for stream processing.
Scalable Stream Processing
Extension of Apache Spark API enabling scalable, high-throughput, fault-tolerant processing of live data streams. Integrated within Spark ecosystem for complex real-time data processing tasks.
Powerful API Toolkit for Django
Powerful and flexible toolkit for building Web APIs in Django. Highly recommended for adding API capabilities to Django applications with comprehensive features and excellent documentation.
Data Validation & Documentation
Comprehensive tool helping data teams validate, document, and profile their data. Define expectations for your data ensuring it meets quality standards before processing.
Automated Data Profiling
Generates profile reports from pandas DataFrames. Excellent tool for quickly understanding data with interactive HTML reports including statistics, distributions, and correlations.
Automated Data Cleaning
Automatic tool for cleaning and preprocessing data. Handles missing values, encodes categorical data, and scales features making data preparation efficient.
Schema Validation Tool
Python package for automated data validation within Data Engineering pipelines. Engineered to ingest and validate tabular data against predefined schemas.
Comprehensive Visualization Library
Comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib is versatile and widely used for plotting graphs and charts with extensive customization options.
Machine Learning in Python
Versatile library providing a range of supervised and unsupervised learning algorithms. Known for its ease of use and efficiency for data mining and data analysis with classical ML algorithms.
End-to-End ML Platform
End-to-end open-source platform for machine learning enabling complex computations with data flow graphs. Widely used for deep learning applications with robust production support.
Distributed Storage and Processing Framework
Framework that allows for distributed processing of large datasets across clusters of computers using simple programming models. Designed to scale from single servers to thousands of machines, each offering local computation and storage. Uses HDFS for distributed storage and MapReduce for processing.
Unified Batch and Stream Processing
Advanced unified programming model for defining and executing data processing workflows that can run on any supported execution engine. Provides portability across multiple execution environments including Apache Flink, Apache Spark, and Google Cloud Dataflow. Ideal for building flexible, scalable data pipelines.
AWS SDK for Python
The official Amazon Web Services (AWS) SDK for Python. Enables Python developers to write software that makes use of services like Amazon S3, EC2, Lambda, and more. Provides easy-to-use, object-oriented API as well as low-level access to AWS services, making it simple to integrate Python applications with AWS infrastructure.
GCP SDK for Python
Google Cloud Platform's official client library for Python, enabling seamless integration with GCP services like Compute Engine, Cloud Storage, BigQuery, and Pub/Sub. Designed for a Pythonic, intuitive experience when interacting with Google Cloud services, with idiomatic code patterns and comprehensive documentation.
Microsoft Azure SDK
Microsoft's comprehensive Azure SDK for Python offering a complete set of packages to interact with Azure resources and services. Supports wide range of Azure services including Virtual Machines, Storage, Databases, AI services, and more. Provides tools for effective resource management and service interaction within Azure ecosystem.
IBM Cloud Services SDK
Official SDK for interacting with various IBM Cloud services programmatically. Provides comprehensive support for IBM Cloud services including CIS, DNS, IAM, VPC, Watson AI, and more. Enables management and automation of IBM Cloud resources with Python, compatible with Python 3.6 and above.
OCI SDK for Python
Official SDK for writing code to manage Oracle Cloud Infrastructure resources. Supports wide range of Oracle Cloud services with functionalities for compute, storage, networking, databases, and more. Available across multiple operating systems and Python versions, providing robust interface for OCI resource management.
Database Design as Code
Free, simple tool to draw Entity-Relationship diagrams by just writing code. Designed to help developers design and visualize database structures in a straightforward and intuitive way. Perfect for quickly sketching database schemas and sharing them with your team through simple DSL syntax.
MySQL Database Design Tool
Integrated tool provided by MySQL for database design, modeling, administration, and maintenance. Provides visual interface for creating, managing, and analyzing MySQL databases. Includes data modeling, SQL development, and comprehensive administration tools for MySQL database systems.
ER Diagrams from SQLAlchemy
Python library designed to create Entity Relationship diagrams by extracting data from databases or SQLAlchemy models. Particularly useful for database designers and developers who need to visualize and interpret complex relationships within database systems. Generates diagrams automatically from your Python code.
Open Source Diagramming
Free and open-source diagramming tool that can be used to create Entity-Relationship diagrams. Versatile application suitable for simple modeling tasks, flowcharts, network diagrams, and database schemas. Lightweight alternative for developers who need basic ER diagram functionality.
Advanced Open Source Database
Powerful, open-source object-relational database system known for reliability, feature robustness, and performance. Widely used in Python community with excellent support for advanced data types, JSON, full-text search, and performance optimization. ACID-compliant with strong community and enterprise adoption.
In-Memory Data Store
Open-source, in-memory data structure store used as database, cache, and message broker. Supports various data structures including strings, hashes, lists, sets, sorted sets, and streams. Provides high performance, sub-millisecond latency, and is widely used for caching, session management, and real-time analytics.
Distributed Wide-Column Store
Highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers with no single point of failure. Provides high availability and linear scalability. Ideal for applications requiring continuous availability and massive write throughput.
Enterprise Data Governance
Scalable and extensible set of core foundational governance services for Hadoop ecosystem and enterprise data. Enables organizations to effectively meet compliance requirements with metadata management, data classification, and lineage tracking. Integrates with Python through REST APIs for governance automation.
Data Discovery & Metadata Engine
Data discovery and metadata engine for improving productivity of data analysts, scientists, and engineers when interacting with data. Provides powerful search, data previews, and column-level lineage. Integrates seamlessly with Python environments and modern data stacks for comprehensive metadata management.
Open Data Management System
Powerful data management system that makes data accessible by providing tools to streamline publishing, sharing, finding, and using data. Aimed at data publishers wanting to make their data open and available. Features data cataloging, API generation, and visualization capabilities.
Metadata Service for Data Lineage
Open-source metadata service for collection, aggregation, and visualization of data ecosystem metadata. Provides common interface to track data lineage across your entire data platform. Offers Python client for integration and supports OpenLineage standard for lineage collection.
Modern Metadata Platform
Open-source metadata platform for the modern data stack. Provides powerful and flexible metadata search, discovery, and lineage capabilities. Features real-time metadata updates, data quality monitoring, and governance workflows. Extensive Python SDK for automation and integration.
Q&A for Data Engineers
Vast community of developers and IT professionals with extensive data engineering questions and answers. Rich resource for troubleshooting, learning from real-world problems, and discovering solutions. Active community providing quick responses to technical challenges in Python data engineering.
Data Engineering Subreddit
Dedicated subreddit for data engineering professionals and enthusiasts. Active discussions on trends, articles, questions, and insights related to data engineering including Python-specific topics. Great for staying updated with industry news and community wisdom.
Analytics Engineering Hub
Vibrant Slack community focused on dbt and modern data practices. Fantastic place for data engineers to discuss analytics engineering, share experiences, and find support on various data topics including Python integrations. Active community with thousands of members.
Data Science Competition Platform
World's largest data science community featuring competitions, datasets, and collaborative notebooks. Members share code, discuss methodologies, and collaborate on projects. Excellent for finding practical Python examples, innovative solutions, and learning from top data scientists.
LinkedIn Professional Network
LinkedIn group where data engineering professionals share articles, discuss industry trends, and network. Members engage in discussions, share insights, and connect for career opportunities. Great for professional networking and staying informed about industry developments.
Analytics Community
Community dedicated to operational analytics, offering resources and discussions to help professionals leverage data for operational decision-making. Focuses on practical applications of analytics in business operations and real-time data systems.
AI Data Quality Focus
Community focusing on data-centric aspects of AI development. Provides resources and discussions on improving data quality and processes in AI projects. Emphasizes the importance of high-quality data over just algorithms in machine learning success.
MLOps Community Chat
Active Discord community focused on Machine Learning Operations (MLOps). Members discuss best practices, tools, and strategies for deploying and maintaining ML systems. Great for real-time discussions on ML engineering, monitoring, and production systems.
Data Community & Events
Community of data enthusiasts sharing knowledge through talks, discussions, and events. Offers free courses, weekly events, and active Slack community. Covers data engineering, machine learning, and analytics with hands-on learning opportunities.
MLOps Learning Hub
Open and inclusive community for individuals interested in Machine Learning Operations. Provides resources, discussions, podcasts, and events for implementing MLOps practices. Focuses on bridging gap between ML development and production deployment.
Data Leaders Community
Community for data professionals offering insightful content, discussions, and resources. Focuses on data analytics leadership, career growth, and collaboration. Features newsletter, Slack community, and blog posts from experienced data leaders.
DE Community Chat
Discord server dedicated to data engineering with active discussions on data infrastructure, architectures, pipelines, and best practices. Real-time community support for troubleshooting, career advice, and learning. Great for connecting with other data engineers globally.
Data Analysis & Manipulation
Foundational library for data manipulation and analysis in Python. Provides fast, flexible, and expressive data structures (DataFrames) designed for working with structured, tabular, and time series data. Essential tool for data wrangling with comprehensive features for indexing, grouping, merging, and filtering.
Data Cleaning & Transformation
Powerful tool for working with messy data, cleaning it, transforming from one format to another, and extending it with web services or external data. Although not a Python library, it's valuable for advanced data wrangling alongside Python tools.
Lightweight Async ORM
Lightweight and async-ready ORM designed to work with FastAPI and Starlette. Particularly suited for applications requiring asynchronous database operations with minimal overhead and modern Python async/await patterns.
Programming Language
Python is a high-level, interpreted programming language that has become the dominant language for data engineering. Known for its clear syntax, extensive standard library, and rich ecosystem of data-focused packages. Essential foundation for all Python data engineering work.
Code Editor & IDE
Powerful, free code editor with excellent Python support through extensions. Features IntelliSense, debugging, Git integration, and a vast marketplace of extensions. A popular choice for Python data engineering, with features for managing virtual environments and running code.
Virtual Environment Manager
Tools for creating isolated Python environments, allowing you to manage project-specific dependencies without conflicts. venv comes built into Python 3, while virtualenv offers additional features. Critical for professional Python development and maintaining clean, reproducible environments.
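As a small illustration of the built-in option, the sketch below creates an isolated environment with the standard-library `venv` module. The directory name `demo-env` and the helper `create_env` are hypothetical, and the environment is created in a temporary directory so the example cleans up after itself.

```python
import tempfile
import venv
from pathlib import Path


def create_env(target: Path) -> Path:
    """Create an isolated environment at `target` and return its config file path."""
    # with_pip=False keeps the sketch fast and offline; pass True to bootstrap pip.
    venv.create(target, with_pip=False)
    return target / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / "demo-env")  # "demo-env" is a hypothetical name
    cfg_exists = cfg.exists()
    print(cfg_exists)
```

From a shell, the usual equivalent is `python -m venv .venv`, followed by activating the environment before installing project dependencies.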
Containerization Platform
Industry-standard platform for developing, shipping, and running applications in containers. Essential for data engineering to run databases, Kafka, and other services in isolated, reproducible environments. Docker Desktop provides an easy-to-use interface for managing containers on Windows, macOS, and Linux.
Multi-Container Orchestration
Tool for defining and running multi-container Docker applications using YAML configuration files. Perfect for data engineering workflows that require multiple services like databases, message queues, and processing engines running together. Simplifies complex container setups into simple, version-controlled configurations.
Distributed Column-Family Store
A distributed, scalable big data store modeled after Google's Bigtable, running on top of HDFS. HBase provides random, real-time read/write access to large datasets and is commonly used for storing sparse data in the Hadoop ecosystem.
Fast Columnar OLAP Database
An open-source columnar database management system designed for online analytical processing (OLAP). ClickHouse delivers exceptional query performance on large datasets, making it ideal for real-time analytics, log analysis, and time-series data.
Distributed Columnar Streaming Database
A distributed, columnar, versioned, and streaming database designed for real-time and batch analytics. FiloDB combines the benefits of columnar storage with streaming ingestion, making it suitable for time-series and event data workloads.
Distributed In-Memory Database
An open-source, distributed, in-memory database providing reliable asynchronous event notifications and guaranteed message delivery. Apache Geode pools memory, CPU, network resources, and local disk storage across multiple processes for high-performance data management.
Fast SQL Time Series Database
A relational column-oriented database designed for real-time analytics on time series and event data. QuestDB uses SQL with time-series extensions and delivers exceptional ingestion performance, ideal for financial data, IoT, and application metrics.
Real-Time Analytics Database
A column-oriented, distributed data store designed for sub-second OLAP queries on event data. Druid is used for powering interactive analytical applications, real-time dashboards, and exploratory analytics on high-cardinality data.
Distributed Stream Processing Framework
A distributed stream processing framework that uses Apache Kafka for messaging and Apache Hadoop YARN for fault tolerance, processor isolation, security, and resource management. Samza provides a simple API for building stateful stream processing applications.
Incremental Data Processing Framework
An open-source framework for managing storage for real-time data processing on top of data lakes. Hudi provides record-level insert, update, and delete capabilities along with change streams, enabling incremental data pipelines on large-scale datasets.
Streaming SQL Database
A streaming SQL database that runs SQL queries continuously on incoming data streams. PipelineDB is built as a PostgreSQL extension, allowing you to use standard SQL to define continuous views over streaming data for real-time analytics.
Real-Time Streaming Data Platform
A framework for building real-time streaming data processing applications. SwimOS combines streaming data processing with a built-in state store and UI capabilities, enabling continuous intelligence applications that process and visualize data in real-time.
DAG-Based Processing Framework
An application framework for complex directed-acyclic-graph (DAG) based data processing tasks, built on top of Apache Hadoop YARN. Tez generalizes MapReduce to enable more efficient data processing pipelines with fewer read/write cycles.
Data Warehouse on Hadoop
Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive provides a SQL-like interface (HiveQL) for querying data stored in Hadoop's HDFS and other compatible systems.
Schema-Free SQL Query Engine
A schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. Drill enables analysts and data scientists to query self-describing data like JSON, Parquet, and CSV without requiring predefined schemas or ETL transformations.
Distributed Machine Learning
An environment for quickly creating scalable, performant machine learning applications. Mahout provides a mathematically expressive Scala DSL and supports Apache Spark and Apache Flink backends for distributed linear algebra operations.
Spark's Machine Learning Library
Apache Spark's scalable machine learning library consisting of common learning algorithms and utilities including classification, regression, clustering, collaborative filtering, and dimensionality reduction. MLlib integrates seamlessly with Spark's data processing pipelines.
Spark's Graph Processing API
Apache Spark's API for graphs and graph-parallel computation. GraphX extends the Spark RDD with a graph abstraction, providing a set of fundamental operators and optimized algorithms for graph analytics like PageRank and connected components.
Large-Scale Graph Processing
An iterative graph processing system built for high scalability, used at Facebook to analyze the social graph. Giraph processes billions of vertices and edges efficiently on Hadoop infrastructure using a vertex-centric programming model.
Python Data Pipeline Framework
An open-source Python framework for creating reproducible, maintainable, and modular data science code. Kedro applies software engineering best practices to data pipelines with built-in data catalog, pipeline visualization, and experiment tracking.
Event-Driven Orchestration Platform
A scalable, event-driven, language-agnostic orchestration and scheduling platform. Kestra provides a declarative YAML-based workflow definition with a rich UI, supporting hundreds of plugins for data engineering, DevOps, and microservice orchestration.
Data Transformation Framework
An open-source data transformation framework for managing, testing, and deploying SQL and Python-based data pipelines. SQLMesh provides virtual data environments, automatic change detection, and incremental processing for efficient data warehouse management.
Hadoop Workflow Scheduler
A workflow scheduler system for managing Apache Hadoop jobs. Oozie supports MapReduce, Pig, Hive, and Sqoop jobs through a coordinator and workflow engine, enabling complex multi-stage data processing pipelines on Hadoop clusters.
CLI Data Integration Tool
A CLI data integration tool specialized in moving data between databases and storage systems. Sling provides a simple command-line interface for extracting and loading data with support for incremental syncs, transformations, and multiple output formats.
Sheets to Data Warehouse Loader
An open-source tool for importing all your Google Sheets into your data warehouse live. Google Sheets ETL automates the extraction of spreadsheet data into structured tables, bridging the gap between business users and data infrastructure.
Modern BI Web Application
A modern, enterprise-ready business intelligence web application. Superset provides an intuitive interface for creating interactive dashboards, exploring data through SQL, and building rich visualizations without writing code.
Unified Metadata Management
An open-source, unified metadata management platform for data lakes, data warehouses, and external catalogs. Gravitino provides a single point of access for managing metadata across diverse data sources, simplifying governance and discovery.
Podcast on Modern Data Infrastructure
A weekly podcast about modern data infrastructure, covering tools, techniques, and best practices in data engineering. The show features interviews with practitioners and creators of popular data tools and frameworks.
Data Engineering & Analytics Podcast
A podcast where hosts talk to data engineers, analysts, and data scientists about the tools and technologies shaping the modern data stack. Covers topics from data warehousing to analytics engineering and data governance.
Technical Data Conference
A technical conference that bridges the gap between data scientists, data engineers, and data analysts. Data Council features hands-on talks and networking opportunities focused on practical data infrastructure challenges.
Open Source Message Broker
A robust, open-source message broker that supports multiple messaging protocols including AMQP, MQTT, and STOMP. RabbitMQ provides reliable message delivery with flexible routing, clustering, and federation for distributed data ingestion pipelines.
Distributed Pub-Sub Messaging
An open-source distributed pub-sub messaging system originally created by Yahoo. Pulsar provides multi-tenancy, geo-replication, and unified messaging and streaming with a serverless compute framework for lightweight processing.
Hadoop-RDBMS Data Transfer
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop uses MapReduce for parallel data transfer with support for incremental imports and direct connector APIs.
Universal Data Ingestion Framework
A universal data ingestion framework for Hadoop from LinkedIn. Gobblin handles the complete data ingestion lifecycle including extraction, transformation, quality checks, and publishing for both batch and streaming data sources.
Event Messaging Platform
An open-source event messaging platform that provides a REST API on top of Kafka-like queues. Nakadi simplifies event streaming by offering schema registration, data governance, and subscription-based consumption without direct Kafka client management.
AWS Data Utility Belt for Python
A utility belt for handling data on AWS using Python. AWS Data Wrangler (now named AWS SDK for pandas) extends pandas with connectors to AWS services like S3, Glue, Athena, Redshift, and more, simplifying data ingestion and extraction in AWS-based pipelines.
Delimited Data Preboarding
A delimited data preboarding framework that fills the gap between managed file transfer and the data lake. CsvPath provides a domain-specific language for validating, transforming, and routing CSV and other delimited files before ingestion.
Schema-Based Data Serialization
A data serialization system that provides rich data structures, a compact binary format, and schema evolution support. Avro is widely used in Apache Kafka ecosystems for encoding messages with schema registry integration.
Columnar Storage Format
A columnar storage format available to any project in the Hadoop ecosystem. Parquet provides efficient compression and encoding schemes, making it the de facto standard for analytical workloads in data lakes and warehouses.
Optimized Row Columnar Format
A columnar storage format for Hadoop workloads, designed for small file sizes and fast reads. ORC provides highly efficient compression, predicate pushdown, and ACID transaction support, making it well suited to Hive-based data warehousing.
Cross-Language Services Framework
A software framework for scalable cross-language services development. Thrift combines a serialization format with an RPC framework, enabling efficient communication between services written in different programming languages.
Google's Data Interchange Format
Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. Protocol Buffers provide a compact binary format with strong typing and schema evolution, widely used in gRPC and high-performance data systems.
Git-Like Data Lake Versioning
An open-source platform that delivers resilience and manageability to object-storage-based data lakes. lakeFS provides git-like branching, merging, and versioning for data, enabling safe experimentation and CI/CD workflows for data pipelines.
Transactional Data Lake Catalog
A transactional catalog for data lakes with git-like semantics. Nessie works with Apache Iceberg tables to provide multi-table transactions, branching, tagging, and time-travel queries across your data lake.
Sensitive Data Detection & Profiling
A Python library by Capital One designed to make data analysis, monitoring, and sensitive data detection easy. Data Profiler automatically identifies data types, statistical patterns, and PII across structured and unstructured datasets.
Advanced Data Pattern Discovery
An open-source data profiler focused on discovery and validation of complex patterns in data. Desbordante finds functional dependencies, association rules, and other data constraints that go beyond basic statistical profiling.
Open-Source Monitoring System
An open-source systems monitoring and alerting toolkit with a powerful multi-dimensional data model and flexible query language (PromQL). Prometheus is a de facto standard for monitoring cloud-native and Kubernetes-based data infrastructure.
Statistical data and information published by the United Nations Statistics Division (UNSD).
The weather data and forecasts issued by the National Weather Service (NWS) of the United States.
The earthquake data and information collected by the USGS Earthquake Hazards Program.
Provides REST access to FoodData Central (FDC). Intended primarily to assist application developers wishing to incorporate nutrient data into their applications or websites.
An API for retrieving information about government-owned liquor stores in Sweden.
An open-source database that collects information about music artists, releases, and tracks.
Retrieve real-time and historical air quality data from locations around the world.
Access open-source map data and perform geolocation services using OpenStreetMap.
Access labor market data and statistics from the US Department of Labor.
Access economic data and statistics for the United States from the BEA.
The Companies House streaming API gives you real-time access to changes in the information held at Companies House.
The World Bank provides free access to a wide range of economic, social and environmental data.
Various governments and organizations maintain open data portals, offering access to government statistics, geospatial data and more.
Access data related to children's well-being, education, health and more from UNICEF.
Access economic data from the Federal Reserve Bank of St. Louis.
The US Census Bureau offers a wide range of demographic, economic and social datasets through its data portal.
The WHO offers a variety of datasets covering global health indicators, disease surveillance, health systems performance and epidemiological data.
The NCEI, part of NOAA, provides access to a wide range of environmental datasets, including climate data, weather observations, oceanographic data and geophysical data.
The FEC provides access to campaign finance data, including information on political contributions, campaign expenditures, fundraising activities and financial disclosures filed by political candidates, parties and committees in the United States.
The NREL offers datasets related to renewable energy resources, including solar, wind, biomass, geothermal and hydropower.
Provides access to a diverse range of datasets covering demographics, transportation, public safety, housing, health, education and more.
The BEA provides economic data and statistics for the United States, including measures of GDP, national income, consumer spending and trade balances.
Hosted on the AWS Open Data Registry, it contains millions of product reviews submitted by Amazon customers.
The FAA provides various datasets related to aviation, air traffic, airports and safety regulations in the United States.
The BJS collects data on crime, criminal offenders, victims of crime and the operation of justice systems in the United States.
The NLM hosts a variety of biomedical and health-related databases, including PubMed, MedlinePlus and GenBank.
The UNDP offers a variety of datasets related to global development indicators, human development indices and sustainable development goals (SDGs).
Provides access to a wide range of datasets related to agriculture, food, nutrition and rural development.
The Portal offers access to a wide range of datasets from EU institutions and agencies.
The BTS offers datasets related to transportation and mobility in the United States.
The United States Department of Labor provides a wide range of datasets on labor market conditions, employment trends, wages, prices, productivity and workplace safety.
A library that provides access to a wide range of datasets for natural language processing (NLP) tasks.
CDC Wonder provides access to a wide range of public health-related datasets, including mortality data, disease surveillance, birth data and more.
The UIS offers datasets covering education, literacy, science, technology and innovation, culture and communication statistics worldwide.
GLDAS provides datasets on land surface conditions, including soil moisture, temperature, precipitation and other hydrological variables, derived from satellite and ground-based observations.
Natural Earth provides public domain map datasets at various scales, covering physical and cultural features such as coastlines, rivers, cities and political boundaries.
The WHO Global Health Observatory offers datasets on a wide range of health-related indicators, including disease prevalence, mortality rates, healthcare access and more.
UNICEF Childinfo provides datasets on child well-being, education, health, nutrition, child protection and other child-related indicators worldwide.
The World Bank WDI offers datasets on global development indicators, including GDP, population, poverty, education, health and infrastructure.
UNCTAD provides datasets on trade, investment, development, globalization, economic indicators and other aspects of international trade and development.
EOSDIS provides access to a wide range of Earth observation datasets, including satellite imagery, climate data, land cover, oceanography and atmospheric data.