The Kaggle COVID-19 Dataset, curated by the Allen Institute for AI, aggregates a comprehensive collection of research articles, datasets and other resources related to the COVID-19 pandemic.
Kaggle COVID-19 datasets can be downloaded with the `kaggle` CLI or as direct CSV files. Engineers typically load the time-series data with `pandas.read_csv()`, then apply `groupby` and `rolling()` for trend smoothing and wave detection. Many of these datasets were updated daily during the pandemic.
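The `groupby` and `rolling()` step can be sketched as follows. This is a minimal illustration on a small made-up table, not the dataset's actual contents: the column names `Country/Region`, `Date`, and `Confirmed` (cumulative counts) are assumptions, and the helper `smooth_daily_cases` and the `NewCases` / `NewCasesSmoothed` columns are hypothetical names introduced here.

```python
import pandas as pd


def smooth_daily_cases(df, window=7):
    """Add per-country daily new cases and their rolling mean.

    Assumes `Confirmed` holds cumulative counts per country and date
    (hypothetical schema for this sketch).
    """
    out = df.sort_values(["Country/Region", "Date"]).reset_index(drop=True)
    by_country = out.groupby("Country/Region")
    # Cumulative -> daily counts. The first row of each country has no
    # predecessor, so it is set to 0; negative corrections are clipped.
    out["NewCases"] = by_country["Confirmed"].diff().fillna(0).clip(lower=0)
    # Rolling mean computed within each country, never across countries.
    out["NewCasesSmoothed"] = out.groupby("Country/Region")["NewCases"].transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )
    return out


# Made-up toy data for two placeholder countries, "A" and "B".
toy = pd.DataFrame({
    "Country/Region": ["A"] * 4 + ["B"] * 4,
    "Date": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"] * 2
    ),
    "Confirmed": [0, 10, 30, 60, 5, 5, 15, 15],
})

result = smooth_daily_cases(toy, window=2)
print(result[["Country/Region", "Date", "NewCases", "NewCasesSmoothed"]])
```

A 7-day window is a common choice because it averages out weekly reporting cycles; the toy call uses `window=2` only to keep the example small. Peaks in the smoothed column could then serve as a rough starting point for wave detection.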
COVID-19 datasets were used to train and evaluate many epidemic forecasting models during the pandemic. This historical data can be used to fine-tune time-series models for disease spread prediction, or to build AI systems that explain pandemic policy decisions using retrieval-augmented generation (RAG) indexed on case data and intervention timelines.
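# pip install kaggle pandas
# Assumes Kaggle API credentials are already configured for the CLI.
import subprocess
import pandas as pd

# Download and unzip the dataset into /tmp/covid; check=True raises if the CLI fails.
subprocess.run(["kaggle", "datasets", "download",
                "-d", "imdevskp/corona-virus-report", "--unzip", "-p", "/tmp/covid"],
               check=True)
# Load the per-country summary and print the ten countries with the most confirmed cases.
df = pd.read_csv("/tmp/covid/country_wise_latest.csv")
print(df.nlargest(10, "Confirmed")[["Country/Region", "Confirmed", "Deaths"]])
Official dataset source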
More datasets used by Python data engineers.
Thousands of publicly available datasets hosted on GitHub repositories covering social media, finance, healthcare, sports, and scientific domains. Accessible directly via the GitHub API or raw download URLs, making them ideal for practising version-controlled data ingestion and automated dataset pipelines in Python.
Regular XML snapshots of all Wikipedia articles, talk pages, and revision histories available for bulk download. Used in data engineering for building large-scale NLP corpora, knowledge graph extraction, full-text search indices, and training language models with Python processing tools like WikiExtractor.
Eurostat, the statistical office of the European Union, offers a comprehensive database of statistical data covering various domains such as economy, population, employment, environment and social issues.