How can I access GitHub Datasets?

GitHub Datasets is available as a downloadable dataset at https://github.com/awesomedata/awesome-public-datasets

GitHub Datasets

About This Dataset

Thousands of publicly available datasets are hosted in GitHub repositories, covering social media, finance, healthcare, sports, and scientific domains. They are accessible directly via the GitHub API or raw download URLs, which makes them well suited to practising version-controlled data ingestion and automated dataset pipelines in Python.

What You Can Build

1. Find curated dataset repositories maintained by the open-source community
2. Access raw CSV, JSON, and Parquet files directly via GitHub's raw content URLs
3. Use GitHub Actions to automate dataset versioning and update pipelines
4. Contribute to and fork community dataset repositories for collaboration
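
As a rough illustration of the versioning idea in item 3, a scheduled job (for example, one started by a GitHub Actions workflow) could fingerprint each downloaded file and record a new version only when the content changes. The sketch below assumes a SHA-256 content hash as the version marker. The function names and the manifest layout are hypothetical and are not part of any GitHub tooling.

```python
import hashlib
import json


def content_fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of a dataset file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def update_manifest(manifest: dict, name: str, data: bytes) -> bool:
    """Record the fingerprint for `name`; return True if the content changed.

    `manifest` maps a dataset name to its last seen fingerprint
    (a hypothetical layout chosen for this sketch).
    """
    new_hash = content_fingerprint(data)
    changed = manifest.get(name) != new_hash
    manifest[name] = new_hash
    return changed


# Simulate three pipeline runs against the same (made-up) file name.
manifest = {}
first = update_manifest(manifest, "births.csv", b"year,births\n2000,1\n")
same = update_manifest(manifest, "births.csv", b"year,births\n2000,1\n")
edited = update_manifest(manifest, "births.csv", b"year,births\n2000,2\n")

print(first, same, edited)
print(json.dumps(manifest, indent=2))
```

In a real pipeline the manifest would be committed back to the repository, so the Git history itself becomes the record of dataset versions.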

How Python Data Engineers Use GitHub Datasets

Engineers download GitHub-hosted datasets with `requests.get('https://raw.githubusercontent.com/...')` or use the GitHub API to list the files in a repository. `pandas.read_csv()` also accepts a raw GitHub URL directly, so a CSV file can be loaded into a DataFrame without first saving it to disk.
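
The two access paths can be sketched as follows. This is a minimal example, assuming the repository contents endpoint of the GitHub REST API (`/repos/{owner}/{repo}/contents/{path}`), which returns a JSON list of entries for a directory. The helper names (`raw_url`, `contents_api_url`, `data_files`, `list_data_files`) are hypothetical and chosen for this illustration only.

```python
from urllib.parse import quote

RAW_HOST = "https://raw.githubusercontent.com"
API_HOST = "https://api.github.com"


def raw_url(owner, repo, ref, path):
    """Build a raw content URL for a file at a given branch, tag, or commit."""
    return f"{RAW_HOST}/{owner}/{repo}/{quote(ref)}/{quote(path)}"


def contents_api_url(owner, repo, path="", ref=None):
    """Build the REST API URL that lists a directory in a repository."""
    url = f"{API_HOST}/repos/{owner}/{repo}/contents"
    if path:
        url += "/" + quote(path)
    if ref:
        url += "?ref=" + quote(ref, safe="")
    return url


def data_files(entries, suffixes=(".csv", ".json", ".parquet")):
    """Keep download URLs of file entries whose names end in a data suffix."""
    return [
        entry["download_url"]
        for entry in entries
        if entry.get("type") == "file"
        and entry["name"].lower().endswith(suffixes)
    ]


def list_data_files(owner, repo, path="", ref=None):
    """Call the API and return download URLs of data files in a directory."""
    import requests  # imported here so the pure helpers above work without it

    response = requests.get(
        contents_api_url(owner, repo, path, ref),
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    response.raise_for_status()
    # For a directory path the API returns a list of entries.
    return data_files(response.json())
```

For example, `list_data_files("fivethirtyeight", "data", "births")` would return the raw URLs of the data files in that directory, each of which can be passed to `pandas.read_csv()`. Unauthenticated API requests are rate limited, so a pipeline that lists many directories would normally send a token.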

GitHub Datasets for LLM Fine-Tuning and RAG Pipelines

GitHub is home to community-curated AI training datasets, benchmark collections, and domain-specific corpora not found in official repositories. Many NLP datasets, annotation projects, and labeled image sets live on GitHub. Use the GitHub API to automate dataset discovery and download for your AI pipelines.
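
Automated discovery could look roughly like the sketch below. It assumes the repository search endpoint of the GitHub REST API (`/search/repositories`) and the `topic:` search qualifier. The helper names and the choice of the `dataset` topic are illustrative assumptions, not a recommended query.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def search_url(keywords, topic="dataset", per_page=10):
    """Build a repository search URL, most-starred results first."""
    params = {
        "q": f"{keywords} topic:{topic}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def summarize(search_response):
    """Reduce a parsed search response to (name, stars, url) tuples."""
    return [
        (item["full_name"], item["stargazers_count"], item["html_url"])
        for item in search_response.get("items", [])
    ]


def find_dataset_repos(keywords, topic="dataset", per_page=10):
    """Run the search and return a short summary of matching repositories."""
    import requests  # imported here so the pure helpers above work without it

    response = requests.get(
        search_url(keywords, topic, per_page),
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    response.raise_for_status()
    return summarize(response.json())
```

Star counts are only a rough popularity signal, so candidate repositories still need a manual check of licensing and data quality before they feed a fine-tuning or RAG pipeline.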

Python Example

# pip install pandas
import pandas as pd

# Load a CSV dataset directly from a GitHub raw URL
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv"
df = pd.read_csv(url)
print(df.head())
print(f"Shape: {df.shape}")

Dataset Info

Category: Dataset Downloads

Type: Direct Download

Tags: #csv #batch-processing #health #social-media #news #machine-learning #oauth #api-key-required

Related Datasets

More datasets used by Python data engineers.

Wikipedia Dumps

Regular XML snapshots of all Wikipedia articles, talk pages, and revision histories available for bulk download. Used in data engineering for building large-scale NLP corpora, knowledge graph extraction, full-text search indices, and training language models with Python processing tools like WikiExtractor.

Kaggle COVID-19 Dataset

The Kaggle COVID-19 Dataset, curated by the Allen Institute for AI, aggregates a comprehensive collection of research articles, datasets and other resources related to the COVID-19 pandemic.

Eurostat Data

Eurostat, the statistical office of the European Union, offers a comprehensive database of statistical data covering various domains such as economy, population, employment, environment and social issues.