Access repositories, commits, pull requests, issues, users, and organisation data from GitHub. Ideal for building developer analytics pipelines, tracking open-source project activity, and ingesting code metadata into data warehouses using Python and the PyGithub library.
The `PyGithub` library provides a clean Python interface to GitHub's REST API. Engineers use it to batch-collect repository metadata, commit history, and issue threads, storing results in BigQuery or Snowflake for analysis.
GitHub data powers code-understanding AI: training datasets for code completion models, commit message classifiers, and bug-pattern detectors. You can expose the GitHub API as an MCP server so agents can search repositories, read files, and summarize PR discussions in natural language.
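# pip install PyGithub
from github import Github

# Authenticate with a personal access token
g = Github("YOUR_GITHUB_TOKEN")
repo = g.get_repo("apache/airflow")

# Print the title of every open issue (GitHub's issues endpoint also returns pull requests)
for issue in repo.get_issues(state="open"):
    print(issue.title)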
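To connect the example above to the warehouse-loading use case, here is a minimal sketch of flattening issues into newline-delimited JSON, a format that loaders for warehouses such as BigQuery can typically ingest. The helper names `issue_to_row` and `issues_to_ndjson` are hypothetical, the choice of columns is an assumption, and the sample objects are made-up stand-ins so the sketch runs without a token or network access. The attributes read (`number`, `title`, `state`, `user.login`, `created_at`, `pull_request`) mirror those on PyGithub's `Issue` objects.

```python
import json
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Iterator


def issue_to_row(issue: Any) -> Dict[str, Any]:
    """Flatten one Issue-like object into a JSON-serializable dict (hypothetical helper)."""
    return {
        "number": issue.number,
        "title": issue.title,
        "state": issue.state,
        "author": issue.user.login if issue.user else None,
        "created_at": issue.created_at.isoformat(),
        # On PyGithub Issue objects, pull_request is None for plain issues.
        "is_pull_request": issue.pull_request is not None,
    }


def issues_to_ndjson(issues: Iterable[Any], skip_pull_requests: bool = True) -> Iterator[str]:
    """Yield one JSON line per issue, optionally dropping pull requests (hypothetical helper)."""
    for issue in issues:
        row = issue_to_row(issue)
        if skip_pull_requests and row["is_pull_request"]:
            continue
        yield json.dumps(row, sort_keys=True)


# Made-up stand-in objects, used only to exercise the helpers offline.
sample = [
    SimpleNamespace(
        number=1,
        title="Example issue",
        state="open",
        user=SimpleNamespace(login="alice"),
        created_at=datetime(2024, 1, 2, tzinfo=timezone.utc),
        pull_request=None,
    ),
    SimpleNamespace(
        number=2,
        title="Example pull request",
        state="open",
        user=SimpleNamespace(login="bob"),
        created_at=datetime(2024, 1, 3, tzinfo=timezone.utc),
        pull_request=object(),  # any non-None value marks a pull request here
    ),
]

lines = list(issues_to_ndjson(sample))
for line in lines:
    print(line)
```

With real data, the iterable returned by `repo.get_issues(state="open")` could be passed in place of `sample`, and the resulting lines written to a file for a load job. Filtering pull requests by default is a design choice for issue-only analytics; pass `skip_pull_requests=False` to keep them.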
More datasets used by Python data engineers.
Access YouTube video metadata, channel statistics, playlist data, comments, captions, and trending content. Used in data pipelines for social media analytics, content trend monitoring, comment sentiment analysis, and building video performance dashboards using the Google API Python client.
Access music metadata, audio features (tempo, energy, danceability), playlist data, artist catalogues, and listening history from the Spotify platform. Used in data engineering for building music recommendation systems, audio feature datasets, and trend analysis pipelines with the spotipy Python library.
Retrieve tweets, user profiles, trends, and engagement metrics from the Twitter/X platform via its REST and streaming APIs. Useful for social media analytics pipelines, sentiment analysis, and building real-time data streams with Python using the Tweepy library.