GDELT (the Global Database of Events, Language, and Tone) offers datasets on global events, covering protests, conflicts, and other geopolitical events extracted from news articles, social media posts, and a variety of other sources.
GDELT data is published every 15 minutes as tab-separated (TSV) files on Google Cloud Storage. For SQL-based querying at scale, engineers use the public `gdelt-bq` project in BigQuery; for smaller analyses, they download raw event files with `requests` and process them with `pandas`.
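The download route for smaller analyses can be sketched as follows. This is a minimal sketch, not an official client: it assumes the GDELT 2.0 manifest at `http://data.gdeltproject.org/gdeltv2/lastupdate.txt` lists one file per line as `<size> <md5> <url>`, and the helper names `latest_export_url`, `read_zipped_tsv`, and `fetch_latest_events` are hypothetical. The parsing helpers use only the standard library; `requests` is imported only when a download is actually made.

```python
import csv
import io
import zipfile

# Assumption: manifest of the most recent 15-minute files, one per line,
# formatted as "<size> <md5> <url>".
LASTUPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"


def latest_export_url(manifest_text):
    """Pick the events export URL out of the manifest text."""
    for line in manifest_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].endswith(".export.CSV.zip"):
            return parts[2]
    raise ValueError("no events export entry found in manifest")


def read_zipped_tsv(zip_bytes):
    """Decompress a single-file zip archive and return its TSV rows as lists of strings."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        name = archive.namelist()[0]
        with archive.open(name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # The files are headerless and unquoted, so quoting is disabled.
            return list(csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE))


def fetch_latest_events(timeout=30):
    """Download and parse the newest events file (performs network requests)."""
    import requests  # third-party; imported here so the helpers above work without it

    manifest = requests.get(LASTUPDATE_URL, timeout=timeout)
    manifest.raise_for_status()
    archive = requests.get(latest_export_url(manifest.text), timeout=timeout)
    archive.raise_for_status()
    return read_zipped_tsv(archive.content)
```

Calling `fetch_latest_events()` returns a list of rows that can be passed to `pandas.DataFrame(rows)` for further processing. Because the files carry no header row, column names have to be assigned separately.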
GDELT is among the largest open datasets of multilingual news coverage, and its near-real-time event stream can feed geopolitical AI systems that detect emerging crises. Its article URLs can seed a web-scraped NLP corpus for LLM pre-training, and its event categories can serve as labels for fine-tuning conflict prediction models.
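As a rough illustration of the corpus-building idea, the sketch below collects unique article URLs for conflict events from rows that have already been loaded as dictionaries. `SOURCEURL` and `QuadClass` are columns of the GDELT events table, and `QuadClass` 4 denotes material conflict in the CAMEO scheme. The function name `conflict_article_urls` and the sample rows are hypothetical, and the actual scraping step is deliberately left out.

```python
MATERIAL_CONFLICT = "4"  # QuadClass 4 = Material Conflict


def conflict_article_urls(events):
    """Return unique SOURCEURL values for material-conflict events, in first-seen order."""
    seen = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for event in events:
        url = (event.get("SOURCEURL") or "").strip()
        # QuadClass may arrive as int or str depending on how the rows were loaded.
        if url and str(event.get("QuadClass")) == MATERIAL_CONFLICT:
            seen.setdefault(url, None)
    return list(seen)


# Hypothetical sample rows, for illustration only.
sample_events = [
    {"SOURCEURL": "https://example.com/a", "QuadClass": 4},
    {"SOURCEURL": "https://example.com/b", "QuadClass": 1},
    {"SOURCEURL": "https://example.com/a", "QuadClass": "4"},
    {"SOURCEURL": "", "QuadClass": 4},
]
urls = conflict_article_urls(sample_events)
```

Many events share one source article, so deduplicating before scraping avoids fetching the same page repeatedly.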
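# pip install pandas db-dtypes google-cloud-bigquery
# (to_dataframe() needs pandas and db-dtypes alongside the BigQuery client)
from google.cloud import bigquery

# Replace with a Google Cloud project you can bill queries to.
client = bigquery.Client(project="YOUR_PROJECT_ID")

# Event count and average tone per country since 2024-01-01, top 15 by volume.
# SQLDATE is an integer in YYYYMMDD form.
query = (
    "SELECT ActionGeo_CountryCode, COUNT(*) AS events, AVG(AvgTone) AS avg_tone "
    "FROM `gdelt-bq.gdeltv2.events` "
    "WHERE SQLDATE >= 20240101 "
    "GROUP BY ActionGeo_CountryCode ORDER BY events DESC LIMIT 15"
)

# Run the query and load the result into a pandas DataFrame.
df = client.query(query).to_dataframe()
print(df)

Official dataset source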
More datasets used by Python data engineers.
Thousands of publicly available datasets are hosted in GitHub repositories, covering social media, finance, healthcare, sports, and scientific domains. They are accessible directly via the GitHub API or raw download URLs, which makes them well suited to practising version-controlled data ingestion and automated dataset pipelines in Python.
Access demographic, economic, social, and geographic datasets from the US Census Bureau, including the American Community Survey, the decennial census, and the economic census. These datasets are used in data engineering for population analysis pipelines, market research, geospatial enrichment, and building socioeconomic dashboards in Python.
The National Library of Medicine hosts PubMed, MedlinePlus, GenBank, and other biomedical databases covering clinical literature, genetic sequences, drug information, and medical terminology. These databases are used in healthcare data engineering pipelines, clinical NLP workflows, and biomedical research ingestion in Python.