Google Cloud hosts petabyte-scale public datasets in BigQuery, including genomics, satellite imagery, financial markets, Wikipedia, and GitHub data. Data engineers use them for large-scale analytics, cross-dataset SQL joins, and cloud-native data pipelines built with BigQuery and Python.
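Engineers query the public datasets with the `google-cloud-bigquery` Python library, which requires a Google Cloud project. `client.query().to_dataframe()` or `pandas_gbq.read_gbq()` returns the results as a pandas DataFrame (`pd.read_gbq()` still works in older pandas releases but is deprecated in favor of `pandas_gbq.read_gbq()`). The catalog also includes blockchain, genomics, and weather datasets that are too large to download locally.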
Google Cloud Public Datasets provide a BigQuery-native environment for training AI models on multi-terabyte datasets. Use BigQuery ML to train models directly on the data without a separate ETL step, or export subsets to Vertex AI for fine-tuning. The GitHub and Stack Overflow datasets are particularly useful for training code-focused AI models.
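As a rough illustration of training in place with BigQuery ML, the sketch below builds a `CREATE MODEL` statement and submits it through the Python client. The `bigquery-public-data.ml_datasets.penguins` table, the `bqml_demo.penguin_weight` model name, and both helper functions are assumptions chosen for illustration and do not come from the text above; the destination dataset must already exist in your project.

```python
def build_create_model_sql(model_id: str, source_table: str, label_col: str) -> str:
    """Return a BigQuery ML statement that trains a linear regression in place."""
    return (
        f"CREATE OR REPLACE MODEL `{model_id}` "
        f"OPTIONS (model_type = 'linear_reg', input_label_cols = ['{label_col}']) AS "
        f"SELECT * FROM `{source_table}` "
        f"WHERE {label_col} IS NOT NULL"
    )


def train_model(project_id: str, sql: str) -> None:
    """Submit the statement and block until training finishes."""
    # Imported lazily so the SQL builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    client.query(sql).result()  # result() waits for the job to complete


if __name__ == "__main__":
    sql = build_create_model_sql(
        model_id="bqml_demo.penguin_weight",  # hypothetical dataset.model name
        source_table="bigquery-public-data.ml_datasets.penguins",  # assumed table
        label_col="body_mass_g",
    )
    print(sql)
    # train_model("YOUR_PROJECT_ID", sql)  # needs credentials and billing enabled
```

The training data never leaves BigQuery here; only the SQL statement is sent from Python, which is the main difference from exporting a subset to Vertex AI.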
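# pip install google-cloud-bigquery pandas-gbq
from google.cloud import bigquery

# Create a client bound to your own Google Cloud project.
client = bigquery.Client(project="YOUR_PROJECT_ID")

# Top 10 Wikipedia titles by total page views.
# Note: Wiki10B is a very large table, so this query scans a lot of data.
query = (
    "SELECT title, SUM(views) AS total_views "
    "FROM `bigquery-samples.wikipedia_benchmark.Wiki10B` "
    "GROUP BY title ORDER BY total_views DESC LIMIT 10"
)

# Run the query and load the result into a pandas DataFrame.
df = client.query(query).to_dataframe()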
print(df)
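Because a query over a table of this size can scan a large volume of data, it may be worth checking the scan size before running it. The sketch below uses BigQuery's dry-run mode for that; `dry_run_bytes` and `human_readable` are hypothetical helper names introduced here, and the approach is one reasonable option rather than a required step.

```python
def dry_run_bytes(client, sql: str) -> int:
    """Ask BigQuery how many bytes a query would process, without running it."""
    # Imported lazily so the formatting helper works without the client library.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)  # returns immediately for dry runs
    return job.total_bytes_processed


def human_readable(num_bytes: int) -> str:
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units[:-1]:
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} {units[-1]}"


# Example usage (requires credentials):
# client = bigquery.Client(project="YOUR_PROJECT_ID")
# print(human_readable(dry_run_bytes(client, query)))
```

A dry run validates the SQL and reports the bytes that would be processed without executing the query, so it is a cheap way to compare, say, the full table against a smaller sibling table before committing to the real run.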
More datasets used by Python data engineers.
The Federal Election Commission (FEC) provides access to campaign finance data, including political contributions, campaign expenditures, fundraising activities, and financial disclosures filed by political candidates, parties, and committees in the United States.
Data.gov hosts 300,000+ datasets from US federal agencies covering health, education, environment, agriculture, finance, and transportation. It is used in data engineering for government analytics pipelines, public health research, geospatial analysis, and building civic data applications with Python.
Data.gov.uk provides datasets from UK central and local government covering crime, transport, planning, health, and environment. It is used in data engineering for public sector analytics, policy research pipelines, geospatial visualisation, and building civic technology applications in Python.