New York City's open data portal provides 3,000+ datasets covering taxi trips, 311 complaints, crime statistics, building permits, health inspections, and transit data. It is used in urban data engineering pipelines for city analytics, transportation modelling, and geospatial dashboards built in Python.
NYC Open Data uses the Socrata Open Data API (SODA). Engineers use the `sodapy` Python client with a NYC app token to query datasets, filter by column, and paginate through millions of records. The taxi trip datasets (billions of rows) require chunked ingestion into Spark or BigQuery.
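The pagination described above can be sketched with plain `limit`/`offset` paging. This is a minimal illustration rather than a definitive implementation: `iter_pages` and `fetch_page` are hypothetical names introduced here, and the sketch assumes the caller wraps a real query (for example a `sodapy` `client.get` call with `limit`, `offset`, and a stable `order`) in a small function.

```python
from typing import Callable, Dict, Iterator, List, Optional

Row = Dict[str, str]


def iter_pages(
    fetch_page: Callable[..., List[Row]],
    page_size: int = 1000,
    max_pages: Optional[int] = None,
) -> Iterator[List[Row]]:
    """Yield successive pages of rows until the source is exhausted.

    fetch_page is a hypothetical callable taking limit= and offset=
    keyword arguments and returning a list of row dicts.
    """
    offset = 0
    pages = 0
    while True:
        rows = fetch_page(limit=page_size, offset=offset)
        if not rows:
            break  # nothing left to read
        yield rows
        pages += 1
        if len(rows) < page_size:
            break  # a short page means this was the last one
        if max_pages is not None and pages >= max_pages:
            break  # caller asked for a bounded sample
        offset += page_size


# Assumed wiring to sodapy (not executed here; needs network and a token):
#   client = Socrata("data.cityofnewyork.us", "YOUR_APP_TOKEN")
#   fetch = lambda limit, offset: client.get(
#       "erm2-nwe9", limit=limit, offset=offset, order=":id")
#   for page in iter_pages(fetch, page_size=50000):
#       ...  # write each chunk to Spark, BigQuery, or Parquet
```

Ordering by a stable key such as `:id` keeps pages from overlapping or skipping rows between requests. Processing one page at a time also keeps memory use bounded, which matters for the larger datasets.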
NYC Open Data is a widely used source of training data for machine learning: the taxi trip dataset in particular appears in many ML tutorials and models. You can build a location-aware AI system indexed on 311 complaint data to answer questions such as 'What are the most common noise complaints in Brooklyn?', or train predictive models for crime patterns and service demand.
# pip install sodapy pandas
from sodapy import Socrata
import pandas as pd
client = Socrata("data.cityofnewyork.us", "YOUR_APP_TOKEN")
# 311 service requests
results = client.get(
    "erm2-nwe9",
    where="created_date > '2024-01-01'",
    select="complaint_type,borough,created_date",
    limit=1000,
)
df = pd.DataFrame.from_records(results)
# Ten most frequent complaint types in the returned sample
print(df["complaint_type"].value_counts().head(10))
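As a rough illustration of the Brooklyn noise-complaint question mentioned earlier, the sketch below counts complaint types in records shaped like the API's JSON rows (a list of dicts). The helper name `top_complaints` and the sample rows are made up for illustration. The uppercase borough values and the "Noise - ..." naming of complaint types are assumptions about the 311 dataset that should be checked against live data.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_complaints(
    records: List[Dict[str, str]],
    borough: str,
    prefix: str,
    n: int = 5,
) -> List[Tuple[str, int]]:
    """Return the n most common complaint types in one borough
    whose name starts with the given prefix (case-insensitive)."""
    counts = Counter(
        r["complaint_type"]
        for r in records
        if r.get("borough", "").upper() == borough.upper()
        and r.get("complaint_type", "").lower().startswith(prefix.lower())
    )
    return counts.most_common(n)


# Made-up sample rows, for illustration only (not real 311 data).
sample = [
    {"complaint_type": "Noise - Residential", "borough": "BROOKLYN"},
    {"complaint_type": "Noise - Residential", "borough": "BROOKLYN"},
    {"complaint_type": "Noise - Street/Sidewalk", "borough": "BROOKLYN"},
    {"complaint_type": "Noise - Residential", "borough": "QUEENS"},
    {"complaint_type": "Illegal Parking", "borough": "BROOKLYN"},
]

print(top_complaints(sample, "Brooklyn", "noise"))
# → [('Noise - Residential', 2), ('Noise - Street/Sidewalk', 1)]
```

For large date ranges it is usually cheaper to push the aggregation to the server with SoQL, for example by passing `select="complaint_type, count(*) AS n"`, a `where` filter on borough, `group="complaint_type"`, and `order="n DESC"` to `client.get`, instead of downloading raw rows.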
More datasets used by Python data engineers.
The World Bank World Development Indicators provides 1,600+ time-series indicators covering poverty, health, education, infrastructure, and environment for 217 countries from 1960 onwards. It is used in data engineering for global development dashboards, longitudinal analysis pipelines, and economic research systems in Python.
Access datasets on child well-being, education enrolment, nutrition, immunisation, child mortality, and child protection indicators worldwide from UNICEF. They are used in data engineering for humanitarian analytics pipelines, SDG progress tracking, and global child health indicator dashboards in Python.
Gapminder provides clean, long-run historical datasets on 500+ global development indicators including income per capita, life expectancy, fertility rates, and CO2 emissions for 195 countries. It is used in data engineering for development analytics, animated visualisation pipelines, and SDG tracking systems in Python.