Open Food Facts is a collaborative database of food products from around the world, containing information on ingredients, nutritional values, labels, and food additives.
Open Food Facts provides a complete database dump as CSV (downloadable from the website) and a REST API. Engineers typically load the CSV with `pandas.read_csv()` and cope with the 180+ column schema by reading only the columns they need (the `usecols` parameter). The export is tab-separated despite its `.csv` name, so pass `sep="\t"`. The `openfoodfacts` Python library provides API access for individual product lookups.
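The column-selection approach described above can be sketched as follows. This is a minimal example under stated assumptions, not the project's recommended loader: it assumes the dump is tab-separated and that the columns `code`, `product_name`, `nutriscore_grade`, and `energy_100g` are present in your copy. The `load_products` helper is hypothetical, and the inline sample stands in for the real file path.

```python
import io
import pandas as pd

# Columns assumed to exist in the export; adjust to the header of your copy.
WANTED = ["code", "product_name", "nutriscore_grade", "energy_100g"]

def load_products(source, chunksize=100_000):
    """Read only the WANTED columns in chunks, dropping rows without a grade."""
    chunks = pd.read_csv(
        source,
        sep="\t",             # assumption: the dump is tab-separated
        usecols=WANTED,       # skip every other column at parse time
        dtype={"code": str},  # keep leading zeros in barcodes
        chunksize=chunksize,  # bound memory use on the full dump
    )
    kept = [chunk.dropna(subset=["nutriscore_grade"]) for chunk in chunks]
    return pd.concat(kept, ignore_index=True)

# Tiny stand-in for the real file, with one extra column that is ignored.
SAMPLE = (
    "code\tproduct_name\tbrands\tnutriscore_grade\tenergy_100g\n"
    "0001\tOat flakes\tAcme\ta\t1500\n"
    "0002\tMystery bar\tAcme\t\t2100\n"
)
df = load_products(io.StringIO(SAMPLE))
print(df)
```

To run it on the real dump, pass the path of the downloaded file instead of the `io.StringIO` sample. Reading in chunks and filtering each one keeps peak memory well below what loading every column at once would need.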
Open Food Facts data can be used to train AI nutrition analysis models that scan product barcodes and explain ingredient risks. RAG systems built on this dataset can answer questions such as "Does this product contain palm oil or high-fructose corn syrup?" A model fine-tuned on Nutri-Score ratings can classify food healthiness.
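Before reaching for a model, the palm-oil question above can be approximated by matching keywords in a product's ingredient text, which also gives a simple baseline for checking what a RAG system returns. The sketch below is only illustrative: the `FLAGGED` watch list and the `flag_ingredients` helper are hypothetical, and the `ingredients_text` field name is assumed from the dataset's schema.

```python
import re

# Hypothetical watch list: label -> phrases to look for (lower case, no hyphens).
FLAGGED = {
    "palm oil": ["palm oil", "palm kernel oil"],
    "high-fructose corn syrup": ["high fructose corn syrup"],
}

def flag_ingredients(ingredients_text):
    """Return the sorted labels whose phrases appear in the ingredient text."""
    # Normalise case, hyphens, and whitespace so simple substring checks work.
    text = (ingredients_text or "").lower().replace("-", " ")
    text = re.sub(r"\s+", " ", text)
    return sorted(
        label
        for label, phrases in FLAGGED.items()
        if any(phrase in text for phrase in phrases)
    )

# Example product record; the field name is an assumption about the schema.
product = {
    "product_name": "Hazelnut spread",
    "ingredients_text": "Sugar, Palm Oil, hazelnuts 13%, cocoa",
}
hits = flag_ingredients(product.get("ingredients_text"))
print(hits)
```

Substring matching is deliberately naive: it would also flag a label reading "palm oil free", and it only covers English ingredient lists, so treat the result as a first-pass filter and not as a verdict.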
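# pip install openfoodfacts pandas
import openfoodfacts
import pandas as pd

# Search products by category. This call follows the library's older
# module-level interface; check that your installed version still provides it.
results = openfoodfacts.products.get_by_category("cereals")
# Depending on the version, the result may be a dict with a "products" key or the list itself.
products = results["products"] if isinstance(results, dict) else results
# API responses nest nutrition values under "nutriments", so flatten before selecting.
# reindex() avoids a KeyError when a column is absent from the response.
df = pd.json_normalize(products).reindex(
    columns=["product_name", "nutriscore_grade", "nutriments.energy_100g"]
)
print(df.dropna(subset=["nutriscore_grade"]).head(10))
Official dataset source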
More datasets used by Python data engineers.
A curated repository of 600+ datasets covering classification, regression, clustering, and time-series tasks, widely used as machine learning benchmarks. Used in data engineering for building ML training pipelines, practising data preprocessing workflows, and loading tabular datasets into model training systems in Python.
Thousands of publicly available datasets hosted on GitHub repositories covering social media, finance, healthcare, sports, and scientific domains. Accessible directly via the GitHub API or raw download URLs, making them ideal for practising version-controlled data ingestion and automated dataset pipelines in Python.
The NOAA platform provides access to a vast collection of climate-related datasets, including historical weather data, climate observations, satellite imagery, and climate model outputs.