The Amazon Customer Reviews dataset on AWS Open Data contains 130+ million product reviews across 40+ categories, with star ratings, review text, and helpfulness votes. It is used in NLP data engineering for sentiment analysis, recommendation system training, and large-scale text processing pipelines in Python.
The Amazon Customer Reviews dataset is available as gzipped TSV files, split by product category, from the Registry of Open Data on AWS (S3). Engineers typically read large categories with `pandas.read_csv()` in chunks (Electronics alone has 7M+ reviews) and store the result as Parquet for ML pipelines.
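The chunked-read-then-Parquet pattern described above can be sketched roughly as follows. This is a minimal sketch, not a reference pipeline: the helper names `iter_review_chunks` and `write_parquet_parts` are hypothetical, the column names follow the dataset's TSV header, and the `quoting` and `on_bad_lines` settings are assumptions about how to handle stray quote characters and malformed rows in the review text.

```python
import csv
from pathlib import Path

import pandas as pd

# Columns kept for the pipeline; names follow the dataset's TSV header.
COLUMNS = ["product_category", "star_rating", "helpful_votes",
           "total_votes", "review_body"]


def iter_review_chunks(source, chunksize=100_000, columns=COLUMNS):
    """Yield cleaned DataFrames of at most `chunksize` rows each.

    `source` can be a local path, a URL, or a file-like object; pandas
    infers gzip compression from a `.gz` extension.
    """
    reader = pd.read_csv(
        source,
        sep="\t",
        usecols=columns,
        chunksize=chunksize,
        quoting=csv.QUOTE_NONE,  # assumption: keep quote characters literally
        on_bad_lines="skip",     # assumption: drop malformed rows, do not fail
    )
    for chunk in reader:
        # Coerce unparseable ratings to NaN, then drop incomplete rows.
        chunk = chunk.assign(
            star_rating=pd.to_numeric(chunk["star_rating"], errors="coerce")
        )
        chunk = chunk.dropna(subset=["star_rating", "review_body"])
        yield chunk.astype({"star_rating": "int8"})


def write_parquet_parts(chunks, out_dir):
    """Write each chunk to its own Parquet file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunks):
        path = out_dir / f"part-{i:05d}.parquet"
        chunk.to_parquet(path, index=False)  # needs pyarrow (or fastparquet)
        paths.append(path)
    return paths
```

A call such as `write_parquet_parts(iter_review_chunks("reviews.tsv.gz"), "reviews_parquet")` (with a hypothetical local file name) keeps only one chunk in memory at a time, which is the point of chunked reading for the larger categories.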
Amazon Customer Reviews is one of the most widely used datasets for fine-tuning sentiment analysis and recommendation models. The review text paired with star ratings provides labeled training data for opinion mining, and the helpfulness votes can be used to train models that rank review quality for RAG knowledge bases.
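As an illustration of how star ratings and helpfulness votes can become training labels, here is a small sketch. The function name `add_training_labels`, the 1-2 / 3 / 4-5 star mapping, and the `min_votes` threshold are all assumptions chosen for the example, not conventions defined by the dataset.

```python
import pandas as pd


def add_training_labels(df, min_votes=5):
    """Return a copy of `df` with `sentiment` and `helpful_ratio` columns."""
    # Hypothetical mapping: 1-2 stars negative, 3 neutral, 4-5 positive.
    sentiment = pd.cut(
        df["star_rating"],
        bins=[0, 2, 3, 5],
        labels=["negative", "neutral", "positive"],
    )
    # Share of helpful votes; left as NaN when a review has fewer than
    # `min_votes` total votes, since the ratio is too noisy to rank on.
    ratio = (df["helpful_votes"] / df["total_votes"]).where(
        df["total_votes"] >= min_votes
    )
    return df.assign(sentiment=sentiment, helpful_ratio=ratio)
```

The sentiment column can serve as a classification target, while rows with a non-null `helpful_ratio` give a rough quality signal for ranking reviews.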
# pip install pandas pyarrow
import pandas as pd
# Access via AWS Open Data (no download needed with S3 + boto3)
# Or download a category TSV directly:
url = "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_02.tsv.gz"
df = pd.read_csv(url, sep="\t", compression="gzip",
                 usecols=["star_rating", "review_headline", "review_body"],
                 nrows=10_000)
print(df["star_rating"].value_counts())
Official dataset source
More datasets used by Python data engineers.
The United Nations Development Programme publishes datasets on the Human Development Index, poverty rates, gender equality, and Sustainable Development Goal progress across 190+ countries. They are used in data engineering for global development analytics, SDG monitoring pipelines, and country comparison dashboards in Python.
Natural Earth provides public domain map datasets at various scales, covering physical and cultural features such as coastlines, rivers, cities and political boundaries.
Data.gov hosts 300,000+ datasets from US federal agencies covering health, education, environment, agriculture, finance, and transportation. They are used in data engineering for government analytics pipelines, public health research, geospatial analysis, and building civic data applications with Python.