Master essential data wrangling tasks with Pandas through a practical sales data analysis project. Learn to load CSV files, clean messy data, handle missing values, engineer new features, and perform powerful grouping and aggregation operations that form the foundation of any data pipeline.
This example demonstrates how to use Pandas for common data wrangling tasks in a data engineering context. It covers:
import pandas as pd import numpy as np # This will ensure we're referencing the correct files regardless of the current working directory: import os # Get the directory of the current script script_dir = os.path.dirname(os.path.abspath(__file__)) # Construct full file paths sales_file_path = os.path.join(script_dir, 'sales_data.csv') customer_file_path = os.path.join(script_dir, 'customer_data.csv') # Load sales and customer data from CSV sales_df = pd.read_csv(sales_file_path) customer_df = pd.read_csv(customer_file_path)
This section imports the necessary modules and loads the sales and customer data from a CSV files.
sales_df['date'] = pd.to_datetime(sales_df['date']) sales_df['product'] = sales_df['product'].str.lower()
Here, we convert the 'date' column to datetime format and standardize the 'product' column by converting it to lowercase.
sales_df['quantity'] = sales_df['quantity'].fillna(sales_df['quantity'].mean())
This code fills missing values in the 'quantity' column with the mean value.
sales_df['total_revenue'] = sales_df['quantity'] * sales_df['price']
We create a new 'total_revenue' column by multiplying quantity and price.
print(sales_df['total_revenue'].describe())
This prints summary statistics for the 'total_revenue' column.
monthly_sales = sales_df.groupby(sales_df['date'].dt.to_period('M'))['total_revenue'].sum() print(monthly_sales)
Here, we group the data by month and calculate the total revenue for each month.
merged_df = pd.merge(sales_df, customer_df, on='customer_id', how='left')
This code loads customer data and merges it with the sales data based on the customer ID.
analysis_file_path = os.path.join(script_dir, 'sales_analysis.csv') merged_df.to_csv(analysis_file_path, index=False)
Finally, we export the merged and processed data to a new CSV file.
To run this example:
Ensure you have Pandas installed:
pip install pandas
Prepare your data files:
Save the Python code in a file, e.g., 'pandas_example.py'
Run the script:
python pandas_example.py
The script will process the sales data, merge it with customer data, perform analysis, and create a 'sales_analysis.csv' file with the results.