Build a machine learning model to predict customer churn using Scikit-learn's Random Forest classifier. Learn data preprocessing, model training, evaluation metrics, cross-validation, and feature importance analysis - foundational ML skills every data engineer should master.
This document explains the Scikit-learn example provided in skickit_learn_example.py.
The example demonstrates how to:

- load customer churn data from a CSV file
- preprocess it (encode categorical variables, impute missing values, scale features)
- train a Random Forest classifier
- evaluate the model with accuracy, a classification report, and cross-validation
- analyze feature importances
```python
import os

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.impute import SimpleImputer

# Build the data path relative to this script so it works
# regardless of the current working directory
output_dir = os.path.dirname(os.path.abspath(__file__))
data_file = os.path.join(output_dir, 'telecom_customer_churn.csv')
data = pd.read_csv(data_file)
```
This section imports necessary modules and loads the customer churn data from a CSV file. The file path is constructed using the script's directory to ensure it works regardless of the current working directory.
Test data can be generated with the help of the additional script generate_customer_test_data.py:
python generate_customer_test_data.py
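The generator script itself is not shown here. As a rough, hypothetical sketch of what it might contain (the column names are taken from the feature-importance output further below; the value ranges, sample size, and churn rate are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000  # assumed number of customers

df = pd.DataFrame({
    'CustomerID': np.arange(1, n + 1),
    'Age': rng.integers(18, 80, n),
    'Tenure': rng.integers(0, 72, n),
    'MonthlyCharges': rng.uniform(20.0, 120.0, n).round(2),
    'InternetService': rng.choice(['DSL', 'Fiber optic', 'No'], n),
    'Contract': rng.choice(['Month-to-month', 'One year', 'Two year'], n),
    'Churn': rng.choice([0, 1], n, p=[0.7, 0.3]),  # assumed ~30% churn rate
})
df['TotalCharges'] = (df['MonthlyCharges'] * df['Tenure'].clip(lower=1)).round(2)
df.to_csv('telecom_customer_churn.csv', index=False)
```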
```python
# Separate the features from the identifier and the target
X = data.drop(['CustomerID', 'Churn'], axis=1)
y = data['Churn']

# One-hot encode categorical variables, dropping the first level of each
X = pd.get_dummies(X, drop_first=True)

# Replace missing values with the column mean
imputer = SimpleImputer(strategy='mean')
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
```
Here, we preprocess the data by:

- dropping the CustomerID identifier and the Churn target from the feature matrix
- one-hot encoding categorical variables with pd.get_dummies (drop_first=True removes one redundant column per category)
- filling missing values with the column mean via SimpleImputer
```python
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
This code splits the data into training (80%) and testing (20%) sets, then scales the features using StandardScaler. Note that the scaler is fit on the training set only and merely applied to the test set, so no information about the test data leaks into training.
```python
# Train a Random Forest with 100 trees, then predict on the held-out test set
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)
y_pred = rf_model.predict(X_test_scaled)
```
Here, we train a Random Forest model with 100 trees and use it to make predictions on the test set.
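Beyond hard 0/1 predictions, a fitted classifier can also return class probabilities, which are often more useful for ranking customers by churn risk. A minimal sketch, not part of the original script, assuming churn is encoded as the positive class 1:

```python
# Probability of the positive (churn) class for each test customer
churn_proba = rf_model.predict_proba(X_test_scaled)[:, 1]

# Flag customers above an assumed 0.5 threshold; tune this for your use case
likely_churners = churn_proba >= 0.5
print("Predicted churn rate:", likely_churners.mean())
```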
print("Accuracy:", accuracy_score(y_test, y_pred)) print("\nClassification Report:") print(classification_report(y_test, y_pred)) cv_scores = cross_val_score(rf_model, X_train_scaled, y_train, cv=5) print("\nCross-validation scores:", cv_scores) print("Mean CV score:", cv_scores.mean())
This section evaluates the model's performance using accuracy score, classification report, and 5-fold cross-validation.
```python
# Rank features by the impurity-based importances learned by the forest
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 important features:")
print(feature_importance.head(10))
```
Finally, we extract and display the top 10 most important features from the Random Forest model.
We use pandas for data manipulation, which is crucial in data engineering.
The pd.get_dummies() function handles categorical variables, a common task in real-world datasets.
SimpleImputer deals with missing values, another frequent issue in raw data.
Significance: Proper data preprocessing is essential for model accuracy and robustness.
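To see what these two steps actually do, here is a minimal, self-contained illustration on a toy frame (the values are invented for demonstration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({'Contract': ['One year', 'Two year', 'Month-to-month'],
                    'MonthlyCharges': [50.0, np.nan, 80.0]})

# drop_first=True keeps k-1 dummy columns per categorical to avoid redundancy
encoded = pd.get_dummies(toy, drop_first=True)

# The mean of the observed values (65.0) replaces the NaN
imputed = SimpleImputer(strategy='mean').fit_transform(encoded)
print(encoded.columns.tolist())
print(imputed)
```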
train_test_split() separates our data into training and testing sets.
Significance: This prevents overfitting and gives us an unbiased evaluation of our model.
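With an imbalanced target like churn, it can also help to pass stratify=y so that both splits keep the original class ratio; a small variation on the call used in the script:

```python
# Preserve the churn/no-churn ratio in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```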
StandardScaler() normalizes our features.
Significance: Many machine learning algorithms perform better with scaled features, especially when features are on different scales.
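A quick sanity check that scaling behaves as expected, using made-up numbers: each column of the transformed array ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

raw = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaled = StandardScaler().fit_transform(raw)
print(scaled.mean(axis=0))  # approximately [0. 0.]
print(scaled.std(axis=0))   # approximately [1. 1.]
```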
We use RandomForestClassifier, a powerful ensemble method.
Significance: Random Forests are versatile, handle non-linear relationships well, and are less prone to overfitting.
We use accuracy_score() and classification_report() for a comprehensive evaluation.
Significance: These metrics give us a clear picture of model performance, including precision, recall, and F1-score for each class.
cross_val_score() performs k-fold cross-validation.
Significance: This gives a more robust estimate of model performance and helps detect overfitting.
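cross_val_score also accepts a scoring parameter, so with an imbalanced target you can cross-validate on a metric other than accuracy. A sketch, assuming Churn is encoded as 0/1 as the sample output below suggests:

```python
# Score each fold on F1 for the positive (churn) class instead of accuracy
f1_scores = cross_val_score(rf_model, X_train_scaled, y_train, cv=5, scoring='f1')
print("Mean CV F1:", f1_scores.mean())
```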
We extract and display feature importances from the Random Forest model.
Significance: This helps in feature selection and provides insights into which factors most influence customer churn.
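One way to act on these importances is scikit-learn's SelectFromModel, which keeps only the features whose importance clears a threshold. A minimal sketch using the already-fitted forest:

```python
from sklearn.feature_selection import SelectFromModel

# Keep features whose importance exceeds the mean importance
selector = SelectFromModel(rf_model, prefit=True, threshold='mean')
X_train_reduced = selector.transform(X_train_scaled)
print("Features kept:", X.columns[selector.get_support()].tolist())
```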
This example demonstrates how Scikit-learn integrates seamlessly into a data engineering workflow, from data preprocessing to model evaluation and interpretation. It showcases the library's consistency in API design, making it easy to swap out different models or preprocessing steps.
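For instance, because every estimator exposes the same fit/predict interface, trying a different model is essentially a one-line change; a sketch substituting logistic regression (not part of the original script):

```python
from sklearn.linear_model import LogisticRegression

# Same workflow as before, different estimator
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_scaled, y_train)
print("Logistic regression accuracy:", lr_model.score(X_test_scaled, y_test))
```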
To run this example:
Ensure you have the required libraries installed:
pip install pandas numpy scikit-learn
Generate the sample data by running:
python generate_customer_test_data.py
Run the Scikit-learn example:
python skickit_learn_example.py
Sample output:

```
Accuracy: 0.66

Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.90      0.79       141
           1       0.26      0.08      0.13        59

    accuracy                           0.66       200
   macro avg       0.48      0.49      0.46       200
weighted avg       0.57      0.66      0.59       200

Cross-validation scores: [0.66875 0.7     0.69375 0.6625  0.65625]
Mean CV score: 0.67625

Top 10 important features:
                       feature  importance
2               MonthlyCharges    0.270630
3                 TotalCharges    0.263459
0                          Age    0.234077
1                       Tenure    0.132874
6            Contract_One year    0.026365
7            Contract_Two year    0.026303
5           InternetService_No    0.023642
4  InternetService_Fiber optic    0.022650
```