Build a machine learning model to predict customer churn using Scikit-learn's Random Forest classifier. Learn data preprocessing, model training, evaluation metrics, cross-validation, and feature importance analysis - foundational ML skills every data engineer should master.
This document explains the Scikit-learn example provided in skickit_learn_example.py.
The example demonstrates how to:

- load customer churn data from a CSV file
- preprocess it (encode categorical variables, impute missing values, scale features)
- train a Random Forest classifier
- evaluate the model with accuracy, a classification report, and cross-validation
- analyze feature importances
```python
import os

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.impute import SimpleImputer

# Build the data path relative to this script so it works
# regardless of the current working directory
output_dir = os.path.dirname(os.path.abspath(__file__))
data_file = os.path.join(output_dir, 'telecom_customer_churn.csv')
data = pd.read_csv(data_file)
```
This section imports necessary modules and loads the customer churn data from a CSV file. The file path is constructed using the script's directory to ensure it works regardless of the current working directory.
Test data can be generated with the help of the additional script generate_customer_test_data.py:
python generate_customer_test_data.py
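The generator script itself is not shown here. As a rough, hypothetical sketch of what it might contain (the column names are taken from the feature-importance output further below; the value ranges, sample size, and churn rate are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000  # assumed number of customers

df = pd.DataFrame({
    'CustomerID': np.arange(1, n + 1),
    'Age': rng.integers(18, 80, n),
    'Tenure': rng.integers(0, 72, n),
    'MonthlyCharges': rng.uniform(20.0, 120.0, n).round(2),
    'InternetService': rng.choice(['DSL', 'Fiber optic', 'No'], n),
    'Contract': rng.choice(['Month-to-month', 'One year', 'Two year'], n),
    'Churn': rng.choice([0, 1], n, p=[0.7, 0.3]),  # assumed ~30% churn rate
})
df['TotalCharges'] = (df['MonthlyCharges'] * df['Tenure'].clip(lower=1)).round(2)
df.to_csv('telecom_customer_churn.csv', index=False)
```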
```python
# Separate the features from the identifier and the target
X = data.drop(['CustomerID', 'Churn'], axis=1)
y = data['Churn']

# One-hot encode categorical variables, dropping the first level of each
X = pd.get_dummies(X, drop_first=True)

# Replace missing values with the column mean
imputer = SimpleImputer(strategy='mean')
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
```
Here, we preprocess the data by:

- dropping the CustomerID identifier and the Churn target from the feature matrix
- one-hot encoding categorical variables with pd.get_dummies (drop_first=True removes one redundant column per category)
- filling missing values with the column mean via SimpleImputer
```python
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
This code splits the data into training (80%) and testing (20%) sets, then scales the features using StandardScaler. Note that the scaler is fit on the training set only and merely applied to the test set, so no information about the test data leaks into training.
```python
# Train a Random Forest with 100 trees, then predict on the held-out test set
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)
y_pred = rf_model.predict(X_test_scaled)
```
Here, we train a Random Forest model with 100 trees and use it to make predictions on the test set.
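Beyond hard 0/1 predictions, a fitted classifier can also return class probabilities, which are often more useful for ranking customers by churn risk. A minimal sketch, not part of the original script, assuming churn is encoded as the positive class 1:

```python
# Probability of the positive (churn) class for each test customer
churn_proba = rf_model.predict_proba(X_test_scaled)[:, 1]

# Flag customers above an assumed 0.5 threshold; tune this for your use case
likely_churners = churn_proba >= 0.5
print("Predicted churn rate:", likely_churners.mean())
```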
print("Accuracy:", accuracy_score(y_test, y_pred)) print("\nClassification Report:") print(classification_report(y_test, y_pred)) cv_scores = cross_val_score(rf_model, X_train_scaled, y_train, cv=5) print("\nCross-validation scores:", cv_scores) print("Mean CV score:", cv_scores.mean())
This section evaluates the model's performance using accuracy score, classification report, and 5-fold cross-validation.
```python
# Rank features by the impurity-based importances learned by the forest
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 important features:")
print(feature_importance.head(10))
```
Finally, we extract and display the top 10 most important features from the Random Forest model.
We use pandas for data manipulation, which is crucial in data engineering.
The pd.get_dummies() function handles categorical variables, a common task in real-world datasets.
SimpleImputer deals with missing values, another frequent issue in raw data.
Significance: Proper data preprocessing is essential for model accuracy and robustness.
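To see what these two steps actually do, here is a minimal, self-contained illustration on a toy frame (the values are invented for demonstration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({'Contract': ['One year', 'Two year', 'Month-to-month'],
                    'MonthlyCharges': [50.0, np.nan, 80.0]})

# drop_first=True keeps k-1 dummy columns per categorical to avoid redundancy
encoded = pd.get_dummies(toy, drop_first=True)

# The mean of the observed values (65.0) replaces the NaN
imputed = SimpleImputer(strategy='mean').fit_transform(encoded)
print(encoded.columns.tolist())
print(imputed)
```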
train_test_split() separates our data into training and testing sets.
Significance: This prevents overfitting and gives us an unbiased evaluation of our model.
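With an imbalanced target like churn, it can also help to pass stratify=y so that both splits keep the original class ratio; a small variation on the call used in the script:

```python
# Preserve the churn/no-churn ratio in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```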
StandardScaler() normalizes our features.
Significance: Many machine learning algorithms perform better with scaled features, especially when features are on different scales.
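A quick sanity check that scaling behaves as expected, using made-up numbers: each column of the transformed array ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

raw = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaled = StandardScaler().fit_transform(raw)
print(scaled.mean(axis=0))  # approximately [0. 0.]
print(scaled.std(axis=0))   # approximately [1. 1.]
```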
We use RandomForestClassifier, a powerful ensemble method.
Significance: Random Forests are versatile, handle non-linear relationships well, and are less prone to overfitting.
We use accuracy_score() and classification_report() for a comprehensive evaluation.
Significance: These metrics give us a clear picture of model performance, including precision, recall, and F1-score for each class.
cross_val_score() performs k-fold cross-validation.
Significance: This gives a more robust estimate of model performance and helps detect overfitting.
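cross_val_score also accepts a scoring parameter, so with an imbalanced target you can cross-validate on a metric other than accuracy. A sketch, assuming Churn is encoded as 0/1 as the sample output below suggests:

```python
# Score each fold on F1 for the positive (churn) class instead of accuracy
f1_scores = cross_val_score(rf_model, X_train_scaled, y_train, cv=5, scoring='f1')
print("Mean CV F1:", f1_scores.mean())
```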
We extract and display feature importances from the Random Forest model.
Significance: This helps in feature selection and provides insights into which factors most influence customer churn.
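One way to act on these importances is scikit-learn's SelectFromModel, which keeps only the features whose importance clears a threshold. A minimal sketch using the already-fitted forest:

```python
from sklearn.feature_selection import SelectFromModel

# Keep features whose importance exceeds the mean importance
selector = SelectFromModel(rf_model, prefit=True, threshold='mean')
X_train_reduced = selector.transform(X_train_scaled)
print("Features kept:", X.columns[selector.get_support()].tolist())
```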
This example demonstrates how Scikit-learn integrates seamlessly into a data engineering workflow, from data preprocessing to model evaluation and interpretation. It showcases the library's consistency in API design, making it easy to swap out different models or preprocessing steps.
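For instance, because every estimator exposes the same fit/predict interface, trying a different model is essentially a one-line change; a sketch substituting logistic regression (not part of the original script):

```python
from sklearn.linear_model import LogisticRegression

# Same workflow as before, different estimator
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_scaled, y_train)
print("Logistic regression accuracy:", lr_model.score(X_test_scaled, y_test))
```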
To run this example:
Ensure you have the required libraries installed:
pip install pandas numpy scikit-learn
Generate the sample data by running:
python generate_customer_test_data.py
Run the Scikit-learn example:
python skickit_learn_example.py
Sample output:

```
Accuracy: 0.66

Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.90      0.79       141
           1       0.26      0.08      0.13        59

    accuracy                           0.66       200
   macro avg       0.48      0.49      0.46       200
weighted avg       0.57      0.66      0.59       200

Cross-validation scores: [0.66875 0.7     0.69375 0.6625  0.65625]
Mean CV score: 0.67625

Top 10 important features:
                       feature  importance
2               MonthlyCharges    0.270630
3                 TotalCharges    0.263459
0                          Age    0.234077
1                       Tenure    0.132874
6            Contract_One year    0.026365
7            Contract_Two year    0.026303
5           InternetService_No    0.023642
4  InternetService_Fiber optic    0.022650
```