PyTorch Tutorial

This document explains the PyTorch example provided in pytorch_example.py.

Overview

The example demonstrates how to:

Load and preprocess network traffic data
Build an autoencoder model using PyTorch
Train the model for anomaly detection
Detect anomalies in network traffic
Save, load, and visualize the results

Code Explanation

Imports and Data Loading

import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

output_dir = os.path.dirname(os.path.abspath(__file__))

def load_data(filename):
    data_path = os.path.join(output_dir, filename)
    df = pd.read_csv(data_path)
    return df

df = load_data('network_traffic.csv')

This section imports necessary modules and loads the network traffic data from a CSV file. The file path is constructed using the script's directory to ensure it works regardless of the current working directory.

If network_traffic.csv does not exist yet, generate test network data first by running the script generate_network_traffic_data.py:

python generate_network_traffic_data.py
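The contents of generate_network_traffic_data.py are not shown in this document. As a rough stand-in, the sketch below produces rows with the columns that the rest of the example reads (timestamp, source_ip, dest_ip, protocol, packet_count, byte_count, is_anomaly). The function names, the value ranges, the protocol list, and the 5% anomaly rate are all assumptions made for illustration and may differ from the real script.

```python
import csv
import random
from datetime import datetime, timedelta

COLUMNS = ['timestamp', 'source_ip', 'dest_ip', 'protocol',
           'packet_count', 'byte_count', 'is_anomaly']

def random_ip(rng):
    # Four octets in 0-255, joined as dotted-quad text.
    return '.'.join(str(rng.randint(0, 255)) for _ in range(4))

def make_sample_rows(n, seed=0, anomaly_rate=0.05):
    """Hypothetical generator: ranges and anomaly rate are illustrative only."""
    rng = random.Random(seed)
    start = datetime(2024, 1, 1)
    rows = []
    for i in range(n):
        is_anomaly = int(rng.random() < anomaly_rate)
        # Assumed: anomalous records carry far more packets than normal ones.
        packets = rng.randint(5000, 10000) if is_anomaly else rng.randint(1, 100)
        rows.append({
            'timestamp': (start + timedelta(seconds=i)).isoformat(),
            'source_ip': random_ip(rng),
            'dest_ip': random_ip(rng),
            'protocol': rng.choice(['TCP', 'UDP', 'ICMP']),
            'packet_count': packets,
            'byte_count': packets * rng.randint(40, 1500),
            'is_anomaly': is_anomaly,
        })
    return rows

def write_csv(rows, path):
    # Write the rows with a header line so pd.read_csv can pick up the column names.
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

Calling write_csv(make_sample_rows(1000), 'network_traffic.csv') from the script's directory would create a file that load_data can read.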

Data Preprocessing

def encode_ip(ip):
    return [int(x) for x in ip.split('.')]

features = df.drop(['timestamp', 'is_anomaly'], axis=1)
labels = df['is_anomaly']

features['source_ip_encoded'] = features['source_ip'].apply(encode_ip)
features['dest_ip_encoded'] = features['dest_ip'].apply(encode_ip)

numeric_features = ['packet_count', 'byte_count']
ip_features = ['source_ip_encoded', 'dest_ip_encoded']
categorical_features = ['protocol']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(sparse_output=False), categorical_features)
    ])

features_processed = preprocessor.fit_transform(features.drop(ip_features + ['source_ip', 'dest_ip'], axis=1))
ip_data = np.hstack([np.vstack(features['source_ip_encoded']), np.vstack(features['dest_ip_encoded'])])
features_processed = np.hstack([features_processed, ip_data])
features_processed = features_processed.astype(np.float32)

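This section preprocesses the data by:

Encoding each IP address as a list of its four integer octets
Scaling the numeric features (packet_count, byte_count) with StandardScaler
One-hot encoding the categorical protocol feature
Combining all processed features into a single float32 array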
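To make the IP encoding step concrete, here is a small sketch on two made-up records. It reuses encode_ip from the example and shows that the source and destination addresses together contribute eight columns. Note that in the example these octets (0 to 255) are appended unscaled next to standardized features; dividing by 255, shown at the end, is one possible adjustment and is an assumption, not part of the original script.

```python
import numpy as np

def encode_ip(ip):
    # '192.168.1.10' -> [192, 168, 1, 10]
    return [int(x) for x in ip.split('.')]

# Made-up addresses for illustration.
source_ips = ['192.168.1.10', '10.0.0.5']
dest_ips = ['10.0.0.1', '172.16.0.7']

# One row per record, four columns per address.
src = np.vstack([encode_ip(ip) for ip in source_ips])
dst = np.vstack([encode_ip(ip) for ip in dest_ips])

# Side by side: eight IP columns per record.
ip_data = np.hstack([src, dst]).astype(np.float32)
print(ip_data.shape)

# Optional (assumption): bring octets into [0, 1] so they are on a scale
# closer to the standardized numeric features.
ip_scaled = ip_data / 255.0
```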

Model Definition

class Autoencoder(nn.Module):
    def __init__(self, input_dim):
        super(Autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 8)
        )
        self.decoder = nn.Sequential(
            nn.Linear(8, 16),
            nn.ReLU(),
            nn.Linear(16, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

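This defines the Autoencoder model using PyTorch's nn.Module. The model consists of an encoder that compresses the input through layers of 64, 32, and 16 units down to an 8-dimensional representation, and a decoder that mirrors those layers to reconstruct the input at its original dimension.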
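A quick way to check the architecture is to pass a random batch through it and inspect the shapes. The sketch below restates the same layers compactly. The input_dim of 13 is an assumption (2 scaled numeric features, 3 one-hot protocol columns, and 8 IP octets); the real value depends on how many distinct protocols the data contains.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 8),
        )
        self.decoder = nn.Sequential(
            nn.Linear(8, 16), nn.ReLU(),
            nn.Linear(16, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

input_dim = 13  # assumed feature count, see the note above
model = Autoencoder(input_dim)
batch = torch.randn(4, input_dim)  # 4 random records

with torch.no_grad():
    latent = model.encoder(batch)   # compressed representation
    reconstructed = model(batch)    # same shape as the input

latent_shape = tuple(latent.shape)
output_shape = tuple(reconstructed.shape)
n_params = sum(p.numel() for p in model.parameters())
print(latent_shape, output_shape, n_params)
```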

Dataset and DataLoader

class NetworkTrafficDataset(Dataset):
    def __init__(self, features, labels):
        self.features = torch.FloatTensor(features)
        self.labels = torch.FloatTensor(labels.values)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

dataset = NetworkTrafficDataset(features_processed, labels)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

This creates a custom Dataset class that returns (features, label) pairs, and a DataLoader that shuffles the data and serves it in batches of 32.
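The batching behavior can be seen on random stand-in data. This sketch uses PyTorch's built-in TensorDataset rather than the custom NetworkTrafficDataset class, purely to keep the example short; the array sizes (100 records, 13 features) are assumptions.

```python
import math
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in for features_processed and labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 13)).astype(np.float32)
labels = (rng.random(100) < 0.05).astype(np.float32)

dataset = TensorDataset(torch.from_numpy(features), torch.from_numpy(labels))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# With drop_last left at its default (False), the final batch holds the remainder.
batch_sizes = [inputs.shape[0] for inputs, _ in dataloader]
expected_batches = math.ceil(len(dataset) / 32)
print(batch_sizes, expected_batches)
```

Shuffling changes which records land in each batch on every pass, but not the batch sizes.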

Model Training

input_dim = features_processed.shape[1]
model = Autoencoder(input_dim)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters())

num_epochs = 50
for epoch in range(num_epochs):
    for data in dataloader:
        inputs, _ = data
        outputs = model(inputs)
        loss = criterion(outputs, inputs)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

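This section trains the Autoencoder model for 50 epochs using the Adam optimizer (with its default learning rate) and Mean Squared Error loss between each input batch and its reconstruction. The labels are not used during training, and the loss printed every 10 epochs is that of the last batch in the epoch.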
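Because the printed value in the loop above comes from a single batch, it can be noisy. One possible variation, sketched here under assumptions, averages the loss over the whole epoch instead. The small two-layer model, the random data, and the 5 epochs are stand-ins chosen to keep the sketch fast; they are not the example's settings.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data: 256 records with 13 features (assumed sizes).
inputs_all = torch.randn(256, 13)
dataloader = DataLoader(TensorDataset(inputs_all), batch_size=32, shuffle=True)

# A deliberately small autoencoder, only for this sketch.
model = nn.Sequential(nn.Linear(13, 8), nn.ReLU(), nn.Linear(8, 13))
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters())

num_epochs = 5
epoch_losses = []
for epoch in range(num_epochs):
    running_loss = 0.0
    for (inputs,) in dataloader:
        outputs = model(inputs)
        loss = criterion(outputs, inputs)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Weight each batch by its size so a short final batch counts proportionally.
        running_loss += loss.item() * inputs.size(0)

    epoch_losses.append(running_loss / len(dataloader.dataset))
    print(f'Epoch [{epoch + 1}/{num_epochs}], Mean loss: {epoch_losses[-1]:.4f}')
```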

Anomaly Detection

model.eval()
with torch.no_grad():
    features_tensor = torch.FloatTensor(features_processed)
    reconstructions = model(features_tensor)
    mse_loss = nn.MSELoss(reduction='none')
    mse = mse_loss(reconstructions, features_tensor)
    mse = mse.mean(axis=1).numpy()

    threshold = np.percentile(mse, 95)  # 95th percentile as threshold
    anomalies = mse > threshold

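This part uses the trained model to detect anomalies: it computes the mean squared reconstruction error of each record and flags the records whose error exceeds the 95th percentile. With this threshold, roughly 5% of the records are flagged regardless of how many true anomalies the data contains.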
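The example keeps the is_anomaly labels but never compares them with the flags it produces. A minimal sketch of such a comparison follows, using precision and recall. The precision_recall helper and the hand-made error values are hypothetical and only illustrate the calculation; they are not results from the trained model.

```python
import numpy as np

def precision_recall(flags, labels):
    """Hypothetical helper: compare boolean flags with ground-truth labels."""
    flags = np.asarray(flags, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    tp = np.sum(flags & labels)    # flagged and truly anomalous
    fp = np.sum(flags & ~labels)   # flagged but normal
    fn = np.sum(~flags & labels)   # anomalous but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

# Hand-made reconstruction errors: 19 small values and one large one.
mse = np.append(np.arange(1, 20) * 0.01, 5.0)
# Hand-made labels: the last two records are true anomalies.
labels = np.zeros(20, dtype=int)
labels[[18, 19]] = 1

threshold = np.percentile(mse, 95)  # same rule as in the example
anomalies = mse > threshold
precision, recall = precision_recall(anomalies, labels)
print(f'precision={precision:.2f}, recall={recall:.2f}')
```

In this toy data only the record with the large error is flagged, so one of the two labeled anomalies is missed. A fixed percentile caps how many records can be flagged, which is worth keeping in mind when choosing the threshold.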

Model Saving and Loading

model_path = os.path.join(output_dir, 'anomaly_detection_model.pth')
torch.save(model.state_dict(), model_path)

loaded_model = Autoencoder(input_dim)
loaded_model.load_state_dict(torch.load(model_path))

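This demonstrates how to save and load the trained model. Only the model's parameters (its state_dict) are saved, so loading requires first constructing an Autoencoder with the same input_dim.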
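One way to confirm that a save and load round trip worked is to check that the original and the reloaded model produce identical outputs for the same input. The sketch below does this with a small stand-in model and a temporary directory; build_model and the layer sizes are assumptions for illustration.

```python
import os
import tempfile
import torch
import torch.nn as nn

torch.manual_seed(0)

def build_model():
    # Hypothetical stand-in for Autoencoder(input_dim); the architecture must
    # match between saving and loading for load_state_dict to succeed.
    return nn.Sequential(nn.Linear(13, 8), nn.ReLU(), nn.Linear(8, 13))

model = build_model()
model.eval()
sample = torch.randn(4, 13)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'anomaly_detection_model.pth')
    torch.save(model.state_dict(), model_path)

    loaded_model = build_model()
    # map_location='cpu' lets weights saved on a GPU load on a CPU-only machine.
    loaded_model.load_state_dict(torch.load(model_path, map_location='cpu'))
    loaded_model.eval()

with torch.no_grad():
    outputs_match = torch.equal(model(sample), loaded_model(sample))
print(outputs_match)
```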

Visualization

plt.figure(figsize=(12, 6))
plt.plot(df['timestamp'], mse, label='Reconstruction Error')
plt.axhline(y=threshold, color='r', linestyle='--', label='Anomaly Threshold')
plt.title('Network Traffic Anomaly Detection')
plt.xlabel('Timestamp')
plt.ylabel('Reconstruction Error')
plt.legend()
plt.savefig(os.path.join(output_dir, 'anomaly_detection.png'))
plt.close()

Finally, this section visualizes the reconstruction error and the anomaly threshold.

[Figure: Network Traffic Anomaly Detection Visualization]
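The plot above draws the error curve and the threshold line, but does not mark which records were flagged. A possible extension, sketched here with synthetic errors, adds a scatter layer for the flagged points. The gamma-distributed values, the record-index x-axis, and the output file name are assumptions for illustration.

```python
import os
import tempfile
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Synthetic stand-in for the per-record reconstruction error.
rng = np.random.default_rng(0)
mse = rng.gamma(shape=2.0, scale=0.05, size=200)
threshold = np.percentile(mse, 95)
anomalies = mse > threshold
index = np.arange(len(mse))

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(index, mse, label='Reconstruction Error')
# Mark the flagged records on top of the curve.
ax.scatter(index[anomalies], mse[anomalies], color='r', zorder=3,
           label='Flagged as Anomaly')
ax.axhline(y=threshold, color='r', linestyle='--', label='Anomaly Threshold')
ax.set_title('Network Traffic Anomaly Detection')
ax.set_xlabel('Record Index')
ax.set_ylabel('Reconstruction Error')
ax.legend()

plot_path = os.path.join(tempfile.gettempdir(), 'anomaly_detection_marked.png')
fig.savefig(plot_path)
plt.close(fig)
```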

Running the Example

To run this example:

Ensure you have the required libraries installed:

pip install torch numpy pandas matplotlib scikit-learn

Generate the sample data by running the data generation script generate_network_traffic_data.py.
Run the PyTorch example:
```
python pytorch_example.py
```

The script will load the data, train the model, detect anomalies, and save the trained model (anomaly_detection_model.pth) and a visualization of the results (anomaly_detection.png) in the script's own directory.

Key Features of This PyTorch Project

Custom Dataset and DataLoader:

We create a custom NetworkTrafficDataset class and use PyTorch's DataLoader.
Significance: This allows efficient handling of large datasets, enabling batch processing and easy shuffling of data.

Autoencoder Architecture:

We implement an autoencoder using PyTorch's nn.Module.
Significance: Autoencoders are powerful for unsupervised learning tasks like anomaly detection. They learn to compress and reconstruct the data they are trained on, so records that differ from the typical patterns tend to be reconstructed poorly.

Model Training:

We use PyTorch's optimizers and loss functions to train the model.
Significance: PyTorch's dynamic computation graph allows for flexible and efficient training of deep learning models.

Anomaly Detection:

We use the trained model to reconstruct input data and calculate reconstruction error.
Significance: High reconstruction error indicates potential anomalies, providing a data-driven approach to detect unusual network traffic patterns.

Model Saving and Loading:

We demonstrate how to save and load PyTorch models.
Significance: This feature is crucial for deploying models in production environments, allowing for model persistence and easy transfer between systems.

Visualization:

We use matplotlib to visualize the reconstruction error and anomaly threshold.
Significance: Visualization helps in understanding the model's performance and interpreting the results, which is crucial for data engineers and analysts.

