Implement an autoencoder neural network in PyTorch for unsupervised anomaly detection in network traffic. Learn PyTorch's nn.Module, custom datasets, DataLoaders, and how to identify outliers - critical for security and monitoring applications.
This document explains the PyTorch example provided in pytorch_example.py.
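The example demonstrates how to load and preprocess network traffic data, define an autoencoder with nn.Module, wrap the data in a custom Dataset and DataLoader, train the model, flag anomalies by reconstruction error, save and load the model, and visualize the results.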
    import os

    import numpy as np
    import pandas as pd
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import Dataset, DataLoader
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.compose import ColumnTransformer

    output_dir = os.path.dirname(os.path.abspath(__file__))

    def load_data(filename):
        data_path = os.path.join(output_dir, filename)
        df = pd.read_csv(data_path)
        return df

    df = load_data('network_traffic.csv')
This section imports necessary modules and loads the network traffic data from a CSV file. The file path is constructed using the script's directory to ensure it works regardless of the current working directory.
Generate test network data by using the script generate_network_traffic_data.py:
python generate_network_traffic_data.py
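The generator script itself is not reproduced in this document. As a rough idea of the input the rest of the example expects, the sketch below builds a small DataFrame with the columns the preprocessing code reads (timestamp, source_ip, dest_ip, protocol, packet_count, byte_count, is_anomaly). Everything else is an assumption made for illustration: the helper name make_sample_traffic, the value ranges, the protocol list, and the 5% anomaly rate are hypothetical and may differ from what generate_network_traffic_data.py produces.

```python
import numpy as np
import pandas as pd


def make_sample_traffic(n_rows=200, anomaly_rate=0.05, seed=0):
    """Hypothetical stand-in for the data generator; not the original script."""
    rng = np.random.default_rng(seed)
    is_anomaly = rng.random(n_rows) < anomaly_rate

    # Assumption: anomalous rows carry much larger packet counts than normal rows.
    packet_count = np.where(
        is_anomaly,
        rng.integers(5_000, 10_000, n_rows),
        rng.integers(1, 1_000, n_rows),
    )
    byte_count = packet_count * rng.integers(40, 1_500, n_rows)

    def random_ips(prefix):
        # Fill the last two octets with random values in 0..255.
        return [f"{prefix}.{a}.{b}" for a, b in rng.integers(0, 256, size=(n_rows, 2))]

    return pd.DataFrame({
        "timestamp": pd.date_range("2024-01-01", periods=n_rows, freq="min"),
        "source_ip": random_ips("192.168"),
        "dest_ip": random_ips("10.0"),
        "protocol": rng.choice(["TCP", "UDP", "ICMP"], size=n_rows),
        "packet_count": packet_count,
        "byte_count": byte_count,
        "is_anomaly": is_anomaly.astype(int),
    })


sample_df = make_sample_traffic()
print(sample_df.head())
```

Writing such a frame out with sample_df.to_csv('network_traffic.csv', index=False) would give load_data a file with the expected column names.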
    def encode_ip(ip):
        return [int(x) for x in ip.split('.')]

    features = df.drop(['timestamp', 'is_anomaly'], axis=1)
    labels = df['is_anomaly']

    features['source_ip_encoded'] = features['source_ip'].apply(encode_ip)
    features['dest_ip_encoded'] = features['dest_ip'].apply(encode_ip)

    numeric_features = ['packet_count', 'byte_count']
    ip_features = ['source_ip_encoded', 'dest_ip_encoded']
    categorical_features = ['protocol']

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_features),
            ('cat', OneHotEncoder(sparse_output=False), categorical_features)
        ])

    features_processed = preprocessor.fit_transform(
        features.drop(ip_features + ['source_ip', 'dest_ip'], axis=1))

    ip_data = np.hstack([np.vstack(features['source_ip_encoded']),
                         np.vstack(features['dest_ip_encoded'])])
    features_processed = np.hstack([features_processed, ip_data])
    features_processed = features_processed.astype(np.float32)
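This section preprocesses the data by dropping the timestamp and is_anomaly columns from the feature set, splitting each IP address into its four integer octets, standardizing packet_count and byte_count with StandardScaler, one-hot encoding protocol, and concatenating everything into a single float32 feature matrix.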
    class Autoencoder(nn.Module):
        def __init__(self, input_dim):
            super(Autoencoder, self).__init__()
            self.encoder = nn.Sequential(
                nn.Linear(input_dim, 64),
                nn.ReLU(),
                nn.Linear(64, 32),
                nn.ReLU(),
                nn.Linear(32, 16),
                nn.ReLU(),
                nn.Linear(16, 8)
            )
            self.decoder = nn.Sequential(
                nn.Linear(8, 16),
                nn.ReLU(),
                nn.Linear(16, 32),
                nn.ReLU(),
                nn.Linear(32, 64),
                nn.ReLU(),
                nn.Linear(64, input_dim)
            )

        def forward(self, x):
            encoded = self.encoder(x)
            decoded = self.decoder(encoded)
            return decoded
This defines the Autoencoder model using PyTorch's nn.Module. The model consists of an encoder that compresses the input data and a decoder that reconstructs it.
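A quick way to see the compression at work is to push a random batch through the encoder and the full model and compare shapes. The sketch below repeats the architecture from the example so that it runs on its own; the input dimension of 13 and the batch size of 4 are arbitrary assumptions.

```python
import torch
import torch.nn as nn


class Autoencoder(nn.Module):
    """Same layer sizes as the example: input_dim -> 64 -> 32 -> 16 -> 8 and back."""

    def __init__(self, input_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 8),
        )
        self.decoder = nn.Sequential(
            nn.Linear(8, 16), nn.ReLU(),
            nn.Linear(16, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


input_dim = 13  # assumed; the real value comes from the preprocessed feature matrix
model = Autoencoder(input_dim)
batch = torch.randn(4, input_dim)

with torch.no_grad():
    latent = model.encoder(batch)   # compressed representation
    reconstruction = model(batch)   # decoded back to the input size

print(latent.shape)          # torch.Size([4, 8])
print(reconstruction.shape)  # torch.Size([4, 13])
```

Each sample is squeezed into 8 values before being expanded again, which is what forces the network to learn the dominant structure of the data rather than copy its input.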
    class NetworkTrafficDataset(Dataset):
        def __init__(self, features, labels):
            self.features = torch.FloatTensor(features)
            self.labels = torch.FloatTensor(labels.values)

        def __len__(self):
            return len(self.features)

        def __getitem__(self, idx):
            return self.features[idx], self.labels[idx]

    dataset = NetworkTrafficDataset(features_processed, labels)
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
This creates a custom Dataset class and a DataLoader for efficient batching and shuffling of the data.
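As an illustration of what the DataLoader yields, the following sketch wraps random arrays in a small Dataset and records the shape of every batch. The class name ArrayDataset, the 100 rows, and the 13 features are assumptions chosen for the demonstration, not part of the original script.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ArrayDataset(Dataset):
    """Hypothetical minimal dataset over two NumPy arrays of equal length."""

    def __init__(self, features, labels):
        self.features = torch.from_numpy(features)
        self.labels = torch.from_numpy(labels)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


rng = np.random.default_rng(0)
features = rng.normal(size=(100, 13)).astype(np.float32)
labels = np.zeros(100, dtype=np.float32)

loader = DataLoader(ArrayDataset(features, labels), batch_size=32, shuffle=True)

# Each iteration returns a (features, labels) pair stacked along the batch axis.
batch_shapes = [tuple(x.shape) for x, _ in loader]
print(batch_shapes)  # [(32, 13), (32, 13), (32, 13), (4, 13)]
```

The last batch is smaller because 100 is not a multiple of 32; passing drop_last=True to DataLoader would discard it instead.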
    input_dim = features_processed.shape[1]
    model = Autoencoder(input_dim)
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters())

    num_epochs = 50
    for epoch in range(num_epochs):
        for data in dataloader:
            inputs, _ = data
            outputs = model(inputs)
            loss = criterion(outputs, inputs)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
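This section trains the Autoencoder model using the Adam optimizer and Mean Squared Error loss, with each input serving as its own reconstruction target. The labels are ignored during training, and the loss printed every 10 epochs is the loss of the last batch processed.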
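Because a single batch can be noisy, it may be more informative to track the mean loss over a whole epoch. The sketch below shows one way to do that on synthetic data; the tiny network, the learning rate of 1e-2, the 20 epochs, and the low-rank random data are all assumptions made to keep the example short, not settings from the original script.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 10 features generated from 3 underlying factors.
data = torch.randn(256, 3) @ torch.randn(3, 10)
loader = DataLoader(TensorDataset(data), batch_size=32, shuffle=True)

# Deliberately small autoencoder with a 3-unit bottleneck.
model = nn.Sequential(
    nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 3),
    nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 10),
)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-2)

epoch_losses = []
for epoch in range(20):
    running_loss = 0.0
    for (inputs,) in loader:
        loss = criterion(model(inputs), inputs)  # target is the input itself
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Weight by batch size so a smaller final batch is averaged correctly.
        running_loss += loss.item() * inputs.size(0)
    epoch_losses.append(running_loss / len(data))

print(f"first epoch: {epoch_losses[0]:.4f}, last epoch: {epoch_losses[-1]:.4f}")
```

Plotting epoch_losses against the epoch number gives a smoother picture of convergence than the last-batch value alone.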
    model.eval()
    with torch.no_grad():
        features_tensor = torch.FloatTensor(features_processed)
        reconstructions = model(features_tensor)
        mse_loss = nn.MSELoss(reduction='none')
        mse = mse_loss(reconstructions, features_tensor)
        mse = mse.mean(axis=1).numpy()

    threshold = np.percentile(mse, 95)  # 95th percentile as threshold
    anomalies = mse > threshold
This part uses the trained model to detect anomalies: it computes the mean squared reconstruction error for each sample and flags every sample whose error exceeds the 95th percentile of all errors.
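One consequence of a percentile threshold is worth seeing in numbers: it always flags the top 5% of samples, however many true anomalies exist. The NumPy sketch below simulates reconstructions (no model involved) in which 20 of 1000 rows are reconstructed badly, then compares the flags with the known labels. The simulated data and the noise levels are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated inputs and near-perfect reconstructions.
inputs = rng.normal(size=(1000, 6))
reconstructions = inputs + rng.normal(scale=0.1, size=inputs.shape)

# Pretend the first 20 rows are anomalies that reconstruct poorly.
reconstructions[:20] += 3.0
labels = np.zeros(1000, dtype=bool)
labels[:20] = True

# Per-sample mean squared error, then the same 95th-percentile rule.
errors = ((reconstructions - inputs) ** 2).mean(axis=1)
threshold = np.percentile(errors, 95)
flagged = errors > threshold

true_positives = np.sum(flagged & labels)
precision = true_positives / flagged.sum()
recall = true_positives / labels.sum()

print(flagged.sum())  # 50 rows flagged, i.e. 5% of 1000
print(precision)      # 0.4 (20 of the 50 flagged rows are real anomalies)
print(recall)         # 1.0 (all 20 anomalies are caught)
```

When labels such as is_anomaly are available, comparing them with the flags in this way is a simple check on whether the chosen percentile suits the data.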
    model_path = os.path.join(output_dir, 'anomaly_detection_model.pth')
    torch.save(model.state_dict(), model_path)

    loaded_model = Autoencoder(input_dim)
    loaded_model.load_state_dict(torch.load(model_path))
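This demonstrates how to save and load the trained model: the model's state_dict is written to anomaly_detection_model.pth and then loaded into a freshly constructed Autoencoder with the same input dimension.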
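A simple sanity check after loading is to confirm that the restored model produces the same outputs as the one that was saved. The sketch below does this with a small stand-in network and a temporary directory; the helper build_model and the layer sizes are assumptions for illustration, and the same pattern should apply to the Autoencoder class.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model():
    # Hypothetical stand-in architecture; both copies must share the same layout.
    return nn.Sequential(nn.Linear(13, 8), nn.ReLU(), nn.Linear(8, 13))


model = build_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pth")
    torch.save(model.state_dict(), model_path)

    # A fresh instance starts with different random weights until loaded.
    restored = build_model()
    restored.load_state_dict(torch.load(model_path))

model.eval()
restored.eval()

sample = torch.randn(5, 13)
with torch.no_grad():
    outputs_match = torch.equal(model(sample), restored(sample))

print(outputs_match)  # True
```

Since only the parameters are stored, the code that defines the architecture has to be available wherever the file is loaded.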
    plt.figure(figsize=(12, 6))
    plt.plot(df['timestamp'], mse, label='Reconstruction Error')
    plt.axhline(y=threshold, color='r', linestyle='--', label='Anomaly Threshold')
    plt.title('Network Traffic Anomaly Detection')
    plt.xlabel('Timestamp')
    plt.ylabel('Reconstruction Error')
    plt.legend()
    plt.savefig(os.path.join(output_dir, 'anomaly_detection.png'))
    plt.close()
Finally, this section visualizes the reconstruction error and the anomaly threshold.

To run this example:
Ensure you have the required libraries installed:
pip install torch numpy pandas matplotlib scikit-learn
Generate the sample data by running generate_network_traffic_data.py, as described above.
Run the PyTorch example:
python pytorch_example.py
The script will load the data, train the model, detect anomalies, save the model as anomaly_detection_model.pth, and write the plot to anomaly_detection.png. Both files are placed in the same directory as the script.
We create a custom NetworkTrafficDataset class and use PyTorch's DataLoader.
Significance: This allows efficient handling of large datasets, enabling batch processing and easy shuffling of data.
We implement an autoencoder using PyTorch's nn.Module.
Significance: Autoencoders are well suited to unsupervised tasks like anomaly detection. They learn to compress and reconstruct typical data, so inputs that differ from the patterns seen in training tend to be reconstructed poorly.
We use PyTorch's optimizers and loss functions to train the model.
Significance: PyTorch's dynamic computation graph allows for flexible and efficient training of deep learning models.
We use the trained model to reconstruct input data and calculate reconstruction error.
Significance: High reconstruction error indicates potential anomalies, providing a data-driven approach to detect unusual network traffic patterns.
We demonstrate how to save and load PyTorch models.
Significance: This feature is crucial for deploying models in production environments, allowing for model persistence and easy transfer between systems.
We use matplotlib to visualize the reconstruction error and anomaly threshold.
Significance: Visualization helps in understanding the model's performance and interpreting the results, which is crucial for data engineers and analysts.