HTTP Status Code Prediction with Machine Learning

With the ever-increasing sophistication of digital systems, the need to monitor, understand, and anticipate issues in these systems has become paramount. One of the primary ways of doing this is through log analysis – the practice of examining log entries generated by system activities.

In this blog post, we’ll take a deep dive into a specific application of machine learning in log analysis: predicting HTTP status codes from log messages. To make this concrete, we’ll walk through a Python script that implements the prediction model.

HTTP Status Codes: The Heartbeat of the Web

HTTP status codes are three-digit numbers that a server returns in response to a client’s request. They summarize the result of the request at a glance: did it succeed, was it unauthorized, was the resource not found, or did the server hit an error? Predicting these codes can give us insight into future issues and let us handle potential problems proactively, resulting in better system reliability and user experience.
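
As a small illustration of how the first digit of a status code determines its broad class, here is a minimal sketch. The helper name `status_class` is hypothetical and not part of the script discussed later; `HTTPStatus` comes from Python's standard library.

```python
from http import HTTPStatus

def status_class(code: int) -> str:
    """Return the broad class of an HTTP status code based on its first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")

# Print the standard reason phrase and the class for a few codes.
for code in (200, 201, 401, 404, 422):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```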

The Machine Learning Approach

We’ll use a supervised machine learning model to predict HTTP status codes based on the associated log messages. Specifically, we’ll leverage a random forest classifier, which is a powerful ensemble learning method known for its versatility and robustness.
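
To show the shape of the approach before looking at the full script, here is a minimal sketch on a handful of made-up messages. The messages, labels, and the name `toy_model` are assumptions for illustration only; real log lines look different, and a model trained on six examples says nothing about real-world accuracy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy messages; not taken from the real log file.
messages = [
    "user created successfully",
    "resource created",
    "invalid token supplied",
    "token expired please log in",
    "validation failed for field email",
    "validation failed for field name",
]
labels = [201, 201, 401, 401, 422, 422]

# Text goes in, TF-IDF turns it into numeric features, the forest classifies.
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
toy_model.fit(messages, labels)

prediction = toy_model.predict(["validation failed for field age"])
print(prediction)
```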

A Deep Dive into the Script

import csv
import re
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import RandomOverSampler

# Collapse doubled double-quotes and strip the surrounding quotes
# so that each message reads as a plain JSON string
def preprocess_json_string(json_string):
    return json_string.replace('""', '"').strip('"')

# Function to extract status codes from JSON strings
def extract_status_code(json_string):
    match = re.search(r'"status_code":(\d+)', json_string)
    return int(match.group(1)) if match else None

# Read log lines from a CSV file
log_lines = []
with open('log.csv', 'r', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        log_lines.append(row['Message'])

# Preprocess log lines and extract status codes
preprocessed_log_lines = [preprocess_json_string(log_line) for log_line in log_lines]
status_codes = [extract_status_code(log_line) for log_line in preprocessed_log_lines]

# Create a DataFrame
data = pd.DataFrame({'log_line': preprocessed_log_lines, 'status_code': status_codes})

# Remove rows with missing values
data = data.dropna()

# Save the extracted data to a new CSV file
data.to_csv('extracted_data.csv', index=False)

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data['log_line'], data['status_code'], test_size=0.2, random_state=42)

# Oversample the training data
ros = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = ros.fit_resample(pd.DataFrame(X_train), y_train)

# Create a pipeline with TfidfVectorizer and RandomForestClassifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=42))
])

# Train the model
pipeline.fit(X_train_resampled['log_line'], y_train_resampled)

# Predict status codes for the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Our Python script performs the prediction task in a series of well-defined steps. Let’s dive into each one:

Step 1: Data Preparation. We start by reading the log lines from the Message column of a CSV file named ‘log.csv’. Custom Python functions then clean each line and extract its HTTP status code. The cleaned log lines and their status codes are stored in a pandas DataFrame.
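
To see what the two helper functions do, here is a short sketch that runs them on a single value. The sample `raw` string is hypothetical; the real Message column may contain different fields.

```python
import re

def preprocess_json_string(json_string):
    # Collapse doubled quotes and drop the outer quotes.
    return json_string.replace('""', '"').strip('"')

def extract_status_code(json_string):
    # Return the status code as an int, or None if the field is absent.
    match = re.search(r'"status_code":(\d+)', json_string)
    return int(match.group(1)) if match else None

# Hypothetical raw value with doubled quotes and an outer pair of quotes.
raw = '"{""path"":""/api/users"",""status_code"":422}"'

clean = preprocess_json_string(raw)
print(clean)
print(extract_status_code(clean))
print(extract_status_code("plain text line"))
```

Lines without a `status_code` field yield `None`, which is what the cleaning step removes.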

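Step 2: Data Cleaning. Next, we remove rows with missing values, that is, log lines in which no status code was found. This ensures that every example the model sees has a label. The cleaned data is also saved to ‘extracted_data.csv’ for later inspection.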
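
One side effect of this step is worth knowing: a column that mixes integers with missing values is stored by pandas as floating point, which is why the labels appear as 200.0 rather than 200 in the report further down. The sketch below uses made-up rows; the cast back to integers is an optional addition, not something the original script does.

```python
import pandas as pd

# Hypothetical rows; the second one has no status code.
frame = pd.DataFrame({
    "log_line": ["a", "b", "c"],
    "status_code": [200, None, 404],
})
print(frame["status_code"].dtype)  # float, because of the missing value

# Drop unlabeled rows, then (optionally) cast the labels back to integers.
cleaned = frame.dropna().astype({"status_code": int})
print(cleaned["status_code"].tolist())
```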

Step 3: Data Splitting. The data is split into a training set (80% of the rows) and a test set (20%). The training set is used to train the model, while the test set is held back to evaluate the model’s performance on data it has not seen.
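
The script uses a plain random split. With rare classes, a stratified split is a common alternative because it keeps the class proportions the same in both sets. The sketch below uses synthetic data and is not what the original script does; note that `stratify` requires at least two samples of every class.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic data: 80 lines with status 200 and 20 lines with status 404.
lines = [f"log line {i}" for i in range(100)]
codes = [200] * 80 + [404] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    lines, codes, test_size=0.2, random_state=42, stratify=codes
)

# Both sets keep the 80/20 ratio between the two classes.
print(Counter(y_tr), Counter(y_te))
```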

Step 4: Handling Class Imbalance. Real-world datasets often suffer from class imbalance – certain classes (in our case, certain HTTP status codes) have far fewer samples than others. To counter this, we oversample the minority classes in the training data so that the model doesn’t become biased towards the majority class. Only the training set is resampled; the test set keeps its original distribution. This step can be skipped if every class already has enough samples.
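
To make the idea concrete, here is a dependency-free sketch of random oversampling: every class is topped up with randomly repeated samples until it matches the size of the largest class. The function `random_oversample` is a hypothetical stand-in written for illustration; the script itself relies on `RandomOverSampler` from imbalanced-learn.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=42):
    """Duplicate random samples of each class until all classes are equally large."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap to the majority class size.
        extra = rng.choices(items, k=target - len(items))
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Made-up, imbalanced training data.
texts = ["ok"] * 6 + ["not found"] * 2 + ["forbidden"]
labels = [200] * 6 + [404] * 2 + [403]

_, balanced = random_oversample(texts, labels)
print(Counter(balanced))
```

Because the added rows are exact copies, oversampling does not add new information; it only changes how much weight the rare classes carry during training.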

Step 5: Model Training. We build a machine learning pipeline that first transforms our log lines into numerical features using TfidfVectorizer and then feeds these features into a RandomForestClassifier. The model is trained using the oversampled training data.
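
One point to check at this stage: the labels were extracted from the same text that is vectorized, so the features may still contain the status_code field itself. The script shown above does not remove it. If the aim is to predict the code from the rest of the message, one option is to mask the field before training. This is a sketch under that assumption; `mask_status_code` is a hypothetical helper.

```python
import re

# Matches the same field that extract_status_code reads.
STATUS_FIELD = re.compile(r'"status_code":\d+')

def mask_status_code(log_line):
    """Replace the status code value so it cannot be used as a feature."""
    return STATUS_FIELD.sub('"status_code":null', log_line)

# Hypothetical log line.
sample = '{"path":"/x","status_code":404}'
masked = mask_status_code(sample)
print(masked)
```

The masked lines would then be passed to the pipeline in place of the raw ones.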

Step 6: Model Evaluation. Finally, we evaluate the model’s performance on the test data using a classification report and a confusion matrix. The report gives precision, recall, and F1-score for each status code, while the confusion matrix shows which codes are mistaken for which.

Classification Report

Class          Precision   Recall   F1-score   Support
------------------------------------------------------
200.0               1.00     0.93       0.96      7448
201.0               0.32     0.95       0.48       424
400.0               1.00     1.00       1.00      1062
401.0               0.17     0.99       0.30        98
403.0               0.00     0.00       0.00         2
404.0               0.74     1.00       0.85       207
412.0               0.94     0.98       0.96        96
422.0               1.00     0.96       0.98     20495
------------------------------------------------------
Accuracy                                0.95     29832
Macro Avg           0.65     0.85       0.69     29832
Weighted Avg        0.98     0.95       0.96     29832

Confusion Matrix (rows: actual class, columns: predicted class)

Class   200.0 201.0 400.0 401.0 403.0 404.0 412.0 422.0
----------------------------------------------------------
200.0   6953   455     0    12     3    22     3     0
201.0      1   404     0     0    19     0     0     0
400.0      1     0  1061     0     0     0     0     0
401.0      1     0     0    97     0     0     0     0
403.0      2     0     0     0     0     0     0     0
404.0      0     0     0     0     0   207     0     0
412.0      1     0     0     0     0     1    94     0
422.0     24   385     0   448     3    49     3 19583
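
The per-class figures in the classification report can be recomputed directly from this matrix: precision divides the diagonal entry by its column sum, and recall divides it by its row sum. The sketch below copies the matrix from above; the function name `precision_recall` is just for illustration.

```python
labels = [200, 201, 400, 401, 403, 404, 412, 422]

# Rows are actual classes, columns are predicted classes.
matrix = [
    [6953, 455, 0, 12, 3, 22, 3, 0],
    [1, 404, 0, 0, 19, 0, 0, 0],
    [1, 0, 1061, 0, 0, 0, 0, 0],
    [1, 0, 0, 97, 0, 0, 0, 0],
    [2, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 207, 0, 0],
    [1, 0, 0, 0, 0, 1, 94, 0],
    [24, 385, 0, 448, 3, 49, 3, 19583],
]

def precision_recall(matrix, index):
    """Compute precision and recall for the class at the given index."""
    true_positives = matrix[index][index]
    predicted = sum(row[index] for row in matrix)  # column sum
    actual = sum(matrix[index])                    # row sum (support)
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

for i, label in enumerate(labels):
    p, r = precision_recall(matrix, i)
    print(f"{label}: precision={p:.2f} recall={r:.2f} support={sum(matrix[i])}")

correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / sum(map(sum, matrix))
print(f"accuracy={accuracy:.2f}")
```

Read this way, the low precision for 201 and 401 has a clear source: 455 actual 200 lines and 385 actual 422 lines were predicted as 201, and 448 actual 422 lines were predicted as 401. With only 2 test samples, the 403 row is too small to draw conclusions from.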

The Bigger Picture

The ability to predict HTTP status codes from log messages is just one example of how machine learning can bring value to log analysis. With insight into likely future system behavior, we can be proactive in maintaining and improving system performance, ultimately enhancing user experience.

This example underscores the potential of machine learning in making sense of unstructured text data. As we continue to generate vast amounts of log data, the use of machine learning in log analysis is set to become an increasingly important tool in our digital toolbox.
