A Complete Workflow for Log File Anomaly Detection with R, H2O and MinIO

Processing log files in an enterprise environment is no small undertaking. Manual analysis requires expert knowledge and is time-consuming, making it costly and inefficient. Instead, many organizations apply machine learning (ML) techniques to automatically process incoming logs effectively and efficiently.  

A workflow of this type consists of multiple processing steps. Each step can be triggered when it is notified of the arrival of an artifact (in this case a log file chunk) to be processed. In case an audit is needed, the artifacts are saved after each step in the workflow for a period of time before being deleted. This architecture provides a huge advantage because it allows each step in the workflow to be stateless. State is contained in the artifact itself as it advances through the processing pipeline.

With stateless transformations, each step can be scaled independently as needed by deploying more processing units (additional instances of the code for a given workflow step). Workflows such as log file analysis are perfect for the dynamic scaling (both up and down) nature of Kubernetes, but Kubernetes is not necessary to take advantage of this approach.
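To make the idea of a stateless step concrete, here is a minimal sketch in base R. It is only an illustration under stated assumptions: plain directories under tempdir() stand in for buckets, and count_lines_step() is a hypothetical step, not part of the tutorial's code. Everything the step needs arrives as arguments, and everything it produces is written to the destination.

```r
# Stand-in "buckets": plain directories under tempdir() (an assumption for
# illustration; the tutorial itself uses MinIO buckets).
src_bucket  <- file.path(tempdir(), "access-log-files")
dest_bucket <- file.path(tempdir(), "access-log-dataframes")
dir.create(src_bucket,  showWarnings = FALSE, recursive = TRUE)
dir.create(dest_bucket, showWarnings = FALSE, recursive = TRUE)

# A stateless step (hypothetical): it reads one artifact, transforms it, and
# writes the result. Nothing is kept in memory between calls.
count_lines_step <- function(src_bucket, src_object, dest_bucket, dest_object) {
  lines <- readLines(file.path(src_bucket, src_object))
  result <- data.frame(object = src_object, n_lines = length(lines))
  saveRDS(result, file.path(dest_bucket, dest_object))
  invisible(file.path(dest_bucket, dest_object))
}

# An artifact "arrives", and the step is invoked with its location.
writeLines(c("request 1", "request 2", "request 3"),
           file.path(src_bucket, "chunk-001.log"))
out_path <- count_lines_step(src_bucket, "chunk-001.log",
                             dest_bucket, "chunk-001.rds")
step_output <- readRDS(out_path)
print(step_output)
```

Because the function holds no state of its own, any number of copies can run side by side, and the destination of one step can serve as the source of the next.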

In this tutorial we will first develop the necessary components, train the models, and create a production-capable log-processing workflow before putting it all together.

Anomaly detection is an area of Machine Learning (ML) that is powerful and applicable to many domains. In the previous post (Anomaly Detection with R, H2O, and MinIO) I provided a detailed explanation of how anomaly detection works using the MNIST data set. As in that post, we’re going to use H2O, R, and RStudio, and we will apply the same techniques to detecting anomalies in Apache access log files. If you’re looking for more basics around using these tools for anomaly detection, please see the earlier post.

The Apache access logs store information about the requests that occurred on your Apache web server. For instance, when someone visits your website, or makes an HTTP or HTTPS request, a log entry is stored to provide the Apache web server administrator with information such as the IP address of the visitor, what pages they were viewing, status codes of the request, browser used, or the size of the response.

There are many reasons to analyze these files. They are a record of web server requests and are important because they provide insight into the usage patterns of those making requests of the web server. One aspect that is of interest is the number and pattern of requests that might be of a nefarious nature. Being able to identify unusual request patterns can provide insight into potential attacks.

MinIO is high-performance software-defined S3-compatible object storage, making it a powerful and flexible replacement for Amazon S3. The S3 API is the current standard for working with ML and associated data sets. MinIO will be used to persist the artifacts in this workflow.  In order to follow along, please install R and RStudio, have access to an H2O cluster, and if you aren’t already running MinIO, please download and install it.

This blog post is meant to be a starting point for developing a custom AI/ML-based anomaly detection system for log files. Each organization typically customizes the types of scans and detection methods they deploy. Apache access log files contain the details of all inbound accesses to the Apache web server. The format can be customized, so if the example log used here doesn’t match your organization’s log file formats, please adjust the code appropriately.

The sample log file

This tutorial uses a publicly available log file that is roughly 1.5GB.

Example lines from the sample log file:

- - [19/Dec/2020:14:08:06 +0100] "GET /apache-log/access.log HTTP/1.1" 200 233 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "-"
- - [19/Dec/2020:14:08:08 +0100] "GET /favicon.ico HTTP/1.1" 404 217 "http://www.almhuette-raith.at/apache-log/access.log" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "-"
- - [19/Dec/2020:14:14:26 +0100] "GET /robots.txt HTTP/1.1" 200 304 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)" "-"
- - [19/Dec/2020:14:16:44 +0100] "GET /index.php?option=com_phocagallery&view=category&id=2%3Awinterfotos&Itemid=53 HTTP/1.1" 200 30662 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)" "-"
- - [19/Dec/2020:14:29:21 +0100] "GET /administrator/index.php HTTP/1.1" 200 4263 "" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" "-"
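To see how a line in this layout breaks into fields, the short base R sketch below parses a single entry with read.table. The client address 203.0.113.7 is a hypothetical documentation address added for illustration, the user agent is shortened, and the quote and comment.char settings are my own cautious choices rather than what the tutorial code uses.

```r
# One access-log line in the same layout as the sample file.
# The client address 203.0.113.7 is a hypothetical documentation address.
log_line <- paste0(
  '203.0.113.7 - - [19/Dec/2020:14:08:06 +0100] ',
  '"GET /apache-log/access.log HTTP/1.1" 200 233 "-" ',
  '"Mozilla/5.0 (Windows NT 6.3; Win64; x64)" "-"'
)

# read.table splits on whitespace and keeps double-quoted strings together.
# quote = "\"" avoids treating apostrophes in user agents as quote characters;
# comment.char = "" stops a "#" in a URL from truncating the line.
fields <- read.table(text = log_line, quote = "\"", comment.char = "",
                     stringsAsFactors = FALSE)

print(ncol(fields))   # number of columns produced
print(fields$V4)      # first half of the timestamp
print(fields$V5)      # second half (UTC offset)
print(fields$V6)      # request line
```

The timestamp is split across V4 and V5 because it contains a space, which is why the pre-processing code later pastes those two columns back together.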

Create MinIO buckets

We will create a few buckets in MinIO to hold the various artifacts as the workflow executes: access-log-files, access-log-dataframes, access-log-anomaly-dataframes, and bin-models. The buckets can be created and viewed in the MinIO console.

The first bucket we use is access-log-files.  This is where the log files land. They can be delivered using many mechanisms, Apache Kafka being one we see a lot in production (see MinIO integrations for more details). MinIO Lambda Compute Bucket Notification will then be used to trigger processing as each file arrives and is written into the MinIO bucket.

Underlying functions

Before we can run a production log file processing workflow we need a trained anomaly detection model to apply.  Training the model requires creating a training set - an R dataframe that contains the instances of the log requests intended for training.

For this tutorial the required library calls have been gathered into one file that is sourced.  The file packages.R loads the required libraries.  The file PreProcessLogFile.R contains a function that converts an incoming log chunk into a dataframe.  The code reads in the data, drops some columns, renames other columns, manipulates the time components, and augments the log data with continent and country information based on the source IP address using an external data source, https://www.maxmind.com.  After processing the raw data into a more usable dataframe, we store the dataframe in the access-log-dataframes bucket.
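The time manipulation is the least obvious part of that pre-processing, so here is a small stand-alone sketch of it using only base R (the tutorial code itself uses lubridate). It assumes an English locale for the month abbreviation and displays the parsed instant in UTC, so the hour differs from the local hour written in the log.

```r
# Timestamp as it looks after V4 and V5 are pasted back together.
raw_time <- "[19/Dec/2020:14:08:06 +0100]"

# Parse with base R; %z consumes the numeric UTC offset. The result is an
# instant in time, displayed here in UTC.
# Note: %b matches month abbreviations in the current locale (English assumed).
access_time <- as.POSIXct(raw_time, format = "[%d/%b/%Y:%H:%M:%S %z]", tz = "UTC")

features <- data.frame(
  # 1 = Sunday ... 7 = Saturday, the numbering lubridate::wday() uses by default
  wday        = as.POSIXlt(access_time)$wday + 1L,
  access_hour = factor(format(access_time, "%H")),
  access_min  = factor(format(access_time, "%M")),
  access_sec  = factor(format(access_time, "%S"))
)
print(features)
```

Keeping the hour, minute, and second as factors mirrors the tutorial's choice to treat them as categories rather than as a continuous timestamp.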

==== packages.R
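#load necessary libraries
# (the body of each block was cut off in the source; installing the missing
#  package and then loading it is the assumed intent)
if (!require("plumber")) {
  install.packages("plumber")
  library(plumber)
}
if (!require("jsonlite")) {
  install.packages("jsonlite")
  library(jsonlite)
}
if (!require("aws.s3")) {
  install.packages("aws.s3")
  library(aws.s3)
}
if (!require("rgeolocate")) {
  install.packages("rgeolocate")
  library(rgeolocate)
}
if (!require("lubridate")) {
  install.packages("lubridate")
  library(lubridate)
}
if (!require("h2o")) {
  install.packages("h2o")
  library(h2o)
}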
==== end packages.R
==== PreProcessLogFile.R
PreProcessLogFile <- function (srcBucket,srcObject,destBucket,destObject) {
  # set the credentials this R instance uses to access MinIO
  Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin", # enter your credentials
             "AWS_SECRET_ACCESS_KEY" = "minioadmin", # enter your credentials
             "AWS_DEFAULT_REGION" = "",
             "AWS_S3_ENDPOINT" = "") # enter your MinIO endpoint (host:port)
  b <- get_bucket(bucket = srcBucket, use_https = F, region = "")
  df_access <- aws.s3::s3read_using(FUN = read.table, object = srcObject, bucket = b, 
                                    opts = list(use_https = FALSE, region = ""))
  # join the time string back together
  df_access$V4 <- paste(df_access$V4, df_access$V5)
  # remove some noise
  drops <- c("V2", "V3", "V5", "V11") 
  df_access <-df_access[, !(names(df_access) %in% drops)]
  # rename the columns
  names(df_access)[1] <- "client_ip"
  names(df_access)[2] <- "access_time"
  names(df_access)[3] <- "client_request"
  names(df_access)[4] <- "status_code"
  names(df_access)[5] <- "response_size"
  names(df_access)[6] <- "referrer"
  names(df_access)[7] <- "user_agent"
  # load the Geolocation data
  # https://www.maxmind.com
  ips<- df_access$client_ip
  maxmind_file <- "data/Geo/Geolite2-Country_20220517/Geolite2-Country.mmdb"
  # look up each client IP in the MaxMind country database
  country_info <- maxmind(
    ips,
    maxmind_file,
    fields = c("continent_name", "country_name", "country_code")
  )
  # add some columns for location information
  df_access$continent_name <- as.factor(country_info$continent_name)
  df_access$country_name <- as.factor(country_info$country_name)
  df_access$country_code <- as.factor(country_info$country_code)
  # use lubridate to help coerce the access_time string into something usable
  # we might be interested in what day of week, hour, min, second an access occurred - as factors
  df_access$access_time <- dmy_hms(df_access$access_time)
  df_access$wday <- wday(df_access$access_time)
  df_access$access_hour <- as.factor(format(df_access$access_time, format = "%H"))
  df_access$access_min <- as.factor(format(df_access$access_time, format = "%M"))
  df_access$access_sec <- as.factor(format(df_access$access_time, format = "%S"))
  # since trying to train an identity function for anomaly detection using the
  # exact access_time is difficult, and doesn't really help, remove the access_time
  df_access <- subset(df_access, select = -c(access_time))
  # all the inputs need to be numeric or factor, so clean up the rest
  df_access$client_ip <- as.factor(df_access$client_ip)
  df_access$client_request <- as.factor(df_access$client_request)
  df_access$status_code <- as.factor(df_access$status_code)
  df_access$response_size <- as.numeric(df_access$response_size)
  df_access$referrer <- as.factor(df_access$referrer)
  df_access$user_agent <- as.factor(df_access$user_agent)
  # save off the munged data frame - don't want to do this pre-processing again
  b <- get_bucket(bucket = destBucket, region = "", use_https = F)
  s3write_using(df_access, FUN = saveRDS, object = destObject, bucket = b, 
                opts = list(use_https = FALSE, region = "", multipart = TRUE))
}

==== end PreProcessLogFile.R

Training the anomaly detection autoencoder

Once the data is organized and in a dataframe, we can use H2O to train a deep learning autoencoder fairly easily. The process is to read in the dataframe, split it for training and test, identify the predictors, and, finally, train the model. Once the model is trained it is saved back into the bin-models MinIO bucket for use in the workflow. How to accomplish this, plus what anomalies are and how an autoencoder detects them, was previously discussed in Anomaly Detection with R, H2O, and MinIO.

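==== TrainModel.R

# load the libraries and the pre-processing function described above
source("packages.R")
source("PreProcessLogFile.R")

# set the credentials this R instance uses to access MinIO
Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin", # enter your credentials
           "AWS_SECRET_ACCESS_KEY" = "minioadmin", # enter your credentials
           "AWS_DEFAULT_REGION" = "",
           "AWS_S3_ENDPOINT" = "") # enter your MinIO endpoint (host:port)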

# initialize the h2o server
# h2o.init(ip="", port=54321,startH2O=FALSE)
# h2o.set_s3_credentials("minioadmin", "minioadmin")
h2o.init(jvm_custom_args = "-Dsys.ai.h2o.persist.s3.endPoint= -Dsys.ai.h2o.persist.s3.enable.path.style=true")
h2o.set_s3_credentials("minioadmin", "minioadmin")

# turn the log chunk into a dataframe
bucketName <- "access-log-files"
objectName <- "access_sample.log"
destBucketName <- "access-log-dataframes"
destObjectName <- "access-log-dataframe.rda"
PreProcessLogFile(bucketName, objectName, destBucketName, destObjectName)

#load the dataframe that PreProcessLogFile() just wrote
b <- get_bucket(bucket = destBucketName, use_https = F, region ="")
df_access <-s3read_using(FUN = readRDS, object = destObjectName, bucket = b, 
                         opts = list(use_https = FALSE, region = ""))

# load into h2o and convert into h2o binary format
df_access.hex = as.h2o(df_access, destination_frame= "df_access.hex")

# split this dataframe into a train and test set
splits <- h2o.splitFrame(data = df_access.hex, 
                         ratios = c(0.6),  #partition data into 60%, 40%
                         seed = 1)  #setting a seed will guarantee reproducibility
train_hex <- splits[[1]]
test_hex <- splits[[2]]

#save the test set as an RDS object, going to use it in another step
#(an H2OFrame is a handle to data held in the H2O cluster, so convert it to an R dataframe first)
b <- get_bucket(bucket = 'access-log-dataframes', region = "", use_https = F)
s3write_using(as.data.frame(test_hex), FUN = saveRDS, object = "access-log-test.rda", bucket = b, 
              opts = list(use_https = FALSE, region = "", multipart = TRUE))

predictors <- c(1:13)

# use the training data to create a deeplearning based autoencoder model
# about 3 million tunable parameters with the factorization of the fields
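# NOTE: the remaining arguments were cut off in the source; training_frame and
# autoencoder = TRUE are the minimum needed, everything else uses H2O defaults
ae_model <- h2o.deeplearning(x = predictors,
                             training_frame = train_hex,
                             autoencoder = TRUE)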

# save the model as bin
model_path <- h2o.saveModel(ae_model, path = "s3://bin-models/apache-access-log-file-autoencoder-bin")

==== end TrainModel.R
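The comment in TrainModel.R about "about 3 million tunable parameters" follows from how categorical columns are expanded before they reach the network. The sketch below, in base R with made-up values, counts one input per factor level and one per numeric column. It is only an approximation of the idea: H2O's actual encoding may add further inputs (for example for missing values), and the hidden layer size of 50 is an arbitrary illustrative number, not the tutorial's setting.

```r
# A tiny stand-in for the pre-processed frame (values are made up).
df <- data.frame(
  client_ip     = factor(c("203.0.113.7", "203.0.113.8", "203.0.113.9", "203.0.113.7")),
  status_code   = factor(c("200", "404", "200", "200")),
  response_size = c(233, 217, 304, 30662),
  access_hour   = factor(c("13", "13", "14", "14"))
)

# Each factor contributes one input per level under one-hot encoding;
# each numeric column contributes a single input.
inputs_per_column <- vapply(
  df,
  function(col) if (is.factor(col)) nlevels(col) else 1L,
  integer(1)
)
print(inputs_per_column)

input_width <- sum(inputs_per_column)
print(input_width)

# Weights between the input layer and a first hidden layer of 50 units
# (50 is an arbitrary illustrative size, not the tutorial's setting).
hidden_units <- 50L
print(input_width * hidden_units)
```

On a real log, high-cardinality columns such as client_ip, client_request, and user_agent dominate this count, which is worth keeping in mind when deciding which columns to keep as predictors.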

Applying the anomaly detection autoencoder

Once we have a trained model we can apply the model to new log chunks as they arrive to identify anomalies. Below is a file called IdentifyAnomalies.R which contains the function that is applied.

==== IdentifyAnomalies.R

IdentifyAnomalies <- function(srcBucketName, srcObjectName, destBucketName, destObjectName, modelBucket, modelName) {
  # set the credentials this R instance uses to access MinIO
  Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin", # enter your credentials
             "AWS_SECRET_ACCESS_KEY" = "minioadmin", # enter your credentials
             "AWS_DEFAULT_REGION" = "",
             "AWS_S3_ENDPOINT" = "<MinIO-IP-Address>:9000") 
  # initialize the h2o server
  h2o.init(ip="<MinIO-IP-Address>", port=54321,startH2O=FALSE)
  h2o.set_s3_credentials("minioadmin", "minioadmin")
  # load the model that was previously saved
  # the name of the model needs to have been saved off somewhere so this specific model can be loaded
  model_path <- paste0("s3://",modelBucket,"/",modelName)
  ae_model <- h2o.loadModel(model_path)
  #load the previously pre-processed dataframe
  b <- get_bucket(bucket = srcBucketName, use_https = F, region ="")
  test_access <-s3read_using(FUN = readRDS, object = srcObjectName, bucket = b, 
                             opts = list(use_https = FALSE, region = ""))
  # load into h2o and convert into h2o binary format
  test_hex = as.h2o(test_access, destination_frame= paste0(srcObjectName,".hex"))
  # h2o.anomaly computes the per-row reconstruction error for the test data set
  # (passing it through the autoencoder model and computing mean square error (MSE) for each row)
  test_rec_error <- as.data.frame(h2o.anomaly(ae_model, test_hex)) 
  # return the rows of data at the given positions after sorting by ascending reconstruction error
  listAccesses <- function(data, rec_error, rows) {
    row_idx <- order(rec_error[,1],decreasing=F)[rows]
    my_data <- as.data.frame(data[row_idx,])
    my_data
  }
  # These are the biggest outliers
  num_rows = nrow(test_hex)
  beg <- floor(num_rows-20)
  end <- num_rows
  anomaly_df <- listAccesses(test_access, test_rec_error, c(beg:end))
  anomaly_df <- anomaly_df[,c("country_name","wday","access_hour")]
  # write this dataframe into the next bucket
  b <- get_bucket(bucket = destBucketName, region = "", use_https = F)
  s3write_using(anomaly_df, FUN = saveRDS, object = destObjectName, bucket = b, 
                opts = list(use_https = FALSE, region = "", multipart = TRUE))
}

==== end IdentifyAnomalies.R
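The selection of the biggest outliers can be tried without an H2O cluster. The following base R sketch uses made-up reconstruction errors and follows the same idea as listAccesses(): sort ascending and keep the last positions. The range from num_rows - 20 to num_rows is inclusive, which is why 21 rows come back. The column name Reconstruction.MSE and the id column are assumptions for illustration.

```r
# Made-up per-row reconstruction errors for 100 rows.
set.seed(42)
rec_error <- data.frame(Reconstruction.MSE = runif(100))
rows_df   <- data.frame(id = seq_len(100))

n_top <- 21
num_rows <- nrow(rec_error)

# Same idea as listAccesses(): sort ascending, keep the last n_top positions.
row_idx <- order(rec_error[, 1], decreasing = FALSE)[(num_rows - n_top + 1):num_rows]
top_rows <- rows_df[row_idx, , drop = FALSE]
top_rows$rec_error <- rec_error[row_idx, 1]

# Equivalent selection written the other way round:
# sort descending, keep the first n_top positions.
alt_idx <- head(order(rec_error[, 1], decreasing = TRUE), n_top)

print(nrow(top_rows))
print(setequal(row_idx, alt_idx))
```

A fixed count is a simple starting point. An error threshold, for example a high quantile of the training-set reconstruction error, is another common way to decide which rows to flag.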

Building the inference workflow using a webhook

MinIO Lambda Compute Bucket Notification is used to build an event-driven workflow. When a log file arrives, an event is created. These events trigger the progression of the artifacts through the inference processing for new log file chunks. MinIO Lambda Compute Bucket Notification can be integrated with many notification mechanisms. For this tutorial we will use a webhook.

R supports creating a RESTful web interface using the Plumber package. To use Plumber, a file needs to define the access method/endpoint tuples and the functional code to execute for each of the endpoints. Below is the Plumber.R file for this tutorial. It sources the two files previously described, which contain the functions PreProcessLogFile() and IdentifyAnomalies().  Inside the endpoint handler is a commented example of a JSON bucket notification event.

Plumber takes this annotated R script and turns it into an executable web server.  The endpoints are defined in the Plumber.R file below.

==== Plumber.R

# Plumber.R

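# Each of these has a function used below
source("PreProcessLogFile.R")
source("IdentifyAnomalies.R")

#* Log some information about the incoming request
#* @filter logger
function(req) {
  cat(as.character(Sys.time()), "-",
      req$REQUEST_METHOD, req$PATH_INFO, "-",
      req$HTTP_USER_AGENT, "@", req$REMOTE_ADDR, "\n")
  # pass the request on to the next handler
  plumber::forward()
}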

#* endpoint handler for the MinIO bucket notifications
#* @post /
function(req) {
  json_text <- req$postBody
  # example of the JSON bucket notification event received in the request body:
  # {
  #   "EventName": "s3:ObjectCreated:Put",
  #   "Key": "access-log-files/access_sample_short.log",
  #   "Records": [
  #     {
  #       "eventVersion": "2.0",
  #       "eventSource": "minio:s3",
  #       "awsRegion": "",
  #       "eventTime": "2022-08-10T18:19:38.663Z",
  #       "eventName": "s3:ObjectCreated:Put",
  #       "userIdentity": {
  #         "principalId": "minioadmin"
  #       },
  #       "requestParameters": {
  #         "principalId": "minioadmin",
  #         "region": "",
  #         "sourceIPAddress": "<MinIO-IP-Address>"
  #       },
  #       "responseElements": {
  #         "content-length": "0",
  #         "x-amz-request-id": "170A0EB32509842A",
  #         "x-minio-deployment-id": "e88d6f13-657f-4641-b349-74ce2795d730",
  #         "x-minio-origin-endpoint": "<MinIO-IP-Address>:9000"
  #       },
  #       "s3": {
  #         "s3SchemaVersion": "1.0",
  #         "configurationId": "Config",
  #         "bucket": {
  #           "name": "access-log-files",
  #           "ownerIdentity": {
  #             "principalId": "minioadmin"
  #           },
  #           "arn": "arn:aws:s3:::access-log-files"
  #         },
  #         "object": {
  #           "key": "access_sample_short.log",
  #           "size": 1939,
  #           "eTag": "bb48fe358c017940ecc5fb7392357641",
  #           "contentType": "application/octet-stream",
  #           "userMetadata": {
  #             "content-type": "application/octet-stream"
  #           },
  #           "sequencer": "170A0EB3F219CCC4"
  #         }
  #       },
  #       "source": {
  #         "host": "<MinIO-IP-Address>",
  #         "port": "",
  #         "userAgent": "MinIO (linux; amd64) minio-go/v7.0.34"
  #       }
  #     }
  #   ]
  # }
  # extract the raw JSON into a data structure
  j <- fromJSON(json_text, flatten = TRUE)
  # get the eventName, the bucketName, and the objectName
  # for this example we know it's a put so we can ignore the eventName
  eventName <- j[["EventName"]]
  bucketName <- j[["Records"]]$s3.bucket.name
  objectName <- j[["Records"]]$s3.object.key
  destBucketName <- "access-log-dataframes"
  destObjectName <- "access-log-dataframe.rda"
  # turn the log chunk into a dataframe
  PreProcessLogFile(bucketName, objectName, destBucketName, destObjectName)
  # the destination for this step becomes the src for the next
  # rename them just to maintain sanity
  srcBucketName <- destBucketName
  srcObjectName <- destObjectName
  destBucketName <- "access-log-anomaly-dataframes"
  destObjectName <- "access-log-anomaly-dataframe.rda"
  modelBucketName <- "bin-models/apache-access-log-file-autoencoder-bin"
  # replace the placeholder with the name of the model saved by TrainModel.R
  modelName <- "<The-Name-Of-The-Built-Model>"
  # use the trained model and identify anomalies in the log chunk that just arrived
  IdentifyAnomalies(srcBucketName, srcObjectName, destBucketName, destObjectName, modelBucketName, modelName)
}

==== end Plumber.R
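The way the handler pulls the bucket and object names out of the event can be tried on its own. This sketch uses a trimmed-down copy of the notification shown in the comments above, keeping only the fields the handler reads, and assumes the jsonlite package is installed.

```r
library(jsonlite)

# A trimmed-down bucket notification with only the fields used by the handler.
json_text <- '{
  "EventName": "s3:ObjectCreated:Put",
  "Key": "access-log-files/access_sample_short.log",
  "Records": [
    {
      "eventName": "s3:ObjectCreated:Put",
      "s3": {
        "bucket": { "name": "access-log-files" },
        "object": { "key": "access_sample_short.log", "size": 1939 }
      }
    }
  ]
}'

j <- fromJSON(json_text, flatten = TRUE)

# With flatten = TRUE, nested objects inside Records become dot-separated columns.
print(names(j$Records))

bucket_name <- j[["Records"]]$s3.bucket.name
object_name <- j[["Records"]]$s3.object.key
print(c(bucket_name, object_name))
```

Records is parsed into a data frame with one row per record, so if a notification ever carries more than one record these two values become vectors and the handler would need to loop over them.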

Once we have the file that defines the endpoints for this RESTful web interface, we need to create a file that parses the annotated script and starts the server listening. We’ll create the file Server.R to accomplish this.

==== Server.R
# the REST endpoint server

#source the required packages libraries
source("packages.R")

# process the Plumber.R file and show the valid endpoints
root <- pr("Plumber.R")

# make the endpoints active
root %>% pr_run(host = "<REST-endpoint-IP-Address>", port = 8806)

==== end Server.R

We’ve configured R to act as a RESTful web server that is notified of bucket events via webhook. Start the web server by running Server.R in RStudio before configuring MinIO, because MinIO verifies that the web server exists and is running when the webhook is configured. Once the server is running we can configure MinIO to use the webhook.

Configuring Lambda Compute Bucket Notification with a webhook

There are two steps to configuring MinIO to notify. The first is to configure the endpoint in the MinIO cluster as a destination for notifications, which is accomplished using the MinIO mc client. In my case, the webhook listener is running on my laptop, at the IP address used in Server.R, on port 8806. Please adjust the endpoint value for your environment.

mc admin config set myminio notify_webhook:preProcessLogFiles queue_limit="0"  queue_dir="" endpoint=""

You will need to restart the MinIO cluster after setting this configuration, for example with mc admin service restart myminio.

The second step is to configure the specifics of when MinIO should notify. Here notification happens when an object is put into the myminio/access-log-files bucket.

mc event add myminio/access-log-files arn:minio:sqs::preProcessLogFiles:webhook --event put

We have laid the groundwork for the workflow. When a log file is PUT into the access-log-files bucket, the code above is triggered to transform the log file into a dataframe and then the trained model is applied to identify anomalies in the http requests.

Event-driven ML anomaly workflow in action

Next, we’ll copy a log file into the access-log-files bucket to trigger the workflow.

I’ve created a small(er) log file of only 300k lines from the overall sample file. When I PUT it into the bucket, MinIO sends a notification event that kicks off the anomaly detection workflow.  

The IdentifyAnomalies() function creates and saves a dataframe to hold the anomalies - those instances with the highest reconstruction error.  When I copied the smaller sample log file to the access-log-files bucket it kicked off the workflow.  The result is that the anomaly_df dataframe was written to the access-log-anomaly-dataframes bucket.  In a production workflow the arrival of this dataframe might trigger another Lambda Compute Bucket Notification to further process its contents, such as adding these rows to a system where the requests can be examined in more detail.

Below are the 21 requests with the highest reconstruction error based on the trained DeepLearning Autoencoder. I reduced the number of columns so they could be easily examined. Remember that being an anomaly only indicates that the instance being considered exists in the input vector space at a distance from the training data that was used to train the anomaly detection autoencoder. Therefore it’s extremely important that the autoencoder be trained with training instances that represent the range of values that is considered normal. These instances with high reconstruction error should be examined further to determine whether they are an issue.

Although one wouldn't do this in a production workflow, we can also examine these visually to see if any patterns are evident using the script below:

==== PlotAnomalies.R

# load the libraries this script needs
library(aws.s3)
library(ggplot2)

# set the credentials this R instance uses to access MinIO
Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin", # enter your credentials
           "AWS_SECRET_ACCESS_KEY" = "minioadmin", # enter your credentials
           "AWS_DEFAULT_REGION" = "",
           "AWS_S3_ENDPOINT" = "<MinIO-IP-Address>:9000") 

b <- get_bucket(bucket = "access-log-anomaly-dataframes", region = "", use_https = F)
df <- s3read_using(FUN = readRDS, object = "access-log-anomaly-dataframe.rda", bucket = b, 
              opts = list(use_https = FALSE, region = ""))

# now lets look at the results
df$access_time <- (df$wday*24) +  as.numeric(as.character(df$access_hour))

jitter <- position_jitter(width = 0.2, height = 0.2)
p<-ggplot() +
  layer(data = df,
        stat = "identity",
        geom = "point",
        mapping = aes(x = country_name, y = access_time, color = "red"),
        position = jitter) +
  theme(axis.text.x = element_text(angle = 90))

# display the plot
print(p)

==== end PlotAnomalies.R
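The derived access_time value used on the y-axis can be checked in isolation. The sketch below uses two made-up rows in the shape that IdentifyAnomalies() writes out (the country names and values are hypothetical) and shows why the script converts access_hour through as.character() first.

```r
# Two made-up anomaly rows in the shape IdentifyAnomalies() writes out.
df <- data.frame(
  country_name = c("Austria", "Germany"),
  wday         = c(7, 2),
  access_hour  = factor(c("14", "03"))
)

# Converting a factor directly returns the internal level codes, not the hours.
print(as.numeric(df$access_hour))

# Going through character first recovers the actual hour values.
hours <- as.numeric(as.character(df$access_hour))
df$access_time <- (df$wday * 24) + hours
print(df$access_time)
```

The resulting value is simply an hour-of-week style index, so requests on the same weekday at similar hours land close together on the plot.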

When the entries are graphed by country and access_time (here access_time is wday * 24 + access_hour) we start to see some clustering.

We’re at the end of the tutorial, but it is important to understand this wouldn’t be the end of a log file analysis workflow. This blog post covered taking a raw log file, transforming it into a useful dataframe, then training an anomaly detection deep learning autoencoder and saving off the trained model. We then used the log preprocessing code and applied the trained model for inference against new log chunks utilizing MinIO Lambda Compute Bucket Notification to drive the workflow.

Getting started with a workflow for log file anomaly detection

Starting with the right tools simplifies building ML data pipelines, decreasing the time and effort required to gain insight from raw data.  

The technologies highlighted in this blog post — R, H2O and MinIO — form a powerful, flexible and speedy ML toolbox. This tutorial provided an example of using anomaly detection for a standardized log file format. Processing the log file data necessarily involves pre-processing the raw data in reasonable ways — and the definition of “reasonable” continues to evolve as this is an ongoing area of research. MinIO serves as a powerful foundational tool for event-driven data pre-processing and ML data pipelines no matter how you define reasonable in your organization.

Download MinIO and build your ML toolkit today. If you have any questions, please send us an email at hello@min.io, or join the MinIO slack channel and ask away.
