MinIO as ElasticSearch Frozen Tier

To balance performance, capacity and cost, Elasticsearch has historically supported several tiers: hot, warm and cold. The hot tier holds the newest, most actively used data on the fastest storage available, typically NVMe. The warm tier trades a little speed for cost, placing data on slower hardware such as HDD. The cold tier is generally local storage used to house a single replica of data that doesn't need to be accessed as often, which conserves storage capacity by avoiding multiple copies of the same data.

What the hot, warm and cold tiers have in common is that each keeps at least one local copy of the data. Beyond these, Elasticsearch also supports the frozen tier. The frozen tier differs from the other three in that it stores no data locally: everything in this tier lives entirely in an object store such as MinIO. Once data is stored in MinIO, you can use the Elasticsearch mount API to partially mount the frozen tier data and make it searchable.

Why is it important to design your Elasticsearch deployment around multiple tiers? Why not have a single tier and simply keep adding new data and removing old data?

The short answer is that we deploy Elasticsearch to help us troubleshoot systems, so we need to design deployments that mirror our workflow and enable rapid diagnosis and correction. As a DevOps engineer, one of the most valuable assets at your disposal is the volume of logs and metrics you have on hand for correlation. By examining historical data, you can see trends, find patterns and act accordingly. If a service misbehaves seemingly at random, you will want to go back as far as possible to identify a pattern and uncover the cause. However, storing two years' worth of searchable data on fast storage (such as NVMe) takes up a lot of costly drive space. Data from the last few weeks is generally the most searched by folks in the organization, so it is stored on the fastest hardware. As data ages, it becomes less useful for immediate troubleshooting and doesn't need to be searched as quickly as the rest.

As a DevOps engineer, you might tier off your data as follows:

| Tier   | Age        | Intended Usage     | Storage Technology |
|--------|------------|--------------------|--------------------|
| Hot    | 0-1 Month  | Latest queries     | NVMe Drives        |
| Warm   | 1-3 Months | Minimal queries    | SSD Drives         |
| Cold   | 3-6 Months | Infrequent queries | Spindle HDDs       |
| Frozen | > 6 Months | Historical queries | Object Store       |

MinIO is frequently used to store Elasticsearch snapshots, and if you use MinIO to store your Elasticsearch frozen tier, you can use the same MinIO API you already know and love to search these snapshots. Elasticsearch is more efficient when used with storage tiering, which decreases its total cost of ownership. Plus, you get the added benefits of writing data to MinIO that is immutable, versioned and protected by erasure coding. In addition, using Elasticsearch tiering with MinIO object storage makes data files available to other cloud native machine learning and analytics applications.

MinIO is the perfect companion for Elasticsearch because of its industry-leading performance and scalability, which puts every data-intensive workload, not just Elasticsearch, within reach. MinIO is capable of tremendous performance - a recent benchmark achieved 325 GiB/s (349 GB/s) on GETs and 165 GiB/s (177 GB/s) on PUTs with just 32 nodes of off-the-shelf NVMe SSDs.

In addition to speed and scalability, MinIO supports these enterprise features right out of the box, which makes it easier to run MinIO as a frozen tier backend.

  • Secure Access ACLs and PBAC: Supports IAM S3-style policies with a built-in IDP; see MinIO Best Practices - Security and Access Control for more information.
  • Tiering: For data that doesn’t get accessed as often, you can transition it to another, colder MinIO deployment, so your best hardware is reserved for the latest data rather than taken up by unused data.
  • Object Locking and Retention: MinIO supports object locking (retention), which enforces write once, read many operations for both duration-based retention and indefinite legal holds. This enables key data retention compliance and meets SEC 17a-4(f), FINRA 4511(c), and CFTC 1.31(c)-(d) requirements.
  • MinIO’s Key Encryption Service (KES) brings this all together. SSE uses KES and Vault to perform cryptographic operations. The KES service itself is stateless, acting as a middle layer that stores its data in Vault. With MinIO you can set up granular, customizable levels of encryption. You can always choose to encrypt on a per-object basis; however, we strongly recommend setting up SSE-KMS encryption automatically on buckets so that all objects are encrypted by default. Encryption is performed with a specific External Key (EK) stored in Vault, which can be overridden on a per-object basis with a unique key.
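
As an illustration, default bucket encryption can be enabled with the mc client. This assumes a KMS backend is already wired up through KES, along with an alias `minio` and bucket `esfrozentier` like the ones created later in this walkthrough; the key name `minio-es-key` is a hypothetical key ID that must already exist in your KMS.

```shell
# Enable SSE-KMS encryption by default on the frozen tier bucket
# ("minio-es-key" is a placeholder key ID from the KMS)
mc encrypt set sse-kms minio-es-key minio/esfrozentier

# Confirm the bucket's default encryption configuration
mc encrypt info minio/esfrozentier
```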

By saving the Elasticsearch frozen tier in MinIO, we gain visibility into historical patterns through Kibana's graphical interface, where you can run further analysis and even alert on certain thresholds. For example, you might check for performance trends or bottlenecks and try to identify patterns by workload type or time of day. You can explore all the historical data you need without manually restoring and deleting data from your Elasticsearch cluster.

When configuring different tiers in Elasticsearch, you will sacrifice a bit of performance when searching the frozen tier because none of its data is stored locally. It is therefore important to manage your data and place it in the appropriate tiers based on your specific organizational needs. Elasticsearch allows Kibana to query the frozen tier in the background and display results as they become available. Frozen tier queries are also precise: Elasticsearch pulls down only the index data needed to answer the query.

Next, we’ll show you how to configure Elasticsearch to roll data over to a frozen tier stored in MinIO using an Index Lifecycle Management (ILM) policy.

Setting Up the Elasticsearch Frozen Tier

To set up the frozen tier, we'll install MinIO in distributed mode so you have another layer of redundancy for your data, and we'll install an Elasticsearch node as well. We'll perform all the below commands as the root user.

MinIO

Download and install the latest version of MinIO from the archive.

wget https://dl.min.io/server/minio/release/linux-amd64/archive/minio_20221212192727.0.0_amd64.deb -O minio.deb
dpkg -i minio.deb

Create a user and group for the MinIO service to run as

groupadd -r minio-user
useradd -M -r -g minio-user minio-user

Attach a drive, then format and mount the volume

parted /dev/xvdb mklabel msdos
parted -a opt /dev/xvdb mkpart primary ext4 0% 100%
mkfs.ext4 -L minio-data /dev/xvdb1
mkdir -p /mnt/minio/data
echo "LABEL=minio-data /mnt/minio/data ext4 defaults 0 2" >> /etc/fstab
mount -a

Ensure the ownership of the mounted directory matches the user and group created earlier

chown minio-user:minio-user /mnt/minio/data

Replicate this node configuration to 3 more nodes, totaling 4 nodes.

Once all 4 nodes are configured, set up /etc/default/minio on each as follows

echo "MINIO_VOLUMES=\"http://server-{1...4}.minio.local:9000/mnt/minio/data\"" >> /etc/default/minio
echo "MINIO_ROOT_USER=\"minioadmin\"" >> /etc/default/minio
echo "MINIO_ROOT_PASSWORD=\"minioadmin\"" >> /etc/default/minio

Using something like Ansible, start MinIO on all 4 nodes

systemctl start minio.service

Create a bucket in MinIO in which to store the frozen tier

mc alias set minio http://<minio_ip>:9000 minioadmin minioadmin
mc mb minio/esfrozentier
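
Optionally, confirm the distributed deployment is healthy before moving on. Using the `minio` alias set above, `mc admin info` reports each server's status, drive counts and erasure-set layout:

```shell
# Show server status and drive health for the deployment
mc admin info minio
```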

ElasticSearch

Add the Elasticsearch package repository

#  wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
#  sudo apt-get install apt-transport-https jq
#  echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-8.x.list

Install ElasticSearch and its related packages. Also set your PATH to include the ElasticSearch bin directory

# sudo apt-get update
# sudo apt-get install elasticsearch
# export PATH=/usr/share/elasticsearch/bin:$PATH

Open the ElasticSearch configuration file to update a few settings

vim /etc/elasticsearch/elasticsearch.yml

Add the following settings:

  • Disable xpack security (be sure to enable this in production)
  • Set the node role to frozen tier
xpack.security.enabled: false
node.roles: ["data_frozen"]

Save the configuration file. For more information on the different types of tiers, be sure to check this ElasticSearch blog post about data lifecycle management.

Start the ElasticSearch service

#  sudo systemctl daemon-reload
#  systemctl start elasticsearch.service

It will take a minute or two for the Elasticsearch service to start. Once it has started, verify it is working as expected

# curl -s http://localhost:9200/_cluster/health | jq .
{
    "cluster_name": "elasticsearch",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 4,
    "active_shards": 4,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 100
}
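
You can also confirm the node is advertising the frozen tier role we set in elasticsearch.yml. In the `_cat/nodes` output, the abbreviated role letter `f` denotes `data_frozen`:

```shell
# List nodes with their role abbreviations ("f" indicates data_frozen)
curl -s "http://localhost:9200/_cat/nodes?v&h=name,node.role"
```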

To configure S3 support in Elasticsearch, we need to install the repository-s3 plugin and restart Elasticsearch

# elasticsearch-plugin install --batch repository-s3
# systemctl restart elasticsearch

Configure the ElasticSearch keystore with the MinIO cluster admin credentials

#  echo "minioadmin" |  elasticsearch-keystore add --stdin --force s3.client.default.access_key
#  echo "minioadmin" |  elasticsearch-keystore add --stdin --force s3.client.default.secret_key

Register a _snapshot repository backed by our MinIO frozen tier bucket

curl -k -X PUT http://localhost:9200/_snapshot/frozentier_minio_repository?pretty -H 'Content-Type: application/json' -d '{
  "type": "s3",
  "settings": {
    "bucket": "esfrozentier",
    "endpoint": "http://<minio_ip>:9000",
    "protocol": "http",
    "path_style_access": "true",
    "max_restore_bytes_per_sec": "5gb",
    "max_snapshot_bytes_per_sec": "300mb"
  }
}'
{
  "acknowledged" : true
}

Verify the snapshot repository has been configured correctly

# curl -k -X POST http://localhost:9200/_snapshot/frozentier_minio_repository/_verify?pretty
{
  "nodes" : {
    "G_XlUdlKQr-sP_4QQ6hPtA" : {
      "name" : "ip-10-0-25-41"
    }
  }
}

Now let's configure an ILM policy called minio_frozen to roll data over from the hot tier to the frozen tier based on specific parameters. We’re omitting the warm tier for simplicity. In this policy, an index rolls over once it is 30 days old or its primary shard reaches 1 TB, whichever happens first; 24 hours after rollover it moves to the frozen tier as a searchable snapshot in MinIO, and 60 days after rollover it is deleted.

# curl -k -X PUT http://localhost:9200/_ilm/policy/minio_frozen?pretty -H 'Content-Type: application/json' -d '{
  "policy": {
    "phases" : {
      "hot" : {
        "actions" : {
          "rollover" : {
            "max_age" : "30d",
            "max_primary_shard_size" : "1tb"
          },
          "forcemerge" : {
            "max_num_segments" : 1
          }
        }
      },
      "frozen" : {
        "min_age" : "24h",
        "actions" : {
          "searchable_snapshot": {
            "snapshot_repository" : "frozentier_minio_repository"
          }
        }
      },
      "delete" : {
        "min_age" : "60d",
        "actions" : {
          "delete" : { }
        }
      }
    }
  }
}'
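
For the rollover action to fire, the policy must be attached to indices as they are created. One way to wire that up, sketched here with a hypothetical index pattern `logs-minio-*` and rollover alias `logs-minio`, is an index template plus a bootstrap index:

```shell
# Create an index template that applies the ILM policy and rollover alias
# to every index matching the (hypothetical) pattern logs-minio-*
curl -k -X PUT http://localhost:9200/_index_template/minio_frozen_template?pretty \
  -H 'Content-Type: application/json' -d '{
  "index_patterns": ["logs-minio-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "minio_frozen",
      "index.lifecycle.rollover_alias": "logs-minio"
    }
  }
}'

# Bootstrap the first index and mark it as the write index for the alias
curl -k -X PUT http://localhost:9200/logs-minio-000001?pretty \
  -H 'Content-Type: application/json' -d '{
  "aliases": { "logs-minio": { "is_write_index": true } }
}'
```

From here, new documents are written through the `logs-minio` alias and ILM handles rollover, freezing and deletion automatically.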

Once the data is in the frozen tier, you can restore it to a hot tier in the same or another cluster. However, the recommended approach is to mount it instead, so that you don’t make multiple copies of the data and can expose the same data to multiple Elasticsearch clusters.

curl -k -X POST http://localhost:9200/_snapshot/frozentier_minio_repository/partial-<snapshot>/_mount?wait_for_completion=true -H 'Content-Type: application/json' -d '{
  "index": "my_docs",
  "renamed_index": "docs",
  "index_settings": {
    "index.number_of_replicas": 0
  },
  "ignore_index_settings": [ "index.refresh_interval" ]
}'
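
Once mounted, the data behaves like any other index. A quick look at the renamed index (here `docs`, per the rename above) confirms it is reachable and searchable:

```shell
# Confirm the mounted index exists and run a small search against it
curl -s "http://localhost:9200/_cat/indices/docs?v"
curl -s "http://localhost:9200/docs/_search?size=1&pretty"
```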

There you have it. As you can see, it’s pretty straightforward to configure and use the Elasticsearch frozen tier with MinIO, letting you manage data in a predictable and sustainable manner.

Final Thoughts

Combining MinIO and Elasticsearch makes troubleshooting and log analysis faster and easier. By leveraging MinIO as the frozen tier backend, you can be cloud agnostic in your deployments. Whether it's on-prem, in the cloud or a hybrid model, you can rest assured that your team can leverage MinIO wherever you need it based on your tiering needs. Rounding out the tiering picture, MinIO itself supports further tiering, where you can siphon off old data to something like Amazon Glacier using Object Lifecycle Management.
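
MinIO's own lifecycle tiering might look like the following sketch. The tier name `GLACIER-TIER`, endpoint, credentials and bucket are all placeholders, and exact flag names vary by mc release, so check `mc admin tier add --help` and `mc ilm --help` for your version:

```shell
# Register a remote S3-compatible tier named GLACIER-TIER
# (endpoint, credentials and bucket below are placeholders)
mc admin tier add s3 minio GLACIER-TIER \
  --endpoint https://s3.amazonaws.com \
  --access-key PLACEHOLDERKEY \
  --secret-key PLACEHOLDERSECRET \
  --bucket my-archive-bucket \
  --storage-class GLACIER

# Transition objects older than 180 days to the remote tier
mc ilm add minio/esfrozentier --transition-days 180 --transition-tier GLACIER-TIER
```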

By not keeping all the data in Elasticsearch, you can decrease required storage capacity and realize considerable savings on data storage costs, while at the same time running MinIO on commodity hardware to get the best possible performance-to-cost ratio. MinIO makes your frozen tier “hot” with industry-leading performance that makes querying the Elasticsearch frozen tier much faster than any other data store at this cost.

If you’ve set up a frozen tier backed by MinIO or have any questions regarding setting it up be sure to reach out to us on Slack!
