MinIO Blog

Enhance Search with AIStor and OpenSearch

AJ on AIStor

In this post we look at how search, and specifically OpenSearch, can help us identify patterns and trends in our ever-growing data. For example, if you have a service that misbehaves seemingly at random, you want to go back as far as possible through your operational data to identify a pattern and uncover the cause. This applies not just to applications: a plethora of logs from every device imaginable needs to be kept around for a reasonable amount of time for compliance and troubleshooting purposes. However, storing several months' or years' worth of searchable data on fast storage (such as NVMe) can consume a lot of costly drive space. Data from the last few weeks is generally searched the most, so it is stored on the fastest hardware. But as data ages, it becomes less useful for immediate troubleshooting and doesn't need to sit on expensive hardware, even if it still holds a secret or two.

The question becomes: how do we search this archival data quickly without sacrificing performance? Is it possible to have the best of both worlds?

Enter OpenSearch, an Apache Lucene-based distributed search and analytics engine. Once data is added to OpenSearch indices, you can perform full-text searches on it. Any application that requires search is a potential use case for OpenSearch: you could build a search feature into your application, DevOps engineers can put OpenSearch to work as a log analytics engine, and backend engineers can send trace data via collectors such as OpenTelemetry to get better insight into app performance. With built-in, feature-rich search and visualization, you can pinpoint infrastructure issues such as disks running out of space or spikes in error status codes and surface them in dashboards before they wreak havoc on operations.

However, as log data grows, it is sometimes not practical to keep all of it on one node or even a single cluster. We like OpenSearch because it has a distributed design, not unlike AIStor, storing data and processing requests in parallel. AIStor is easy to get up and running from a single binary: you can start on your laptop with a single-node, single-drive configuration and grow into production with multiple drives, nodes and sites, with the same feature set you've come to expect from enterprise-grade storage software. Not only can you build a distributed OpenSearch cluster, you can also subdivide the responsibilities of its nodes as the cluster grows: nodes with large disks to store data, nodes with plenty of RAM for indexing, and nodes with ample CPU but little disk to manage cluster state.
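Role assignment happens in each node's `opensearch.yml`. Here is a hypothetical sketch of how such a split might look (`data`, `ingest` and `cluster_manager` are built-in OpenSearch roles; the node names are made up for illustration):

# opensearch.yml on a node with large disks, dedicated to storing data
node.name: data-node-1
node.roles: [ data ]

# opensearch.yml on a node with plenty of RAM, dedicated to ingest pipelines
node.name: ingest-node-1
node.roles: [ ingest ]

# opensearch.yml on a CPU-heavy node that manages cluster state
node.name: manager-node-1
node.roles: [ cluster_manager ]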

As your data grows, you can tier off older, archival data to an AIStor bucket so SSD/NVMe storage is reserved for the latest data being added to the cluster. Moreover, you can search these snapshots directly while they are stored in the AIStor bucket. When moving snapshots, it's important to remember that accessing data on remote drives is slower than accessing it locally, so somewhat higher latencies on search queries are expected, but the increased storage efficiency is frequently worth it. With AIStor, you are getting the fastest network object storage available; distributed AIStor on a high-speed network can outperform local storage. Unlike slower object stores, AIStor lets OpenSearch access these snapshots as if they were local to the cluster, returning insights from tiered log data in seconds (depending on the size of the result) and saving valuable time, network bandwidth and team effort. Compare this to a local restore, which can take hours before the data can even be queried.

OpenSearch is more efficient when used with storage tiering, which decreases its total cost of ownership, plus you get the added benefit of writing data to AIStor, where it is immutable, versioned and protected by erasure coding. In addition, tiering OpenSearch data to AIStor object storage makes the data files available to other cloud-native machine learning and analytics applications.

Infrastructure

Let's set up OpenSearch and AIStor using Docker and go through some of the features to showcase their capabilities.

OpenSearch

We’ll build a custom Docker image that installs the `repository-s3` plugin and preloads the keystore with the credentials OpenSearch needs to connect to our AIStor object store.

# Start from the official OpenSearch image
FROM opensearchproject/opensearch:2.8.0

# AIStor connection details, baked in at build time for this demo;
# replace with your own credentials and endpoint
ENV MINIO_ACCESS_KEY_ID minioadmin
ENV MINIO_SECRET_ACCESS_KEY minioadmin
ENV MINIO_ENDPOINT minio:9000
ENV MINIO_PROTOCOL http

# Install the S3 repository plugin and create the keystore
RUN /usr/share/opensearch/bin/opensearch-plugin install --batch repository-s3
RUN /usr/share/opensearch/bin/opensearch-keystore create

# Store the AIStor credentials and endpoint for the default S3 client
RUN echo $MINIO_ACCESS_KEY_ID | /usr/share/opensearch/bin/opensearch-keystore add --stdin s3.client.default.access_key
RUN echo $MINIO_SECRET_ACCESS_KEY | /usr/share/opensearch/bin/opensearch-keystore add --stdin s3.client.default.secret_key
RUN echo $MINIO_ENDPOINT | /usr/share/opensearch/bin/opensearch-keystore add --stdin s3.client.default.endpoint
RUN echo $MINIO_PROTOCOL | /usr/share/opensearch/bin/opensearch-keystore add --stdin s3.client.default.protocol


Build the custom Docker image

docker build --tag=opensearch-minio .

Run the following Docker command to spin up a container from the image we just built

docker run -d -p 9200:9200 -p 9600:9600 -v /usr/share/opensearch/data -e "discovery.type=single-node" opensearch-minio

Curl to localhost port 9200 with the default `admin` credentials to verify that OpenSearch is running

curl https://localhost:9200 -ku 'admin:admin'

You should see output similar to the following

{
  "name" : "a937e018cee5",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "GLAjAG6bTeWErFUy_d-CLw",
  "version" : {
    "distribution" : "opensearch",
    "number" : <version>,
    "build_type" : <build-type>,
    "build_hash" : <build-hash>,
    "build_date" : <build-date>,
    "build_snapshot" : false,
    "lucene_version" : <lucene-version>,
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

We can check the container status as well

$ docker container ls
CONTAINER ID   IMAGE                                 COMMAND                  CREATED          STATUS          PORTS                                                                NAMES
a937e018cee5   opensearch-minio   "./opensearch-docker..."   19 minutes ago   Up 19 minutes   0.0.0.0:9200->9200/tcp, 9300/tcp, 0.0.0.0:9600->9600/tcp, 9650/tcp   stupendous_burt

Set Up Repository for AIStor

To set up a searchable snapshot index, you need a few prerequisites and configuration changes on your OpenSearch cluster. We’ll go through them in detail here.

In `opensearch.yml`, name the node and assign it the `search` role

node.name: snapshots-node
node.roles: [ search ]
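After restarting the node with this configuration, you can sanity-check the role assignment with the `_cat/nodes` API (again assuming the default `admin` credentials):

curl -XGET "https://localhost:9200/_cat/nodes?v" -ku 'admin:admin'

The `node.role` column in the output lists each node's assigned roles in abbreviated form.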


Let’s register the AIStor bucket as a snapshot repository using the `_snapshot` API

curl -XPUT "http://localhost:9200/_snapshot/my-minio-repository" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "testbucket123",
    "base_path": "my/snapshot/directory"
  }
}'
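To double-check the registration, you can read the repository definition back, and optionally ask OpenSearch to verify that it can write to the bucket:

curl -XGET "https://localhost:9200/_snapshot/my-minio-repository" -ku 'admin:admin'
curl -XPOST "https://localhost:9200/_snapshot/my-minio-repository/_verify" -ku 'admin:admin'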


Now that we have the repository, let’s go ahead and create a searchable snapshot.

Searchable Snapshot

To take a snapshot, we call the API with the repository we created earlier.

curl -XPUT "http://localhost:9200/_snapshot/my-minio-repository/1"
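By default this call returns as soon as the snapshot is initiated. If you'd rather block until it completes, for example in a backup script, you can add the `wait_for_completion` query parameter:

curl -XPUT "https://localhost:9200/_snapshot/my-minio-repository/1?wait_for_completion=true" -ku 'admin:admin'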

Let’s check the status of the snapshot

curl -XGET "http://localhost:9200/_snapshot/my-minio-repository/1"
{
  "snapshots": [{
    "snapshot": "1",
    "version": "6.5.4",
    "indices": [
      "opensearch_dashboards_sample_data_ecommerce",
      "my-index",
      "opensearch_dashboards_sample_data_logs",
      "opensearch_dashboards_sample_data_flights"
    ],
    "include_global_state": true,
    "state": "IN_PROGRESS",
    ...
  }]
}
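Since the snapshot above is still `IN_PROGRESS`, you can poll the `_status` endpoint until the state reaches `SUCCESS`:

curl -XGET "https://localhost:9200/_snapshot/my-minio-repository/1/_status" -ku 'admin:admin'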


Let’s see this snapshot on the AIStor side as well

root@aj-test-1:~# mc ls testbucket123/my/snapshot/directory
[2023-06-09 17:37:31 UTC] 1.5KiB STANDARD 1/

As you can see above, there is a copy of the snapshot we took using the OpenSearch API in the AIStor bucket.
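If you're curious about the snapshot's internal layout, listing the prefix recursively shows the metadata and index files the `repository-s3` plugin wrote:

mc ls --recursive testbucket123/my/snapshot/directory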

Now you might be wondering: we’ve taken the snapshot, but how do we restore it so we can analyze and search the backed-up indices? Instead of restoring the entire snapshot in the traditional sense (which is very much possible with OpenSearch), we’ll show you how to search the snapshot more efficiently while it’s still stored on AIStor.

The most important configuration change is setting `storage_type` to `remote_snapshot`. This setting tells OpenSearch whether the snapshot will be restored locally to be searched or searched remotely while it stays on AIStor.

curl -XPOST "http://localhost:9200/_snapshot/my-minio-repository/1/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "opensearch-dashboards*,my-index*",
  "ignore_unavailable": true,
  "include_global_state": false,
  "include_aliases": false,
  "partial": false,

  "storage_type": "remote_snapshot",
  "rename_pattern": "opensearch-dashboards(.+)",
  "rename_replacement": "restored-opensearch-dashboards$1",
  "index_settings": {
    "index.blocks.read_only": false
  },
  "ignore_index_settings": [
    "index.refresh_interval"
  ]
}'


Let’s check the settings of the restored index to confirm the store type is `remote_snapshot`

curl -XGET "http://localhost:9200/my-index/_settings?pretty"

{
  "my-index": {
    "settings": {
      "index": {
        "store": {
          "type": "remote_snapshot"
        }
      }
    }
  }
}
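At this point the index can be queried like any other, even though its data still lives in the AIStor bucket. As a quick smoke test, run a match-all query (again with the default credentials):

curl -XGET "https://localhost:9200/my-index/_search?q=*:*&pretty" -ku 'admin:admin'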

As you can see, it’s pretty simple to configure AIStor as an OpenSearch remote repository.

Take Back Your Logs

By leveraging AIStor as a backend for OpenSearch, you can not only take searchable snapshots, but also create non-searchable (also called ‘local’) snapshots and use them as regular backups that can be restored to other clusters for disaster recovery or be enriched with additional data for further analysis.   
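A traditional local restore uses the same `_restore` API; you simply omit `storage_type` so the data is copied back onto the cluster's local drives. A minimal sketch, renaming the index on the way in so it doesn't collide with the original:

curl -XPOST "https://localhost:9200/_snapshot/my-minio-repository/1/_restore" -ku 'admin:admin' -H 'Content-Type: application/json' -d'
{
  "indices": "my-index",
  "rename_pattern": "my-index",
  "rename_replacement": "my-index-restored"
}'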

All that being said, we need to be aware of certain pitfalls of using a remote repository as a snapshot location. Access speeds are more or less fixed by the speed and performance of AIStor, which is typically gated by network bandwidth. Be aware that in a public cloud such as AWS S3, you may also be charged per request for retrieval, so monitor any costs incurred closely. Searching remote data can also impact the performance of other queries running on the same node, so it is generally recommended to take advantage of node roles and create dedicated nodes with the `search` role for performance-critical applications.

If you have any questions about using OpenSearch with AIStor, reach out to us at hello@min.io.