Renewing KES certificate

MinIO KES (Key Encryption Service) is a service developed by MinIO to bridge the gap between applications that run in Kubernetes and a centralized Key Management Service (KMS). The central KMS server holds all the state, and KES talks to the KMS only when it needs to fetch new keys or update existing ones. Once KES has fetched a key, it caches that key for as long as the key does not need to be updated or deleted, so subsequent calls are much faster.

So why use KES rather than talking to the KMS directly? Depending on the KMS and the load it has to handle, some KMS systems cannot support large deployments in which hundreds, if not thousands, of keys are requested while the Kubernetes cluster puts an enormous load on them. In these situations KES is essential, because it scales horizontally very easily, unlike traditional KMS systems.

All KES operations, both between the application and KES and between KES and the KMS, use mTLS for authentication and authorization. This relies on a public/private key pair and an X.509 certificate. Certificates have a very common problem: they expire, and when they do, services all around fail spectacularly with little rhyme or reason. What do we mean by that?

What we mean is that once the cert expires, you will start to see errors such as these in the KES log:

{"message":"2024/01/04 02:23:21 http: TLS handshake error from 10.244.2.9:32816: remote error: tls: bad certificate"}

{"message":"2024/01/04 02:23:28 http: TLS handshake error from 10.244.3.11:53456: remote error: tls: bad certificate"}

{"message":"2024/01/04 02:23:28 http: TLS handshake error from 10.244.1.9:56722: remote error: tls: bad certificate"}

{"message":"2024/01/04 02:23:28 http: TLS handshake error from 10.244.4.11:34152: remote error: tls: bad certificate"}

{"message":"2024/01/04 02:23:28 http: TLS handshake error from 10.244.2.9:55300: remote error: tls: bad certificate"}

{"message":"2024/01/04 02:23:28 http: TLS handshake error from 10.244.4.11:34160: remote error: tls: bad certificate"}

Also, when MinIO tries to do its periodic IAM refresh, the refresh fails with messages like the following in the MinIO log:

Error: Failure in periodic refresh for IAM (took 0.03s): Post "https://kes-tenant-kes-hl-svc.default.svc.cluster.local:7373/v1/key/decrypt/my-minio-key": x509: certificate has expired or is not yet valid: current time 2024-01-04T02:27:31Z is after 2024-01-04T02:12:40Z (*errors.errorString)

If you are lucky, you will see an obvious message such as "certificate has expired". Other times it is not so obvious: you may hit edge-case failures when creating or deleting keys, among a host of other issues. The quickest solution is to renew the cert and update KES with it as soon as possible. In this post we’ll show you exactly how to do that.
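Before renewing, it can help to confirm that expiry really is the cause. The following is a minimal sketch, not part of the original procedure: `check_cert_expiry` is a hypothetical helper name, and it assumes `openssl` is installed and that the cert has been saved locally as a PEM file.

```shell
# Print the expiry date of a PEM certificate and report whether it stays
# valid for at least the given number of seconds (default: 0, i.e. "now").
check_cert_expiry() {
  cert_file="$1"
  min_seconds="${2:-0}"
  # Show the notAfter date; fail early if the file is not a readable cert.
  openssl x509 -in "$cert_file" -noout -enddate || return 2
  # -checkend exits non-zero if the cert expires within min_seconds.
  if openssl x509 -in "$cert_file" -noout -checkend "$min_seconds" >/dev/null; then
    echo "OK: valid for at least ${min_seconds}s"
  else
    echo "EXPIRING: expires within ${min_seconds}s (or has already expired)"
    return 1
  fi
}

# Example usage (assumes the Secret lives in the current namespace):
#   kubectl get secret kes-tenant-kes-tls -o jsonpath='{.data.public\.crt}' | base64 -d > current.crt
#   check_cert_expiry current.crt 86400
```

Passing a window such as 86400 (one day) makes the same check usable as an early warning rather than only after things have already failed.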

How to Renew

Let’s first start by creating a new private key

openssl genrsa -out private.key 2048

Create a file called cert.cnf which will be used by openssl to create the Certificate Signing Request (CSR)

[req]

distinguished_name = req_distinguished_name

req_extensions = req_ext

prompt = no


[req_distinguished_name]

O = "system:nodes"

C = US

CN  = "system:node:*.kes-tenant-kes-hl-svc.default.svc.cluster.local"


[req_ext]

subjectAltName = @alt_names


[alt_names]

DNS.1 = kes-tenant-kes-0.kes-tenant-kes-hl-svc.default.svc.cluster.local

DNS.2 = kes-tenant-kes-hl-svc.default.svc.cluster.local

Be sure to modify the Common Name (CN) and the Subject Alternative Names (SANs, under [alt_names]) to match the FQDNs of your KES nodes. Use proper FQDNs, not IP addresses.

Create the CSR using the command below

openssl req -new -config cert.cnf -key private.key -out kes.csr
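Since a wrong CN or SAN only shows up later as a handshake failure, it may be worth inspecting the CSR before submitting it. This is a small sketch under the assumption that `openssl` is available; `show_csr_names` is a hypothetical helper name, not something provided by KES or Kubernetes.

```shell
# Print the subject and every DNS SAN embedded in a CSR so they can be
# compared against the FQDNs of the KES pods and the headless service.
# Returns non-zero if the CSR cannot be read or contains no DNS SANs.
show_csr_names() {
  csr_file="$1"
  openssl req -in "$csr_file" -noout -subject || return 1
  # The text dump lists SANs as "DNS:name, DNS:name"; print one per line.
  openssl req -in "$csr_file" -noout -text | grep -o 'DNS:[^, ]*'
}

# Example usage:
#   show_csr_names kes.csr
```

For the cert.cnf shown earlier, the output should list both entries from [alt_names].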

Convert the CSR into an encoded string so it can be added to Kubernetes as a CertificateSigningRequest resource.

cat kes.csr | base64 | tr -d "\n"

Create a file kes-csr.yaml with the content below and paste the encoded CSR from the previous step into the request field. The request value has been truncated here so you can see the entire YAML.

apiVersion: certificates.k8s.io/v1

kind: CertificateSigningRequest

metadata:

  name: kes-csr

spec:

  expirationSeconds: 604800

  groups:

  - system:serviceaccounts

  - system:serviceaccounts:minio-operator

  - system:authenticated

  - system:nodes

  request: LS0tLS1CRUdJTiBDRV…FUVVFU1QtLS0tLQo=

  signerName: kubernetes.io/kubelet-serving

  usages:

  - digital signature

  - key encipherment

  - server auth

  username: system:serviceaccount:minio-operator:minio-operator

Be sure to set expirationSeconds high enough that the cert doesn’t expire again very soon. The 604800 in the example above is only seven days.
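The value is plain arithmetic: days multiplied by the number of seconds in a day. The sketch below uses 365 days purely as an assumed example. Note that expirationSeconds is a request, and the signer may issue a shorter validity (for the built-in signers this is capped by the kube-controller-manager `--cluster-signing-duration` flag), so it is worth checking the expiry date on the issued cert afterwards.

```shell
# Convert a validity period in days into seconds for expirationSeconds.
days=365   # assumed example value; choose what fits your rotation policy
expiration_seconds=$((days * 24 * 60 * 60))
echo "expirationSeconds: ${expiration_seconds}"
# prints: expirationSeconds: 31536000
```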

Once the encoded CSR has been added and other settings have been set, apply the yaml.

kubectl apply -f kes-csr.yaml

Be sure to approve the kes-csr CSR created above

kubectl certificate approve kes-csr

Extract the signed public cert from the CSR resource

kubectl get csr kes-csr -o jsonpath='{.status.certificate}'| base64 -d > public.crt
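If public.crt comes out empty, the CSR has most likely not been issued yet; `kubectl get csr kes-csr` should show a condition of Approved,Issued before the cert can be extracted. Once the file is populated, a quick sanity check is to confirm that it belongs to the private key generated at the start. The following is a sketch only, assuming `openssl` is installed; `key_matches_cert` is a hypothetical helper name.

```shell
# Succeed (exit 0) only if the certificate was issued for the given
# private key, by comparing the public keys derived from each file.
key_matches_cert() {
  key_file="$1"
  cert_file="$2"
  # Public key derived from the private key, in PEM form.
  key_pub=$(openssl pkey -in "$key_file" -pubout) || return 2
  # Public key embedded in the certificate, in PEM form.
  cert_pub=$(openssl x509 -in "$cert_file" -noout -pubkey) || return 2
  [ "$key_pub" = "$cert_pub" ]
}

# Example usage:
#   key_matches_cert private.key public.crt && echo "key and cert match"
```

A mismatch here usually means the CSR was created from a different key than the one about to be stored in the Secret.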

Convert both the private.key (from the beginning of the process) and public.crt (from the previous step) to an encoded string.

cat private.key | base64 | tr -d "\n"

cat public.crt | base64 | tr -d "\n"

Using the encoded strings from above, we’ll update the existing Secret kes-tenant-kes-tls. To do that, follow the steps below.

Back up the existing Secret, which holds the expired cert, to a file.

kubectl get secret kes-tenant-kes-tls -o yaml > kes-tls-secret.yaml

Once you have backed up the existing secret, delete it

kubectl delete secret kes-tenant-kes-tls

Open kes-tls-secret.yaml with the expired certs and replace the following two fields with their respective base64 encoded strings.

data:
  private.key: >-
    LS0tLS1CRUd…ZLS0tLS0
  public.crt: >-
    LS0tLS1CRUdJTi…tLS0K

Once the new key and cert have been added, apply the Secret, which will recreate kes-tenant-kes-tls

kubectl apply -f kes-tls-secret.yaml

Once a valid cert is in place, be sure to restart the KES service. You should then see output like this:

'http://vault.default.svc.cluster.local:8200' ...

Endpoint: https://127.0.0.1:7373    https://10.244.4.16:7373  


Admin: _ [ disabled ]

Auth: off   [ any client can connect but policies still apply ]


Keys: Hashicorp Vault: http://vault.default.svc.cluster.local:8200


CLI:  export KES_SERVER=https://127.0.0.1:7373

       export KES_CLIENT_KEY=   // e.g. $HOME/root.key

       export KES_CLIENT_CERT=  // e.g. $HOME/root.cert

       kes --help
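The restart mentioned above can be done in several ways. One option, sketched below, is to delete the KES pod so that its controller recreates it with the updated Secret mounted. The default pod name kes-tenant-kes-0 is an assumption inferred from the SAN used earlier, and `restart_kes_pods` is a hypothetical helper name; list your pods with `kubectl get pods` first and pass the real names.

```shell
# Delete the named KES pods so they are recreated and pick up the
# recreated Secret. With no arguments, falls back to the pod name
# assumed from the SAN in cert.cnf (kes-tenant-kes-0).
restart_kes_pods() {
  [ "$#" -gt 0 ] || set -- kes-tenant-kes-0
  for pod in "$@"; do
    kubectl delete pod "$pod" || return 1
  done
}

# Example usage:
#   restart_kes_pods
#   kubectl logs kes-tenant-kes-0   # once the pod is running again
```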

The MinIO log should also be clean and no longer show any TLS errors.

Waiting for all MinIO sub-systems to be initialized.. lock acquired

Automatically configured API requests per node based on available memory on the system: 221

All MinIO sub-systems initialized successfully in 15.44125ms

MinIO Object Storage Server

Copyright: 2015-2024 MinIO, Inc.

License: GNU AGPLv3

Version: RELEASE.2024-01-04T09-40-09Z (go1.19.4 linux/arm64)


Status:     4 Online, 0 Offline.

API: https://minio.default.svc.cluster.local

Console: https://10.244.3.12:9443 https://127.0.0.1:9443   


Documentation: https://min.io/docs/minio/linux/index.html

Final Thoughts

KES is an integral part of managing the keys used to encrypt objects. Objects need to be encrypted and decrypted as quickly as possible, because every nanosecond spent on these operations delays delivery of the object to the end user. Ultimately, inefficient and slow KMS systems can degrade overall cluster performance, so it's paramount that the service performing these actions is fast, lean, performant and scalable. MinIO’s KES enables any KMS to be a high-performance, scalable service without any modification to the existing KMS. By following the steps above, you can get KES back to having valid, unexpired certs in no time!

If you have any questions on KES be sure to reach out to us on Slack!