Elasticsearch on Docker

Happy New Year, folks! It’s been three weeks since my last post; I have to admit I took some days off for Christmas.

These last few days I switched my focus because of some work-related needs and went back to Docker to design and deploy an Elasticsearch cluster according to a set of requirements.

The task

The cluster had to be deployed on some already purchased hardware, split across two hardware nodes, each hosting 6 Elasticsearch nodes in a hot-warm design. These nodes had to join an existing cluster and be able to accommodate the existing indices, so that the original nodes could be removed.

And all of this had to be working in production today, the day after I went back to work.

Preparation

I have a lot of experience configuring Elasticsearch, and also a lot of experience deploying Graylog + Elasticsearch with docker-compose as a lab environment for testing, so the challenge here was how to do that in production within a working day.

I ended up writing a shell script (sketched right after this list) which:

  1. Installs some packages using yum.
  2. Downloads the latest docker-compose release.
  3. Pulls the Docker images for Elasticsearch and Cerebro.
  4. Creates the docker-compose.yaml file and the folder hierarchy.
  5. Populates /etc/fstab with bind mounts for the real filesystems.
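
A trimmed-down sketch of that script, with placeholder package names, paths and versions rather than the real production values, looks roughly like this:

#!/bin/bash
set -euo pipefail

# 1. Packages (names are examples)
yum install -y docker firewalld nrpe
systemctl enable --now docker

# 2. A pinned docker-compose release (version is just an example)
curl -L "https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose

# 3. Pre-pull the images (repositories and tags are placeholders)
docker pull docker.elastic.co/elasticsearch/elasticsearch:5.6.14
docker pull lmenezes/cerebro

# 4. Folder hierarchy plus the generated compose file
mkdir -p /opt/escluster/{hot01,hot02,warm01,warm02}/data
cat > /opt/escluster/docker-compose.yaml <<'EOF'
# ... generated compose file, see below ...
EOF

# 5. Bind mounts for the real filesystems (example entry)
echo '/data/hot01 /opt/escluster/hot01/data none bind 0 0' >> /etc/fstab
mount -a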

I had been interested in finding out about templating in compose files for a while, but until now every service I defined was different from the others, so there was no need. In this case, though, all the services were going to share a lot of configuration, so after some googling I learned about YAML anchors and how to use them in compose files through extension fields, which guaranteed me at least one #learnbydoing on this project.
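
Just as an illustration (not the real file, and with a placeholder image tag), an extension field holding a YAML anchor lets every node reuse the same base definition and override only what differs:

version: "3.4"

# Extension field: shared settings for every Elasticsearch node
x-es-node: &es-node
  image: docker.elastic.co/elasticsearch/elasticsearch:5.6.14   # placeholder tag
  restart: unless-stopped
  ulimits:
    memlock:
      soft: -1
      hard: -1

services:
  es-hot-01:
    <<: *es-node                 # merge the anchored defaults
    container_name: es-hot-01
    environment:
      - node.attr.box_type=hot   # attribute used later for hot-warm routing
    volumes:
      - ./hot01/data:/usr/share/elasticsearch/data

  es-warm-01:
    <<: *es-node
    container_name: es-warm-01
    environment:
      - node.attr.box_type=warm
    volumes:
      - ./warm01/data:/usr/share/elasticsearch/data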

I tested it on my home computer using a couple of VirtualBox VMs, and after some fine-tuning I got it working smoothly.

But then I realized I was using the official Elasticsearch 5.6 images, which come with X-Pack preinstalled. You can disable some of its features using environment variables, but it is not removed, and you also have to register for at least a free license and keep track of the license renewal.
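
For reference, on those images the bundled features can be switched off from the environment section of each service with the documented xpack.* settings; treat the exact list as an assumption for your version:

environment:
  - xpack.security.enabled=false
  - xpack.monitoring.enabled=false
  - xpack.watcher.enabled=false
  - xpack.ml.enabled=false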

We are also using two additional plugins, Dataformat and mapper-size, so the official image didn’t fit my needs. I pulled the elasticsearch-docker GitHub repo, made my changes to the Dockerfile template and built my own image, which is available on Docker Hub as juanjovlc2/elasticsearch-oss.
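
The change itself is small: the Dockerfile template just needs to install the extra plugins at image build time, something along these lines (a sketch, not the repo’s actual template):

# Added to the Dockerfile template: install the extra plugins at build time.
# mapper-size is a core plugin; the other one installs from whatever plugin
# id or URL applies to your Elasticsearch version.
RUN bin/elasticsearch-plugin install --batch mapper-size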

When I got back to work yesterday I copied my scripts to the hardware where my colleague Oscar had previously installed CentOS 7, wrote in the real IP addresses, and in a couple of minutes I had the servers ready to run the Elasticsearch nodes on Docker.

The startup test

Although this solution had been thoroughly tested in my home lab, I wanted to be sure everything was in place before messing around with the existing data.

So I changed the cluster name in my docker-compose.yaml file and added a couple of direct rules to firewalld to prevent any outgoing communication to the existing nodes.
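
The rules were something along these lines (addresses are placeholders), rejecting outgoing traffic to the transport port of the old nodes:

# Placeholder addresses: block outgoing transport traffic to the old nodes
firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0 -d 192.0.2.10 -p tcp --dport 9300 -j REJECT
firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0 -d 192.0.2.11 -p tcp --dport 9300 -j REJECT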

Then I launched docker-compose up -d && docker-compose logs -f and enjoyed the smooth start of my new nodes.

Adding monitoring

The new nodes are production material, so some monitoring had to be in place. I’m talking about availability monitoring; observability will be the next step.

At my company we use Nagios, and the best plugin for Docker monitoring is Tim Laurence’s check_docker.

If you are using NRPE, I suggest using the --present --status running params in your command definition to prevent getting UNKNOWN status responses.

I used:

command[check_container]=/usr/lib64/nagios/plugins/check_docker --present --status running --cpu 70:80 --containers $ARG1$ --memory $ARG2$:GB

I hardcoded the CPU thresholds because they are always a percentage check, but I wanted the memory thresholds as absolute numbers so I could graph the memory consumption.
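
On the Nagios server side the arguments come from the service definition. A hypothetical pairing (host name, container name and the 24 GB threshold are made up) could look like this:

# check_nrpe passes its -a values as $ARG1$, $ARG2$... to the remote command.
# Argument passing also needs dont_blame_nrpe=1 in nrpe.cfg.
define command {
    command_name    check_nrpe_args
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$ $ARG3$
}

define service {
    use                     generic-service
    host_name               es-hw-01
    service_description     Docker container es-hot-01
    check_command           check_nrpe_args!check_container!es-hot-01!24
}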

If you are using CentOS 7 and didn’t create the docker group before starting the Docker daemon, your /var/run/docker.sock will be owned by root:root instead of root:docker; in that case you should use sudo in your command definition and add the corresponding rule to your sudoers file:

echo 'nrpe ALL=(ALL) NOPASSWD:/usr/lib64/nagios/plugins/check_docker' >> /etc/sudoers.d/nrpe
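
With that rule in place, the command definition shown earlier simply gains the sudo prefix:

command[check_container]=sudo /usr/lib64/nagios/plugins/check_docker --present --status running --cpu 70:80 --containers $ARG1$ --memory $ARG2$:GB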

And if this is the first time you use sudo in an NRPE command definition, don’t forget to enable the corresponding SELinux boolean, another #learnbydoing experience thanks to sealert:

setsebool -P nagios_run_sudo 1

This requires the nrpe-selinux package.

Connecting the new nodes with the old ones

The moment of truth came and I had to break the “don’t change anything in production on Fridays” rule because the deadline had to be met, so I brought down the containers, cleared all the local data, put the real cluster name in the compose file and removed the firewalld rules.

I’ve scaled up several Elasticsearch clusters before, and the automatic shard rebalancing is very helpful, but in this case I was introducing a hot-warm design to reduce costs while extending data retention, plus allocation awareness to prevent data loss in case of a hardware failure, so I disabled shard rebalancing and brought up the new nodes.
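
Disabling the rebalancing is a single transient cluster setting; something like this (the endpoint is a placeholder):

# Stop automatic shard rebalancing so data only moves when explicitly told to
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}'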

It worked.

Once all the nodes recognized each other as peers, it was time to move the data. First I moved the oldest index to the warm zone and checked that the data was still accessible, then I moved the second oldest one to the hot zone, and it worked too. The next step was a little scarier: I moved the active write index, the graylog_deflector, to the hot zone while keeping an eye on the ingest rate, and it worked too.
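
Each of those moves is just an index-level allocation filter; for example (index name and endpoint are placeholders, and I’m assuming the node attribute is called box_type as in the earlier compose sketch):

# Pin a (placeholder) index to the warm nodes
curl -XPUT 'http://localhost:9200/graylog_0/_settings' -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.require.box_type": "warm"
}'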

Then I changed the index template so that new indices created by Graylog would be allocated to the hot zone, configured the allocation awareness attribute on the cluster, marked the remaining indices to be moved to the hot zone, and waited for Elasticsearch to move all the shards.
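
As an illustration (template name, index pattern and awareness attribute are placeholders), those two changes look roughly like this:

# New graylog indices get created on hot nodes
# (on Elasticsearch 6.x+ the "template" key becomes "index_patterns")
curl -XPUT 'http://localhost:9200/_template/graylog-hot' -H 'Content-Type: application/json' -d '
{
  "template": "graylog_*",
  "order": 10,
  "settings": {
    "index.routing.allocation.require.box_type": "hot"
  }
}'

# Spread replicas across the two hardware nodes; every node must also set
# a matching node.attr.rack_id (one value per physical host)
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "rack_id"
  }
}'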

In the meantime I tested the Nagios notifications for the container checks.

It was past my normal working hours and the active write index hadn’t rotated yet, so I rotated it manually from the Graylog interface and checked in the Cerebro interface where the new shards had been created: they were on the hot zone.

Conclusion

  • cluster up
  • data moved to new nodes
  • hot-warm working
  • allocation awareness working
  • index template working
  • containers monitored
  • notifications working
  • DEADLINE MET!

Let’s call it a day!