Putting the Swarm to Work III

After getting the graylog stack working, it was time to add some security: even though elasticsearch was running on an isolated network, some elements, such as haproxy or rsyslog, were exposed to the local network, which meant a flaw in their code could compromise the whole deployment.

Also, while I was working on the deployment, I accidentally discovered that my patch for Andrea Tosatto's docker-swarm role was incomplete, so I added a few lines of code to make it able to remove labels with associated values and also to be idempotent. The final patch looks like this:

--- a/atosatto.docker-swarm/tasks/setup-swarm-labels.yml	2020-05-08 13:29:54.000000000 +0200
+++ b/atosatto.docker-swarm/tasks/setup-swarm-labels.yml	2021-07-24 19:40:51.772921624 +0200
@@ -3,7 +3,7 @@
 - name: Get list of labels.
   command: >-
          docker inspect
-         --format {% raw %}'{{ range $key, $value := .Spec.Labels }}{{ printf "%s\n" $key }}{{ end }}'{% endraw %}
+         --format {% raw %}'{{ range $key, $value := .Spec.Labels }}{{ printf "%s=%s\n" $key $value}}{{ end }}'{% endraw %}
          {{ ansible_fqdn|lower }}
   register: docker_swarm_labels
   changed_when: false
@@ -13,17 +13,18 @@
     - swarm_labels
 
 - name: Remove labels from swarm node.
-  command: docker node update --label-rm {{ item }} {{ ansible_fqdn|lower }}
+  command: docker node update --label-rm {{ item.split('=')[0] }} {{ ansible_fqdn|lower }}
   with_items: "{{ docker_swarm_labels.stdout_lines }}"
-  when: item not in swarm_labels
+  when: 
+    - item.replace('=true', '') not in swarm_labels
   delegate_to: "{{ groups['docker_swarm_manager'][0] }}"
   delegate_facts: true
   tags:
     - swarm_labels
 
 - name: Assign labels to swarm nodes if any.
-  command: docker node update --label-add {{ item }}=true {{ ansible_fqdn|lower }}
-  when: item not in docker_swarm_labels.stdout_lines
+  command: docker node update --label-add {{ (item.find('=') > 0) | ternary( item , item ~ "=true") }} {{ ansible_fqdn|lower }}
+  when: (item.find('=') > 0) | ternary( item , item ~ "=true") not in docker_swarm_labels.stdout_lines
   with_items:
     - "{{ swarm_labels  | default([]) }}"
   delegate_to: "{{ groups['docker_swarm_manager'][0] }}"
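
In case it helps, this is roughly how the patch can be applied; the patch file name is illustrative, and it assumes the role is installed under the playbook's roles directory (matching the a/ and b/ prefixes in the diff):

# Dry run first to check it applies cleanly, then apply for real.
cd roles
patch -p1 --dry-run < swarm-labels.patch && patch -p1 < swarm-labels.patch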

Security on ES 7.13

Security on ES was what anyone would expect from enterprise grade software: users, roles, fine-grained control over objects, integration with authentication sources like AD or LDAP, etc. The main prerequisite for enabling user authentication is to encrypt the transport channel, so the first step was to set up TLS on the transport channel.

Setting up TLS on the transport channel

First of all, I had to create some certificates to be shared among the elasticsearch nodes, so I used a playbook for that, but it can also be done with easyRSA, as I explained last week.

The playbook is called certgen.yml and it’s available on my homelab repo amongst the rest of the files.
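
Since the template below sets verification_mode to none, a single self-signed certificate shared by all the nodes is enough. For reference, this is a minimal openssl sketch of what the playbook produces; the file names and the passphrase variable are illustrative:

# Generate a passphrase-protected key and a self-signed certificate.
openssl req -x509 -newkey rsa:4096 -days 825 \
  -keyout tls/escert.key -out tls/escert.crt \
  -passout pass:"$ES_CERT_PASSWORD" \
  -subj "/CN=elasticsearch"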

Then I had to instruct elasticsearch to use security by adding this block to my compose template:

{% if glstackdeploy_es_security == "true" %}
  xpack.security.transport.ssl.enabled: "true"
  xpack.security.transport.ssl.verification_mode: none
  xpack.security.transport.ssl.client_authentication: required
  xpack.security.transport.ssl.key: escert.key
  xpack.security.transport.ssl.certificate: escert.crt
  xpack.security.transport.ssl.key_passphrase: {{ glstackdeploy_es_certificate_password }}
  xpack.security.enabled: "true"
  ELASTIC_PASSWORD: {{ glstackdeploy_es_elastic_password }}
{% endif %}
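
Once the stack was redeployed, a quick way to confirm that the transport port was actually talking TLS was to look at the certificate it presents (the node name is illustrative; even if the handshake is ultimately rejected for lack of a client certificate, the server certificate should still be printed):

openssl s_client -connect node1:9300 </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -dates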

As the certificate files are small and had to be available on every node, docker swarm configs were the best choice:

configs:
...
{% if glstackdeploy_es_security == "true" %}
  escert:
    file: tls/escert.crt
  escertkey:
    file: tls/escert.key
  ...
{% endif %}

And I mounted them on all ES instances as configs:

{% if glstackdeploy_es_security == "true" %}
  configs:
    - source: escert
      target: /usr/share/elasticsearch/config/escert.crt
      mode: 0444
    - source: escertkey
      target: /usr/share/elasticsearch/config/escert.key
      mode: 0444
    ...
{% endif %}

Setting up TLS on the transport channel was pretty straightforward following the instructions in the official documentation.
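
At this point the HTTP API should reject anonymous requests and accept the bootstrap elastic user, which is easy to check with curl (the node name and password variable are illustrative):

# Expect a 401 without credentials...
curl -s -o /dev/null -w '%{http_code}\n' "http://node1:9200/"
# ...and a normal answer with the elastic superuser.
curl -s -u "elastic:${ELASTIC_PASSWORD}" "http://node1:9200/_cluster/health?pretty"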

Setting up users and roles

I was focused on users, roles and privileges, but also wanted to take the fast lane, so I tried to use the elastic superuser in graylog. It seemed to work: it was able to create the indices and put some data in, but… I was unable to query the data! And it was that way by design: the ES developers try to discourage us from using the superuser for non-administrative tasks.

So I had to create a role and then a user through the API, because I hadn't deployed kibana, which offers a web interface for managing users. The requests were based on the official create role doc page:

curl -X POST -u "elastic:changeme" "http://node1:9200/_security/role/graylog?pretty" -H 'Content-Type: application/json' -d'
{
  "cluster": [ "monitor", "manage_index_templates" ],
  "indices": [
    {
      "names": [ "gl-events_*", "gl-system-events_*", "graylog_*" ],
      "privileges": [ "all" ]
    }
  ]
}
'

and the official create users doc page:

curl -X POST -u "elastic:changeme" "http://node1:9200/_security/user/grayloguser?pretty" -H 'Content-Type: application/json' -d'
{
  "password" : "changemetoo",
  "roles" : [ "graylog" ],
  "full_name" : "Graylog User"
}
'
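
As a quick sanity check, the new user should be able to search the graylog indices, but should get a 403 on anything outside the role definition, such as the security API:

# Searching an allowed index pattern works...
curl -s -u "grayloguser:changemetoo" "http://node1:9200/graylog_*/_search?size=0&pretty"
# ...while a call to the security API is rejected with a 403.
curl -s -o /dev/null -w '%{http_code}\n' -u "grayloguser:changemetoo" "http://node1:9200/_security/user"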

After also changing graylog's configuration, I was able to ingest and query records. But I wanted the deployment of this whole stack to be automated; I wasn't happy about the mongodb initialization script, but it worked, so I made a similar script which spun up a curlimages docker container to perform the queries, as in the sketch below. However, I had already read about the authentication files on ES, so I tried that way instead.
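
This is roughly what that approach looks like; the stack network and service names are illustrative, and $role_json would hold the same JSON body used above:

# Run a throwaway curl container attached to the stack's overlay network.
docker run --rm --network glstack_default curlimages/curl \
  -X POST -u "elastic:changeme" \
  "http://elasticsearch:9200/_security/role/graylog?pretty" \
  -H 'Content-Type: application/json' -d "$role_json"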

File-based authentication is pretty straightforward: you define your roles in $ES_CONF_PATH/roles.yml, map your users to their roles in a Linux /etc/group-like file at $ES_CONF_PATH/users_roles, and put the users' passwords in an /etc/shadow-like file at $ES_CONF_PATH/users, with the passwords hashed using bcrypt (the algorithm is configurable, but bcrypt is the default).

My roles.yml file was generated using the following jinja2 template:

{{ glstackdeploy_es_graylog_rolename }}:
  cluster: 
    - monitor
    - manage_index_templates
  indices: 
    - names:
      - 'gl-events_*'
      - 'gl-system-events_*'
      - 'graylog_*'
      privileges: 
        - all

The users_roles template was also simple:

{{ glstackdeploy_es_graylog_rolename }}:{{ glstackdeploy_gl_elasticsearch_username }}

And also the users template:

{{ glstackdeploy_gl_elasticsearch_username }}:{{ glstackdeploy_gl_elasticsearch_password | password_hash('bcrypt', glstackdeploy_es_password_salt) }}
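
For illustration, with a rolename of graylog and a username of grayloguser, the rendered users_roles and users files end up looking like this (hash truncated; note the $2b$ prefix, which becomes relevant below):

# conf/es_users_roles
graylog:grayloguser
# conf/es_users
grayloguser:$2b$10$...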

I used configs again to make those files available to all nodes:

configs:
...
{% if glstackdeploy_es_security == "true" %}
  ...
  esusers:
    file: conf/es_users
  esroles:
    file: conf/es_roles.yml
  esuserroles:
    file: conf/es_users_roles
{% endif %}

And I mounted them on all ES instances like I did with the certificates:

{% if glstackdeploy_es_security == "true" %}
  configs:
    ...
    - source: esusers
      target: /usr/share/elasticsearch/config/users
      mode: 0444
    - source: esroles
      target: /usr/share/elasticsearch/config/roles.yml
      mode: 0444
    - source: esuserroles
      target: /usr/share/elasticsearch/config/users_roles
      mode: 0444
{% endif %}

Even though elasticsearch started, it wasn't accepting the password. After some trial and error, I set the password using the elasticsearch-users utility, and the main difference was that my ansible-generated hashes started with $2b$10$ while the one generated by the tool started with $2a$10$. I hardcoded the working bcrypt hash in my ansible playbook and it worked, so the problem was there.
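
For reference, this is how the tool can be used inside one of the ES containers (paths are those of the official image; the user, password and role are the ones from above):

# Creates the user with a $2a$ bcrypt hash in config/users
# and the role mapping in config/users_roles.
bin/elasticsearch-users useradd grayloguser -p changemetoo -r graylog
grep grayloguser config/users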

Python passlib's bcrypt function has an ident parameter that lets callers choose between $2a$ and $2b$ hashes, but it's not usable from ansible yet. And I say yet because I pried into ansible's builtin filters and the changes are already committed, so it will be available soon. In the meantime, I needed something to get $2a$ hashes, and I knew how to do it in python, so I hashed the password on the control node:

  vars:
    local_python_interpreter: /usr/bin/python3
  pre_tasks:
    - name: run python to crypt
      command:
        cmd: "{{ local_python_interpreter }} -c 'from passlib.hash import bcrypt; print(bcrypt.hash(\"{{ glstackdeploy_gl_elasticsearch_password }}\", rounds=10, ident=\"2a\"))'"
      delegate_to: localhost
      run_once: yes
      register: crypted

    - name: set fact output
      set_fact:
        glstackdeploy_gl_elasticsearch_password_crypt: "{{ crypted.stdout_lines[0] }}"
      when: crypted is defined

And then I changed the users jinja2 template to use the resulting hash:

{{ glstackdeploy_gl_elasticsearch_username }}:{{ glstackdeploy_gl_elasticsearch_password_crypt }}

After that, I had a graylog stack ready to use.

Conclusions

My objective when I started this exercise was to test my changes to the docker-swarm role on CentOS 7 before submitting them for inclusion, but I discovered the patch wasn't working correctly when I got an unexpected changed result while running the playbook.

I also thought that enabling authentication was going to be a piece of cake because I had already done the TLS part, but then I found out I wasn't supposed to use the elastic user for graylog, and the lab turned into a complete dive into elasticsearch's authentication mechanisms.

But it was a challenging and rewarding exercise, and it opened the gates to testing the new ES7 features that are only available when xpack is enabled.

And, of course, the relevant files are in my homelab repo.