Deployment Architecture

Docker image search cluster configuration fails in splunk-ansible: 'FAILED - RETRYING: Destructive sync search head'

Explorer

We're using the Docker images at https://hub.docker.com/r/splunk/splunk to install Splunk in Kubernetes. We're currently on 7.2.4, and are preparing to upgrade to 7.2.9.1.

The configuration stage (using splunk-ansible) of the search cluster is failing for at least the following versions:

  • 7.2.9
  • 7.3.3

The log for each of the search cluster members shows:

FAILED - RETRYING: Destructive sync search head

We have tested the following versions and found that they do not exhibit this behaviour, and deploy a working search cluster:

  • 7.2.4
  • 7.2.5
  • 7.2.6
  • 7.2.7

(7.2.8 is broken in a totally different way; all the containers die almost immediately with 'ERROR: Couldn't read "/opt/splunk/etc/splunk-launch.conf"')

My question is: does anyone here have 7.2.9 or 7.3.3 working using the Docker containers with a search cluster, and if so, can they please share the secret?

Thanks,
Rich

Motivator

I can at least solve the 'ERROR: Couldn't read "/opt/splunk/etc/splunk-launch.conf"' error for you.

The Docker image assumes the Splunk user has read/write access to the /opt directory. Unless you're running Splunk as root, or as a user with sudo privileges, this is almost never the case. To resolve it you'll need to update the Dockerfile and rebuild the image. Below are the settings I currently use; you can diff them against yours (or the defaults).

Note: my build is based on CentOS 7.6 and may vary on other flavors. This is not the complete Dockerfile, but it contains the lines needed to resolve the error you are getting (and my IP is redacted, obviously).

ENV SPLUNK_HOME /opt/splunk
ENV SPLUNK_GROUP splunk
ENV SPLUNK_USER splunk
ENV SPLUNK_BACKUP_DEFAULT_ETC /var/opt/splunk
ARG CENTOS_FRONTEND=noninteractive

# add splunk:splunk user
RUN groupadd -r ${SPLUNK_GROUP} \
    && useradd -r -m -g ${SPLUNK_GROUP} ${SPLUNK_USER}

# make the "en_US.UTF-8" locale so splunk will be utf-8 enabled by default
ENV LANG en_US.utf8

# Download Splunk release from local server, it is too big to be part of the repo
# Also backup etc folder, so it will be later copied to the linked volume
RUN mkdir -p ${SPLUNK_HOME} \
    && wget -qO /tmp/${SPLUNK_FILENAME} http://xx.xxx.xx.xx/splunk/${SPLUNK_FILENAME} \
    && tar xzf /tmp/${SPLUNK_FILENAME} --strip 1 -C ${SPLUNK_HOME} \
    && rm /tmp/${SPLUNK_FILENAME} \
    && rm /tmp/${SPLUNK_FILENAME}.md5 \
    && mkdir -p /var/opt/splunk \
    && cp -R ${SPLUNK_HOME}/etc ${SPLUNK_BACKUP_DEFAULT_ETC} \
    && rm -fR ${SPLUNK_HOME}/etc \
    && chown -R ${SPLUNK_USER}:${SPLUNK_GROUP} ${SPLUNK_HOME} \
    && chown -R ${SPLUNK_USER}:${SPLUNK_GROUP} ${SPLUNK_BACKUP_DEFAULT_ETC}

COPY entrypoint.sh /sbin/entrypoint.sh
RUN chmod +x /sbin/entrypoint.sh

# Ports Splunk Web, Splunk Daemon, KVStore, Splunk Indexing Port, Network Input, HTTP Event Collector
EXPOSE 8000/tcp 8089/tcp 8191/tcp 9997/tcp 1514 8088/tcp

WORKDIR /opt/splunk

# Configurations folder, var folder for everything (indexes, logs, kvstore)
VOLUME [ "/opt/splunk/etc", "/opt/splunk/var" ]

These are basically the two most critical lines from above:

 && chown -R ${SPLUNK_USER}:${SPLUNK_GROUP} ${SPLUNK_HOME} \
 && chown -R ${SPLUNK_USER}:${SPLUNK_GROUP} ${SPLUNK_BACKUP_DEFAULT_ETC} \

Run docker build and redeploy your image. This should fix it for you.
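For reference, the rebuild-and-redeploy step looks roughly like this. The registry, image tag, and workload/container names below are placeholders for illustration; substitute your own:

```shell
# Rebuild the image from the corrected Dockerfile
docker build -t my-registry.example.com/splunk:7.2.8-fixed .

# Push it somewhere your Kubernetes nodes can pull from
docker push my-registry.example.com/splunk:7.2.8-fixed

# Point the workload at the new image (StatefulSet and container names are placeholders)
kubectl set image statefulset/search splunk=my-registry.example.com/splunk:7.2.8-fixed
```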


Explorer

@codebuilder thanks for your answer - does that only apply for 7.2.8, or will it work for 7.2.9 or 7.3.3?


Motivator

It should apply to all versions.


Motivator

It will likely resolve your k8s clustering issues as well. If not, don't forget to expose your deployment via an Ingress or a NodePort service. I prefer the latter.
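A NodePort exposure can be done straight from kubectl. This is a sketch with placeholder names, assuming the search heads run as a Deployment called `splunk-search`:

```shell
# Expose Splunk Web (8000) via a NodePort service;
# Kubernetes assigns a port from the 30000-32767 range
kubectl expose deployment splunk-search --name=splunk-web --type=NodePort --port=8000

# Look up which node port was assigned
kubectl get service splunk-web -o jsonpath='{.spec.ports[0].nodePort}'
```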


It does NOT work for me on 7.3.3 either. It fails running this Ansible task (splunk-ansible/roles/splunk_common/tasks/wait_for_splunk_instance.yml):


- name: Check Splunk instance is running
  uri:
    url: "{{ cert_prefix }}://{{ splunk_instance_address }}:{{ splunk.svc_port }}/services/server/info?output_mode=json"
    method: GET
    user: "{{ splunk.admin_user }}"
    password: "{{ splunk.password }}"
    validate_certs: false
  register: task_response
  until:
    - task_response.status == 200
    - lookup('pipe', 'date +"%s"')|int - task_response.json.entry[0].content.startup_time > 10
  retries: "{{ retry_num }}"
  delay: 30
  ignore_errors: true
  no_log: "{{ hide_password }}"

Motivator

Rebuild your image using the Dockerfile suggestions in my reply below. It will correct any permission issues and expose the necessary ports.


Explorer

We were very much hoping to use the official Splunk images so we can avoid the support burden. However, if we can't get them working, I will, thanks 🙂


Explorer

I think the problem is related to the mgmt_uri parameter in server.conf.

In 7.2.4:

root@search-0:/opt/splunk# grep mgmt_uri etc/system/local/server.conf
mgmt_uri = https://search-0.search:8089

In 7.2.9:

root@search-0:/opt/splunk# grep mgmt_uri etc/system/local/server.conf
mgmt_uri = https://search-0.search.splunk-mycompany-internal-stg-3.svc.cluster.local:8089

And also in 7.2.9:

root@search-0:/opt/splunk# grep ERROR var/log/splunk/splunkd.log  | tail -n 1
12-11-2019 12:07:33.953 +0000 ERROR SHCRaftConsensus - Mismatch in mgmt_uri and server URI provided to LEADER. Check URI strings in set_configuration mgmt_uri = https://search-0.search.splunk-mycompany-internal-stg-3.svc.cluster.local:8089 remote_server_name = https://search-0.search:8089

So the problem becomes: how do we set this correctly during the container creation process, given that the ConfigMap used by the deploy container doesn't seem to be able to do that?
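I haven't verified this against the 7.2.9 images, but one possible workaround is to re-assert the short-form URI on each member after the containers come up, using Splunk's shcluster-config CLI. The URI, hostnames, and admin credentials below are illustrative placeholders:

```shell
# On each affected search head member, set mgmt_uri back to the short form
# that the captain expects (credentials and URI are placeholders)
/opt/splunk/bin/splunk edit shcluster-config \
    -mgmt_uri https://search-0.search:8089 \
    -auth admin:changeme

# Restart so the SHC raft consensus picks up the corrected URI
/opt/splunk/bin/splunk restart
```

This only patches running members; a value that survives pod recreation would still need to be injected during provisioning.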


Explorer

The problem still exists in the 7.2.9.1 image. Does anyone use these images?
