Splunk logging Driver Bringing Down the Entire Docker Swarm Cluster

eygtmbot — Mon, 23 Apr 2018 23:20:48 GMT

Hello,

We implemented Docker log collection using the Splunk logging driver. It pushes the Docker logs just fine, but we now have a bigger problem.

Suppose my Splunk indexer is down while Docker containers are spinning up. Those containers cannot establish a connection to the Splunk indexer machine, and that hangs the entire Docker engine on the host: you can no longer execute any docker commands on those machines. To fix this I had to restart the VM; restarting the docker service does not help.

How can I mitigate this?

Is this a Docker issue or a Splunk one?

Here is the swarm stack file I'm using:

version: '3'
services:
  worker:
    image: "${DOCKER_IMAGE_PATH}/worker:${RELEASE_TAG}"
    deploy:
      replicas: 3
    build:
      context: ../../
      dockerfile: ../Dockerfile-worker
    environment:
    ports:
      - "8083:3000"
    logging:
       driver: splunk
       options:
          splunk-url: "${SPLUNK_URL}"
          splunk-token: "${SPLUNK_TOKEN}"
          splunk-insecureskipverify: "true"
          tag: "{{.Name}}/{{.ID}}"
          labels: "NEurope"
          env: "${TARGET_NAME}"

If the Splunk driver works like this, I will need to rebuild/restart the Docker containers every time the Splunk server (indexer) restarts.

Thanks,
Kiran

Re: Splunk logging Driver Bringing Down the Entire Docker Swarm Cluster

outcoldman — Tue, 24 Apr 2018 01:15:18 GMT

Do you run your Splunk indexer on the same Docker Swarm cluster that you are sending logs from? You may want to separate your infra and prod clusters.

Crashes or hangs after a Splunk indexer restart are not expected behavior and should be reported on the Docker repository: http://github.com/moby/moby

If you have only one indexer, I would suggest creating a fleet of Splunk heavy forwarders (see http://dev.splunk.com/view/event-collector/SP-CAAAE73). That way, when you need to restart the Splunk cluster, you can restart the nodes one by one.

If you don't mind paid solutions, I can suggest our solution for monitoring and log forwarding, https://www.outcoldsolutions.com/, where we implemented log forwarding on top of the default JSON logging driver, so it has no effect on Docker Swarm. You also get application monitoring. Installation instructions are at https://www.outcoldsolutions.com/docs/monitoring-docker/ and you can try it for free, as our images have a built-in trial license.

Re: Splunk logging Driver Bringing Down the Entire Docker Swarm Cluster

eygtmbot — Tue, 24 Apr 2018 22:19:26 GMT

Hello,
No, I'm not running the Splunk indexer on the swarm cluster. It is a stand-alone machine sitting outside the cluster.

I believe this is happening because we have some timeouts on the Splunk indexer machine.

I can see some timeout errors in the Docker engine logs. Is Docker going to hang on every timeout?

Even a cluster with multiple heavy forwarders is not going to help, because you can still get timeouts caused by the network.

Please let me know if you have any thoughts...!

We are already in the process of procuring Splunk; at the moment we don't have direct support.

Thanks,
Kiran

Re: Splunk logging Driver Bringing Down the Entire Docker Swarm Cluster

outcoldman — Wed, 25 Apr 2018 02:19:03 GMT

Having multiple indexers will help with indexer availability, but will not solve the networking problem. You can also install heavy forwarders on the same node, so you no longer have networking issues. Those forwarders will send data to the indexers when they are available.

The hang you are experiencing is unexpected. My guess is that the Splunk logging driver does not set a read timeout, so when the connection is dropped on one end but never closed on the driver's side, the driver waits indefinitely for a response. It does not look like the Splunk logging driver sets a timeout on its http.Client (https://github.com/moby/moby/blob/master/daemon/logger/splunk/splunk.go#L223), so you could send a PR to add one; see https://golang.org/pkg/net/http/#Client

That should solve this problem partially.

But again, I suggest you take a look at our solution, as our log forwarding does not depend on the Splunk logging driver: you write the logs in JSON, and our collector tails the JSON logs and forwards them to Splunk. We have a free 30-day trial. Give it a try, or send us an email at sales@outcoldsolutions.com to learn more; we can schedule a call and discuss all the issues you are experiencing.