Deployment Architecture

*nix app scripts stopped when indexers are not available?

Path Finder

Howdy,

Recently we had a network incident that caused our indexers to be unavailable for 15+ hours. A change was made on a firewall that caused all inbound ICMP traffic to the indexers to be dropped, so for that period none of the indexers got any data from the forwarders. The forwarders did the right thing: they spooled up the data they were collecting and delivered it to the indexers once they were available again. We lost no data, except the data collected by the *nix app.

Nine of our machines stopped collecting any stats with the *nix app scripts about 15 minutes after the firewall was misconfigured, and started up again about 15 minutes after the firewall configuration was corrected. A very small amount of log data was collected based on the *nix app's inputs.conf (about 15 lines from all 9 hosts).

To review:

  • Indexers go off the air.
  • Forwarders continue to spool data collected for later delivery.
  • Forwarders stop collecting *nix based data, but continue to collect other data.
  • Indexers come back on the air.
  • Forwarders deliver spooled data.
  • Forwarders start collecting *nix based data again.

So my questions are:

  • Is there something in either Splunk or the *nix app that would cause this behavior?
  • If there isn't something known (like a setting) that would have caused this, what did?

The forwarders are all running the Universal Forwarder 4.3

Indexers are running Splunk 4.3.1

SplunkTrust

The comment about ICMP confuses me. Splunk forwarders should not need ICMP to be able to reach an indexer. If you'd said TCP then I'd be less confused.

Anyways, this should be (somewhat) expected. Each forwarder has a limited buffer in which to hold data bound for an indexer. It is possible for scripted inputs or network inputs at a forwarder to drop data if those buffers become completely consumed. Settings like indexer acknowledgement and persistent queues help with this, but would probably fail for different reasons during a 15 hour problem.

Normal file tailing (monitor:// stanzas) is basically unaffected by this, because there are hooks to make sure that the TailingProcessor doesn't overrun your buffers, and it remembers how much of each file has been sent (or something similar to that). This, too, can fall apart over an EXTENDED outage (say, longer than your log rotation/archiving interval).

The docs discuss persistent queues much better than I can: http://docs.splunk.com/Documentation/Splunk/4.3.1/Data/Usepersistentqueues
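
For reference, here is a rough sketch of what those two settings might look like on a forwarder. The script stanza, the queue sizes, the output group name, and the indexer address are all illustrative assumptions, not values from the poster's environment or the *nix app's shipped defaults, so check the docs for your version before relying on any of it.

```
# inputs.conf on the forwarder
# (stanza name, interval, and sizes are assumptions for illustration)
[script://./bin/vmstat.sh]
interval = 60
sourcetype = vmstat
# size of the in-memory queue for this input
queueSize = 1MB
# once the in-memory queue is full, spill to disk up to this size
persistentQueueSize = 100MB

# outputs.conf on the forwarder
# (group name and server address are hypothetical)
[tcpout:my_indexers]
server = indexer1.example.com:9997
# hold on to sent data until the indexer acknowledges it, resending if needed
useACK = true
```

Even with something like this in place, the persistent queue is still bounded by persistentQueueSize, so a long enough outage can fill it and the scripted input can start dropping data again.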

Path Finder

The blocking of inbound ICMP packets was what was being said around the water cooler and not an official diagnosis of what changed on the firewall. I'm still waiting on that.

The above is an awesome answer! What's still confusing me is that only 9 of 67 hosts didn't store that data for later forwarding. Now I just have to figure out what is different about those 9 hosts.
