Howdy,
Recently we had a network incident that caused our indexers to be unavailable for 15+ hours. A change was made on a firewall that caused all inbound ICMP traffic to the indexers to be dropped, so for that period none of the indexers received any data from the forwarders. The forwarders did the right thing: they spooled up the data they were collecting and delivered it to the indexers when they were available again, so we lost no data, except the data collected by the *nix app.
Nine of our machines failed to collect any stats with the *nix app scripts, starting about 15 minutes after the firewall was misconfigured, and started collecting again about 15 minutes after the firewall configuration was corrected. Only a very small amount of log data was collected based on the *nix app's inputs.conf (about 15 lines across all 9 hosts).
To review:
- The forwarders are all running the Universal Forwarder 4.3
- Indexers are running Splunk 4.3.1

So my questions are:
The comment about ICMP confuses me. Splunk forwarders should not need ICMP to be able to reach an indexer. If you'd said TCP then I'd be less confused.
Anyways, this should be (somewhat) expected. Each forwarder has a limited buffer in which to hold data bound for an indexer. It is possible for scripted inputs or network inputs at a forwarder to drop data if those buffers become completely consumed. Settings like indexer acknowledgement and persistent queues help with this, but would probably fail for different reasons during a 15 hour problem.
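For what it's worth, indexer acknowledgement is turned on per output group in outputs.conf on the forwarder. A minimal sketch (the group name and server addresses here are invented for illustration):

```ini
# outputs.conf on the forwarder -- with useACK the forwarder keeps data
# in its wait queue until the indexer confirms it was written, and
# resends it after a connection failure.
[tcpout:primary_indexers]
server = indexer1.example.com:9997, indexer2.example.com:9997
useACK = true
```

Note that useACK enlarges the forwarder's memory footprint (the wait queue) and still won't save you if the queues fill during a very long outage.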
Normal file tailing (monitor:// stanzas) is basically unaffected by this, because there are hooks to make sure the TailingProcessors don't overrun your buffers, and they remember how much of each file has been sent (or something similar to that). This, too, can fall apart over an EXTENDED outage (say, longer than your log rotation/archiving interval).
The docs discuss persistent queues much better than I can: http://docs.splunk.com/Documentation/Splunk/4.3.1/Data/Usepersistentqueues
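As a quick sketch, a persistent queue is configured per input stanza in inputs.conf on the forwarder, so a scripted input spills to disk instead of dropping data when its in-memory queue fills. The script path, interval, and sizes below are made up for illustration:

```ini
# inputs.conf on the forwarder -- when the 1MB in-memory queue fills
# (e.g. because the indexers are unreachable), events overflow into a
# 100MB on-disk queue instead of being dropped.
[script://./bin/cpu.sh]
interval = 60
queueSize = 1MB
persistentQueueSize = 100MB
```

A 15-hour outage could still exhaust whatever persistentQueueSize you pick, so size it against your per-host data rate.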
The blocking of inbound ICMP packets was what was being said around the water cooler, not an official diagnosis of what changed on the firewall. I'm still waiting on that.
The above is an awesome answer! What's still confusing me is that only 9 of 67 hosts didn't store that data for later forwarding. Now I just have to figure out what is different about those 9 hosts.