Now this could be a case of RTFM, but I can't find this in TFM 🙂
I am trying to find some documentation on what the Universal Forwarder does when it can't connect to an indexer for various scenarios.
I am aware of indexer acknowledgement; however, that is really not what I want to know. I want to know how the UF will behave in various scenarios.
http://docs.splunk.com/Documentation/Splunk/6.4.1/Forwarding/Protectagainstlossofin-flightdata
So it is well documented that for TCP, UDP, and scripted inputs.conf stanzas there are buffer and queue settings:
queueSize = <integer>[KB|MB|GB]
* Maximum size of the in-memory input queue.
* Defaults to 500KB.
persistentQueueSize = <integer>[KB|MB|GB|TB]
* Maximum size of the persistent queue file.
* Defaults to 0 (no persistent queue).
* If set to some value other than 0, persistentQueueSize must be larger than
the in-memory queue size (set by queueSize attribute in inputs.conf or
maxSize settings in [queue] stanzas in server.conf).
* Persistent queues can help prevent loss of transient data. For information on
persistent queues and how the queueSize and persistentQueueSize settings
interact, see the online documentation.
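To make the interaction between the two settings concrete, here is a hypothetical inputs.conf stanza (the port and sizes are made up for illustration) that pairs a small in-memory queue with a larger on-disk persistent queue, following the rule quoted above that persistentQueueSize must exceed queueSize:

```ini
# Hypothetical example: a TCP input with a 1MB in-memory queue that
# spills to a 100MB on-disk persistent queue while the indexer is
# unreachable. persistentQueueSize must be larger than queueSize.
[tcp://9999]
queueSize = 1MB
persistentQueueSize = 100MB
```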
Ref: http://docs.splunk.com/Documentation/Splunk/latest/Admin/Inputsconf
If the indexer is not available, these queues will gradually fill until the indexer comes back online or the queues are full at which point other settings determine the behaviour of the UF.
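The fill-then-block behaviour described above can be sketched as a toy model (pure illustration, not Splunk internals): events land in a bounded in-memory queue, overflow spills to a persistent queue if one is configured, and the input blocks once both are full.

```python
from collections import deque

class ForwarderQueue:
    """Toy model of a UF input queue: a bounded in-memory queue with an
    optional persistent (on-disk) overflow. All sizes are in bytes."""

    def __init__(self, queue_size, persistent_size=0):
        self.queue_size = queue_size            # like queueSize
        self.persistent_size = persistent_size  # like persistentQueueSize (0 = none)
        self.memory = deque()
        self.persistent = deque()
        self.mem_bytes = 0
        self.disk_bytes = 0

    def offer(self, event: bytes) -> bool:
        """Return True if the event was queued, False if the input blocks."""
        if self.mem_bytes + len(event) <= self.queue_size:
            self.memory.append(event)
            self.mem_bytes += len(event)
            return True
        if self.disk_bytes + len(event) <= self.persistent_size:
            self.persistent.append(event)
            self.disk_bytes += len(event)
            return True
        return False  # both queues full: the input is blocked

# With the indexer "down", the default 500KB in-memory queue and no
# persistent queue stops accepting data after roughly 500KB of events:
q = ForwarderQueue(queue_size=500 * 1024)
accepted = 0
while q.offer(b"x" * 1024):  # 1KB events
    accepted += 1
print(accepted)  # 500 events, i.e. 500KB
```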
What I can't find any documentation on is what happens to inputs for file monitor, WinEventLog, perfmon etc... when the indexer goes offline.
I'm confident that the indexing of these inputs pauses until the indexer returns. There is no need for a buffer because these events are already stored on the filesystem or within the OS. However I would like to find some documentation to show a customer who wants to know what the official behaviour is in this situation.
Then there is the other concern of rotating logfiles (while the indexer is offline) which is addressed in another Splunk answer (which I can't find right now)
When the indexer goes offline, if you're using indexer acknowledgement, those files and wineventlogs will be sent to the indexer until they are acknowledged as being received. So if your forwarder dies out for some reason and data in the in-memory queue goes with it, that data will be resent to ensure data is all present and accounted for.
In regards to indexer failure, having multiple indexers with auto load balancing, or an indexer cluster, can help safeguard against an indexer failure.
This is explained in one of the documents you sent here: http://docs.splunk.com/Documentation/Splunk/6.4.1/Forwarding/Protectagainstlossofin-flightdata#Why_t...
Only problem is that doesn't gel with real world experience. The data is not getting lost when acknowledgements are disabled.
I have a forwarder on an internal server (host=media) at the office and a development Splunk instance on my laptop (192.168.64.62 the indexer) which receives data from this forwarder.
I was out of the office for 7 days last week. I just checked the _internal logs from this forwarder on my development instance and they are all there even the data that was logged while the indexer was offline/inaccessible.
The default inputs.conf on the UF for internal logs (via btool)
[monitor:///opt/splunkforwarder/var/log/splunk]
_rcvbuf = 1572864
disabled = true
host = media
index = _internal
source = stream
[monitor:///opt/splunkforwarder/var/log/splunk/metrics.log]
_TCP_ROUTING = *
_rcvbuf = 1572864
disabled = true
host = media
index = _internal
source = stream
[monitor:///opt/splunkforwarder/var/log/splunk/splunkd.log]
_TCP_ROUTING = *
_rcvbuf = 1572864
disabled = true
host = media
index = _internal
source = stream
And as you can see, when I look at my indexer all the data is there; nothing was lost. There are about 2000 events per hour, so I would assume any in-memory queue would have been exhausted, even though there are no queue settings in the file monitor stanzas in inputs.conf.
http://imgur.com/byIQpxK
Edit: Even curiouser, those stanzas are marked as disabled but the data is definitely coming through???
Alright I modified my original response so that it is a bit more clear, but let me also expand as I think there are some other things we can clear up here for anyone else that might be reading this.
Regarding the disabled inputs, what did you run to obtain that output? Can you try the following? Since internal logs are coming in, there must be some monitoring going on somewhere:
$SPLUNK_HOME/bin/splunk cmd btool inputs list monitor
As for the data you saw even though your indexer was down for a period of time: Splunk Universal Forwarders also have queues (http://docs.splunk.com/Documentation/Splunk/6.4.1/Deploy/Componentsofadistributedenvironment). The queueSize you brought up is an in-memory queue. Since your indexer was down, the Universal Forwarder's input queue blocked because data couldn't reach the indexer. When the indexer came back up (and since your Universal Forwarder never went down), the data held in memory was able to get out, so you never lost any data. In addition, new data that had never been sent flowed into the input queue and played catch-up for the data that had been missed during the indexer outage.
This is why Splunk has multiple time fields: _indextime (the time the data was actually indexed) and _time (the time of the event). When you look at the timeline in your image, you will notice that your data makes sense in chronological order. However, the time field you don't typically look at is the index time. To see an example of this visually, the following search shows the latency (difference) between indexing time and event time for your splunkd.log events. A high latency means a specific event had a distinct gap between the actual event time and when it got indexed. I bet if you run the following search over that time frame, you'll see a high max latency, indicating that your system was playing "catch up" on events it had missed.
index=_internal source=*splunkd.log host=**<universal_forwarder_hostname>** | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | eval time=_time | eval itime=_indextime | eval latency=(itime - time) |stats count, avg(latency), min(latency), max(latency) by host
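The latency arithmetic in that search is simply `_indextime - _time` per event. A quick sketch of the same stats, using hypothetical epoch values (the third and fourth events model an indexer outage):

```python
# Illustrative only: compute indexing latency the way the SPL search does,
# over made-up (_time, _indextime) epoch pairs for four events.
events = [
    (1466000000, 1466000002),  # indexed 2s after the event occurred
    (1466000060, 1466000061),
    (1466000120, 1466600000),  # indexed much later: indexer was down
    (1466000180, 1466600001),
]

latencies = [indexed - occurred for occurred, indexed in events]
print("count:", len(latencies))
print("avg:", sum(latencies) / len(latencies))
print("min:", min(latencies))
print("max:", max(latencies))  # a large max means the UF was playing catch-up
```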
Also as an added bonus, here is a nice conf presentation on some of the data pipeline stuff, https://conf.splunk.com/session/2014/conf2014_AmritBathJagKerai_Splunk_WhatsNew.pdf
Thanks for the discussion; this is the sort of thing that really should be documented in detail, I would have thought. I've done some tests (results to come later), but first, to your questions.
[root@media ~]# /opt/splunkforwarder/bin/splunk cmd btool inputs list monitor
[monitor:///opt/splunkforwarder/etc/splunk.version]
_TCP_ROUTING = *
_rcvbuf = 1572864
host = media
index = _internal
sourcetype = splunk_version
[monitor:///opt/splunkforwarder/var/log/splunk]
_rcvbuf = 1572864
host = media
index = _internal
[monitor:///opt/splunkforwarder/var/log/splunk/metrics.log]
_TCP_ROUTING = *
_rcvbuf = 1572864
host = media
index = _internal
[monitor:///opt/splunkforwarder/var/log/splunk/splunkd.log]
_TCP_ROUTING = *
_rcvbuf = 1572864
host = media
index = _internal
Thanks for the link on the indexing pipeline. This one is also a great resource for understanding the internals:
http://wiki.splunk.com/Community:HowIndexingWorks
Section 4 deals with a distributed environment.
I have used _indextime quite extensively in the past for alerting on data that is "late" or servers that might have incorrect time syncing.
Now see this discussion is why I would like to find some official information. We both obviously know a lot about the Splunk internals but there are still some mysteries here.
So I did some more testing based on your idea that this data is in a memory queue. Bear with me while I explain.
No apparent data loss. It still appears that the UF sees that the indexer is down and pauses indexing.
http://imgur.com/a/JkvX0
I'll try the same thing overnight after work today, just in case there is some other buffer at play here. 1 hour is a pretty short test.
Again I am more than happy to be wrong in my assumption that the UF pauses for file monitoring (and possibly WinEventLog and perfmon) I just wish they documented it somewhere. They pretty clearly explain what happens with transient data like TCP and scripted inputs but not as much for file monitoring.
Edit:
Forgot to mention: I did a check on the approximate volume of the data being ingested:
index=_internal host=media earliest=-60m | eval eventBytes = len(_raw) | stats sum(eventBytes)
1115952 characters per hour (which should roughly equate to bytes, at least for ASCII log data).
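Assuming len(_raw) is roughly one byte per character for ASCII logs, that rate gives a back-of-the-envelope estimate of how long the default 500KB in-memory queue would take to fill while the indexer is down:

```python
# Back-of-the-envelope: how long until the default 500KB in-memory queue
# fills at the measured ingest rate? (Ignores per-event metadata overhead,
# so the real queue would fill somewhat faster.)
bytes_per_hour = 1_115_952   # sum(len(_raw)) from the search above
queue_bytes = 500 * 1024     # default queueSize = 500KB

minutes_to_fill = queue_bytes / bytes_per_hour * 60
print(round(minutes_to_fill, 1))  # about 27.5 minutes
```

So on a 1-hour outage test at this volume, the queue would already be full well before the hour was up.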
Alright so it sounds like we've cleared up the internal logging question (yay btool!)
I appreciate this discussion as well, and thank you for bringing it up, as some people may overlook this sort of thing. While it's not an issue most of the time, I can see where some people may have this question and seek this type of specificity.
I read through your test case and I think it's good, but let me explain a little more about what you may have experienced. Since the in-memory queue is so small (500KB by default), when you're dealing with a large amount of data that 500KB can be somewhat negligible and hard to see. In your case, when you turned off the UF in step 3, you only lost the up-to-500KB of data sitting in the queue. When you turned it back on in step 5 and ran the search again, the UF had started back up and begun indexing the data that never made it into the queue in the first place. The reason you seemingly saw all of your data is that 500KB is so tiny, and Splunk handles failures quite well. Take a look at the following test case:
I just performed the following:
What this tells us is that anything in memory when the Universal Forwarder fails may be lost. The fact that it is only 500KB is generally negligible when we're talking about gigabytes, terabytes, or even petabytes of data per day. Additionally, 500KB is the maximum amount of data you could lose if the queues are completely blocked for file monitoring. Since Splunk processes data so fast, typically you won't see anywhere near that amount lost unless you're monitoring a LOT of information, in which case you may need to modify other options like maxKBps or switch to a Heavy Forwarder.
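To put that 500KB worst case in perspective, here is a quick comparison against a few hypothetical daily ingest volumes (the volumes are illustrative, not from this thread):

```python
# Worst-case loss (the full in-memory queue) as a fraction of daily
# ingest, for a few hypothetical daily volumes.
max_loss = 500 * 1024  # bytes: the default 500KB in-memory queue

for label, daily_bytes in [
    ("1 GB/day", 1024**3),
    ("100 GB/day", 100 * 1024**3),
    ("1 TB/day", 1024**4),
]:
    pct = max_loss / daily_bytes * 100
    print(f"{label}: worst case {pct:.6f}% of a day's data")
```

Even at a modest 1 GB/day, the worst case is well under a twentieth of a percent of the day's data.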
Nice, that all makes sense. I appreciate the discussion.
When I get the time I will write a script to generate logs so I can see this loss/gap in action. It is not that I don't believe you I just want to see it happening with my own eyes 🙂
As you said, it is a tiny fraction of a loss, but it is good to know that file monitoring does actually pause processing the log until the UF queues are "unblocked". That was my major concern.
It would be great if this was covered clearly in the documentation though. They talk about buffers for the tcp and scripted inputs but nothing is said about the other "types" of inputs.
Thanks again.