According to the doc here:
http://docs.splunk.com/Documentation/Splunk/6.3.3/Forwarding/Setuploadbalancingd
Important: Universal forwarders are not able to switch indexers when monitoring TCP network streams of data (including Syslog) unless an EOF is reached or an indexer goes down, at which point the forwarder will switch to the next indexer in the list. Because the universal forwarder does not parse the data and identify event boundaries before forwarding the data to the indexer (unlike a heavy forwarder), it has no way of knowing when it's safe to switch to the next indexer unless it receives an EOF.
We would like to know what exactly is Splunk UF looking for to determine EOF?
Additional info:
Currently, our app sending to the UF's TCP port does not use an EOF marker. This causes the UF to send data to the same indexer since it cannot switch to another indexer. As a result, we set forceTimebasedAutoLB=true to force the UF to switch indexers. However, our tests show that the UF fails to send events when this is set. For example, the following configuration:
autoLB = true
autoLBFrequency = 5
forceTimebasedAutoLB = true
results in approximately 80% event loss when forwarding events received on the TCP port at a rate of 1 event per second. From our testing:
autoLB=F forceTimebasedAutoLB=F -> okay
autoLB=F forceTimebasedAutoLB=T -> dropped events
autoLB=T forceTimebasedAutoLB=T -> dropped events
autoLB=T forceTimebasedAutoLB=F -> okay
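For context, these settings live in a tcpout stanza of outputs.conf on the forwarder. A minimal sketch of the combination that tested okay, assuming a hypothetical group name and hypothetical indexer addresses (neither is taken from our actual setup):

```ini
# outputs.conf on the universal forwarder (sketch)
# "my_indexers" and both server addresses are hypothetical placeholders.
[tcpout]
defaultGroup = my_indexers

[tcpout:my_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
autoLB = true
autoLBFrequency = 5
# Left off here because enabling it dropped events in the tests above.
forceTimebasedAutoLB = false
```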
At the moment, we have worked around the issue by configuring our app to pause every 15 seconds so that the UF will send the Done-key. We have also disabled forceTimebasedAutoLB.
Btw, FYI: according to Splunk support, if a sourcetype is set, the TCP client should not need to pause; it can send data to the forwarder continuously. When the sourcetype is not set correctly, there can be considerable event loss. If you have a sourcetype set but are still experiencing event drops, please contact Splunk support and refer to SPL-117189.
When reading a file, the forwarder can only switch indexers when it hits EOF. For TCP, it’s when the forwarder does not get data on a port for 10 seconds (the default rawTcpDoneTimeout value): http://blogs.splunk.com/2014/03/18/time-based-load-balancing
Adjust this inputs.conf parameter value according to your requirements:
http://docs.splunk.com/Documentation/Splunk/latest/Admin/inputsconf
rawTcpDoneTimeout = <seconds>
* Specifies timeout value for sending Done-key.
* If a connection over this port remains idle after receiving data for
specified seconds, it adds a Done-key, thus declaring the last event has been
completely received.
* Defaults to 10 second.
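As a rough illustration, the setting goes in the TCP input stanza on the forwarder. The port number, the sourcetype name, and the 5-second value below are assumptions made for the example, not recommendations:

```ini
# inputs.conf on the universal forwarder (sketch)
# Port 9999 and the sourcetype name are hypothetical.
[tcp://:9999]
sourcetype = my_app_events
# Send the Done-key after 5 idle seconds instead of the default 10.
rawTcpDoneTimeout = 5
```

A shorter timeout only helps if the sender actually goes quiet for at least that long.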
To answer this question:
We would like to know what exactly is Splunk UF looking for to determine EOF?
...the context here is specific to reading files from disk with a [monitor] or [batch] data input.
Here's how splunkd decides that it has truly hit EOF (end-of-file) for a file it is reading: after reaching the end of the file, it keeps the file open and waits time_before_close seconds for further writes. If new data shows up, it reads that data and again waits time_before_close seconds after the new EOF is found.
Do note that this really only applies to Universal Forwarders, as they perform no parsing on the data they read and therefore have to wait for a "true EOF" in order to close a file-based data stream safely (i.e., without risking cutting events in half).
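The wait is configurable per input. A minimal sketch, assuming a hypothetical log path; 3 seconds is the documented default and is shown explicitly only for illustration:

```ini
# inputs.conf (sketch); the monitored path is hypothetical.
[monitor:///var/log/my_app/app.log]
# Seconds to wait for further writes after reaching EOF before closing the file.
time_before_close = 3
```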
The universal forwarder is looking for an actual end-of-file marker in the data stream. The EOF tells the forwarder that it's okay to move to the next available receiving indexer in the list, if it's set up for load balancing.
As the documentation text says, and as your tests prove, the universal forwarder has no idea how to handle a break in a data stream. If the break is forced, for example when the other end of the socket closes the connection, it times out and then switches to the next indexer in the list. By using forceTimebasedAutoLB, you are in essence forcing a break in the stream, and that break likely occurs in the middle of an event. You will see this phenomenon as dropped events.
If you can't make your app send EOF, then you might need to use a heavy forwarder so that you can tell it exactly where the event breaks are. In fact, you might want to do this as a test anyway so that you can see how Splunk Enterprise treats your events. Our Getting Data In manual has additional information on training Splunk Enterprise to recognize event breaks and what to do when it encounters those patterns.
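As a starting point for that test, event breaking on a heavy forwarder is controlled in props.conf. A minimal sketch, assuming a hypothetical sourcetype name and one event per line; adjust the pattern to your actual data:

```ini
# props.conf on the heavy forwarder (sketch)
# "my_app_events" is a hypothetical sourcetype name.
[my_app_events]
# Treat each line as a complete event rather than merging lines.
SHOULD_LINEMERGE = false
# Break events at one or more newline characters.
LINE_BREAKER = ([\r\n]+)
```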