Splunk Universal Forwarder and TCP Data: What exac...

tsunamii · 04-04-2016

According to the doc here:
http://docs.splunk.com/Documentation/Splunk/6.3.3/Forwarding/Setuploadbalancingd

Important: Universal forwarders are not able to switch indexers when monitoring TCP network streams of data (including Syslog) unless an EOF is reached or an indexer goes down, at which point the forwarder will switch to the next indexer in the list. Because the universal forwarder does not parse the data and identify event boundaries before forwarding the data to the indexer (unlike a heavy forwarder), it has no way of knowing when it's safe to switch to the next indexer unless it receives an EOF.

We would like to know what exactly is Splunk UF looking for to determine EOF?

Additional info:
Our app, which sends to the UF's TCP port, does not currently send an EOF marker, so the UF cannot switch indexers and keeps sending all data to the same one. As a result, we set forceTimebasedAutoLB=true to force the UF to switch indexers. However, our tests show that the UF fails to send events when this is set. For example, the following configuration:

autoLB = true
autoLBFrequency = 5
forceTimebasedAutoLB = true

results in approximately 80% event loss when sending events received via TCP port at a rate of 1 event per second. From our testing:

autoLB=F forceTimebasedAutoLB=F -> okay
autoLB=F forceTimebasedAutoLB=T -> dropped events
autoLB=T forceTimebasedAutoLB=T -> dropped events
autoLB=T forceTimebasedAutoLB=F -> okay

tsunamii · 04-27-2016

At the moment, we have worked around the issue by configuring our app to pause every 15 seconds so that the UF will send the Done-key. We have also disabled forceTimebasedAutoLB.

Btw, FYI: according to Splunk support, if a sourcetype is set, the TCP client should not need to pause. It can send data to the forwarder continuously. When the sourcetype is not set correctly, significant event loss can occur. If you have a sourcetype set but are still experiencing event drops, please contact Splunk support and refer to SPL-117189.
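
On the sending side, the pause-based workaround described in this post might look roughly like the sketch below. It is only an illustration, not the poster's actual code: the function name, the newline-terminated event format, and the 11-second idle period (chosen to exceed the 10-second default rawTcpDoneTimeout quoted elsewhere in this thread) are all assumptions.

```python
import time


def send_with_idle_gaps(events, send, sleep=time.sleep, clock=time.monotonic,
                        burst_seconds=15.0, idle_seconds=11.0):
    """Send events, going idle after each burst so the receiver can time out.

    `send` is any callable that accepts bytes (for example a socket's sendall).
    `burst_seconds` and `idle_seconds` are illustrative values, not Splunk settings.
    """
    burst_start = clock()
    for event in events:
        # After roughly `burst_seconds` of sending, stay silent long enough
        # for the receiving side's idle timeout to expire.
        if clock() - burst_start >= burst_seconds:
            sleep(idle_seconds)
            burst_start = clock()
        send(event.encode("utf-8") + b"\n")
```

With a real connection, `send` could be `sock.sendall` on a socket opened with `socket.create_connection((host, port))`, where host and port are those of the UF's TCP input.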

splunkIT · 04-06-2016

When reading a file, this can only be done when the forwarder hits EOF. For TCP, it’s when the forwarder does not get data on a port for 10 seconds (default rawTcpDoneTimeout value): http://blogs.splunk.com/2014/03/18/time-based-load-balancing

Adjust this inputs.conf parameter value according to your requirements:
http://docs.splunk.com/Documentation/Splunk/latest/Admin/inputsconf

rawTcpDoneTimeout = <seconds> 
* Specifies timeout value for sending Done-key. 
* If a connection over this port remains idle after receiving data for 
specified seconds, it adds a Done-key, thus declaring the last event has been 
completely received. 
* Defaults to 10 seconds.
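
To make the idle-timeout rule concrete, here is a small Python model of it. It is a simplified sketch based only on the description above, not Splunk's implementation; the function name is hypothetical, and it assumes a Done-key fires only when the gap between two arrivals strictly exceeds the timeout.

```python
def done_key_times(arrival_times, raw_tcp_done_timeout=10.0):
    """Return the times at which an idle timeout would fire between arrivals.

    `arrival_times` is a sorted list of seconds at which data reached the port.
    """
    fired = []
    for previous, following in zip(arrival_times, arrival_times[1:]):
        # The connection counts as idle only if nothing arrives for longer
        # than the timeout after the previous piece of data.
        if following - previous > raw_tcp_done_timeout:
            fired.append(previous + raw_tcp_done_timeout)
    return fired
```

Under this model, a client sending one event per second never leaves a gap long enough, so `done_key_times(list(range(60)))` returns an empty list, which matches the behavior described in the original question.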

hexx · 04-05-2016

To answer this question:

We would like to know what exactly is Splunk UF looking for to determine EOF?

...the context here is specific to reading files from disk with a [monitor] or [batch] data input.

Here's how splunkd decides that it has truly hit EOF (end-of-file) for a file it is reading:

1. splunkd reads until the filesystem indicates that the end of the file has been reached.
2. splunkd backs off for 3 seconds (configured in inputs.conf / time_before_close).
3. splunkd checks the end of the file again. If it hasn't moved, splunkd considers that it has truly hit EOF and can move on from the file. Most notably, it can now end the specific data stream sending events from this file to an indexer.
4. If EOF has moved, the new data is read and splunkd will again wait for time_before_close seconds after the new EOF is found.
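
The steps above can be approximated in a few lines of Python. This is a sketch of the described logic only, not splunkd's code; the function name, the `handle_chunk` callback, and the size-versus-offset check are assumptions made for illustration.

```python
import os
import time


def read_until_true_eof(path, handle_chunk, time_before_close=3.0,
                        sleep=time.sleep, chunk_size=65536):
    """Read a file, treating EOF as final only if it has not moved after a wait."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if chunk:
                handle_chunk(chunk)
                continue
            # The filesystem reports EOF: back off, then check again.
            sleep(time_before_close)
            if os.fstat(f.fileno()).st_size <= f.tell():
                # EOF has not moved, so the stream can be closed safely.
                return
            # EOF moved: loop to read the new data and wait again afterwards.
```

The `sleep` parameter is injectable so the wait can be replaced in tests; in normal use it simply defaults to `time.sleep`.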

Do note that this really only applies to Universal Forwarders, as they perform no parsing on the data they read and therefore have to wait for a "true EOF" in order to close a file-based data stream safely (i.e., without risking cutting events in half).

malmoore · 04-05-2016

The universal forwarder is looking for an actual end-of-file marker in the data stream. The EOF tells the forwarder that it's okay to move to the next available receiving indexer in the list, if it's set up for load balancing.

As the documentation text says, and as your tests prove, the universal forwarder has no idea how to handle a break in a data stream. If the break is forced, for example when the other end of the socket closes the connection, the forwarder times out and then switches to the next indexer in the list. By using forceTimebasedAutoLB, you are in essence forcing a break in the stream, and that break likely occurs in the middle of an event. You will see this phenomenon as dropped events.
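
A toy Python example may help show why a forced break is risky for unparsed data. It only models cutting a raw byte stream at an arbitrary offset and sending each half to a different indexer; the function name and sample events are made up, and how an indexer actually treats the resulting fragments is not something this sketch claims to reproduce.

```python
def lines_per_indexer(stream, cut_at):
    """Split a raw byte stream at `cut_at` and list the lines each side receives."""
    parts = (stream[:cut_at], stream[cut_at:])
    # Each indexer only sees its own part, so it breaks lines within that part.
    return [[line for line in part.split(b"\n") if line] for part in parts]


stream = b"event one\nevent two\nevent three\n"
first_indexer, second_indexer = lines_per_indexer(stream, 14)
# first_indexer  == [b"event one", b"even"]
# second_indexer == [b"t two", b"event three"]
```

Here "event two" never arrives intact at either destination, because the cut landed in the middle of it.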

If you can't make your app send EOF, then you might need to use a heavy forwarder so that you can tell it exactly where the event breaks are. In fact, you might want to do this as a test anyway so that you can see how Splunk Enterprise treats your events. Our Getting Data In manual has additional information on training Splunk Enterprise to recognize event breaks and what to do when it encounters those patterns.
