Re: Asynchronous forwarding and event duplication

PickleRick · ‎03-09-2023

While trying to solve another problem, I started fiddling with the outputs on my HFs and applied the changes from https://www.linkedin.com/pulse/splunk-asynchronous-forwarding-lightning-fast-data-ingestor-rawat/ along with other tweaks (including lowering an absurdly high output queue size).

The HFs are an intermediate forwarder layer receiving data from several UFs as well as from HEC inputs.

First, I set up outputs.conf like this:

[tcpout]
defaultGroup = my_indexers
forwardedindex.filter.disable = true
indexAndForward = false
useACK=true
maxQueueSize = 2GB
forceTimebasedAutoLB = false
autoLBVolume = 52428800
writeTimeout = 30
connectionTimeout = 10
connectionTTL = 300
heartbeatFrequency = 15

#SSL Settings
useSSL = true
<here are some SSL settings unimportant here>

[tcpout:my_indexers]
<here goes list of my indexers of course>

I then checked how many times a particular event shows up in two relatively lightly utilised indexes (so that I don't kill my indexers by running stats on this). In my case source shows the IP of the source server, so the combination of timestamp, raw event and source should be unique; timestamp and raw on their own can match events from different hosts.

index=<something> 
| eval timeraw=_time."-"._raw.source
| stats list(splunk_server) as splunk_server by timeraw
| eval c=mvcount(splunk_server) 
| stats count by c

It seemed that for the last 15 minutes some 110k events were returned once, around 25k events were returned twice and 10 events were returned three times.
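For readers less familiar with SPL, the counting logic of the search above can be sketched in Python. This is only an illustration on made-up events; the function name `copies_histogram` and the sample data are hypothetical and do not come from the thread.

```python
from collections import Counter

def copies_histogram(events):
    """events: iterable of (time, raw, source, splunk_server) tuples.

    Returns a Counter mapping "number of copies" -> "how many distinct
    events have that many copies", like `stats count by c` in the search.
    """
    # Key on (time, raw, source); each occurrence is one indexed copy.
    per_event = Counter((t, raw, src) for t, raw, src, _server in events)
    return Counter(per_event.values())

# Made-up sample data, for illustration only.
sample = [
    (1678350000, "login ok", "10.0.0.1", "idx1"),
    (1678350000, "login ok", "10.0.0.1", "idx2"),  # same event on a second indexer
    (1678350000, "login ok", "10.0.0.2", "idx1"),  # same text, different source
    (1678350005, "logout", "10.0.0.1", "idx3"),
]
hist = copies_histogram(sample)
print(dict(hist))
```

A key of 1 means the event was indexed once; any key above 1 points to duplicates.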

While fiddling around with the settings I lowered autoLBVolume by two orders of magnitude to just 524288 (each HF has two pipelines and handles around 1MBps of traffic, so the calculation pretty much conforms to that LinkedIn article). And magically the duplicates don't seem to be showing up in the logs anymore. But can someone please tell me why? Why would a chunk of data be sent to multiple indexers when I had a bigger autoLBVolume? I don't get it.

Tom_Lundie · ‎03-09-2023

Throwing my two cents in: I'm not 100% sure, but I suspect that this could be down to indexer acknowledgement.

I understand that for an indexer to return an acknowledgment, it must handle the event completely, including the replication to other indexers, meaning it is nowhere close to an instantaneous process.

Your autoLBVolume is around 50MB but the article suggests it should be:

autoLBVolume = <average_kbps/ingest pipeline>

Which would be 1MBps / 2 = 0.5MBps per pipeline, i.e. roughly 524288 bytes.

The goal here is to keep your pipelines constantly switching indexers ~ every second. Having an autoLBVolume 100x larger means that you're essentially loading up any given indexer with data for around 100x longer than the article is expecting.
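The arithmetic above can be sketched as follows. This is a rough illustration, assuming "1MBps" means 1 MiB/s and that the sizing rule is simply total throughput divided by the number of pipelines; the helper name `suggested_auto_lb_volume` is hypothetical.

```python
MIB = 1024 * 1024

def suggested_auto_lb_volume(total_bytes_per_sec, pipelines):
    """Bytes one pipeline sends in roughly one second (assumed sizing rule)."""
    if pipelines < 1:
        raise ValueError("pipelines must be >= 1")
    return int(total_bytes_per_sec / pipelines)

# Figures taken from the thread: ~1MBps per HF, two pipelines.
suggested = suggested_auto_lb_volume(1 * MIB, 2)
configured = 52428800  # autoLBVolume from the original outputs.conf
ratio = configured / suggested

print(suggested)  # 524288
print(ratio)      # 100.0
```

Under these assumptions the originally configured value is 100 times the suggested one, which matches the "100x" figure above.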

This is enough to start filling the large waitQueue (3x larger than maxQueueSize) before the autoLB strategy kicks in and cuts over to the next indexer.

I'm not sure exactly, but if you see the indexing queue on your indexers growing significantly with the larger autoLBVolume, that would indicate an issue with completing the event handling and returning the ACK.

PickleRick · ‎03-09-2023

And here is the plot twist 😉 There is no replication downstream.

It's most probably due to the outputs working a bit differently than I expected (most probably waiting for filling up the queue with default autoBatch=true setting). I'll have to do some debugging later.

hrawat_splunk · ‎04-26-2023

"(most probably waiting for filling up the queue with default autoBatch=true setting)."

autoBatch does not wait. It transmits multiple events in one TCP payload if there is more than one outstanding event in the tcpout queue. If there is just one outstanding event in the tcpout queue, it transmits that one event in one TCP payload.

Without autoBatch, it's always one event per TCP payload, regardless of how many events are outstanding in the tcpout queue.

hrawat_splunk · ‎04-26-2023

With autoLBVolume set to 50MB, asynchronous forwarding is not on. You wrote:

"While fiddling around with the settings I lowered the autoLBVolume by two orders to just 524288 (each HF has two pipelines and handles around 1MBps traffic so the calculation was pretty much conforming to that LinkedIn article. And magically duplicates don't seem to be showing up in logs. "

That means that at around 1MBps it takes about 50 sec to hit the 50MB autoLBVolume limit. But the default autoLBFrequency is 30 sec, so it was likely autoLBFrequency that was applied, not autoLBVolume.
However, you may want to check whether there are
"WARN TcpOutputProc - Possible duplication of events with" logs on the forwarder side.
Duplication of events is not related to async or sync forwarding. It could be caused by pausing on the indexers; see https://community.splunk.com/t5/Splunk-Enterprise/The-index-processor-has-paused-data-flow-How-to-op...
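
The "50 sec versus 30 sec" comparison can be sketched like this. It is a simplified model, not a description of the forwarder's internals: it assumes a steady send rate, treats "1MBps" as 1 MiB/s, and assumes the output switches indexers on whichever limit is reached first. The function name `first_lb_trigger` is hypothetical.

```python
MIB = 1024 * 1024

def first_lb_trigger(volume_bytes, bytes_per_sec, frequency_sec=30.0):
    """Return (setting_name, seconds) for the limit reached first.

    Assumption of this sketch: a non-positive volume or rate means the
    volume limit never fires, so only the time-based limit applies.
    """
    if volume_bytes <= 0 or bytes_per_sec <= 0:
        return ("autoLBFrequency", frequency_sec)
    secs_to_volume = volume_bytes / bytes_per_sec
    if secs_to_volume < frequency_sec:
        return ("autoLBVolume", secs_to_volume)
    return ("autoLBFrequency", frequency_sec)

# Values from the thread, at ~1 MiB/s.
big = first_lb_trigger(52428800, 1 * MIB)   # 50 s to fill, 30 s timer wins
small = first_lb_trigger(524288, 1 * MIB)   # 0.5 s to fill, volume wins

print(big)    # ('autoLBFrequency', 30.0)
print(small)  # ('autoLBVolume', 0.5)
```

If the rate is taken per pipeline (about 0.5 MiB/s) rather than per HF, the 50MB limit takes even longer to reach, so the conclusion for the larger setting does not change.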

