Getting Data In

Why are there Delay/Missing logs from UF source?

BuzzLights10
Explorer

Hello all,

I have a clustered indexer and SH environment.

I'm now noticing that there's a long delay in some of my data showing up. I can see that the logs are being continuously generated at the source but they do not show up in Splunk until a long time later.

Some items I'm not able to search on until the next day. The UF is set to monitor a directory, reading all .log files and sending them to Splunk. There are no permission issues and no firewall blocks either.

Additionally, the exact same configurations seem to work on my QA servers but not on the prod ones. The biggest difference between the two is log volume, roughly 1:600 QA to prod. Also, each file is set to roll over to an archive once it hits 50 MB.
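For reference, the input is roughly the following (the path, index, and sourcetype here are placeholders, not the real names):

# inputs.conf on the UF (placeholder values)
[monitor:///var/log/myapp/*.log]
index = main
sourcetype = myapp_logs
disabled = false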

Does this have something to do with the zipping/archiving? Is Splunk unable to read while other processes are writing to the file, so the file reaches the limit and gets zipped before Splunk can do anything? Or is this something in the pipelines or limits.conf?

All help is appreciated.

1 Solution

Roy_9
Motivator

@BuzzLights10 We noticed there were blocked queues after going through the doc below: https://conf.splunk.com/files/2019/slides/FN1570.pdf

We realized we needed to raise the throughput ([thruput] maxKBps in limits.conf), and setting it to 2048 KBps eventually resolved the issue. This change doesn't affect any other logging.
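On the forwarder the change was essentially this; treat the exact number as something to tune for your own volume:

# $SPLUNK_HOME/etc/system/local/limits.conf on the UF
[thruput]
# default on a Universal Forwarder is 256 KBps; 0 removes the limit
maxKBps = 2048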

Hope this info helps.


BuzzLights10
Explorer

@Roy_9 Thanks a lot for the pdf link, that's very helpful.

We increased the throughput limit (maxKBps) in limits.conf and also raised maxSize in the parsingQueue stanza in server.conf.
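Roughly what we set on the forwarders, in case it helps someone later (the parsingQueue size below is just an example, not a recommendation):

# limits.conf
[thruput]
maxKBps = 2048

# server.conf
[queue=parsingQueue]
# example value, raised from the default
maxSize = 10MB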

As of now, we can see a larger volume of data from the servers we raised the limits on, and there has been no delay in the past 12 hours either. We will probably do the same for the rest of them as well.

Thanks for your answers everyone!!

 


BuzzLights10
Explorer

Thank you for all the answers! 

In our last session we identified that these log files were hitting the size limit and getting rolled over to a .txt file every 30 minutes, which later got archived every 24 hours. We are currently not monitoring any .txt files, so we're going to try adding those first (sketch below).
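Something like this is what we're going to try first (path and index are placeholders):

# inputs.conf on the UF
[monitor:///var/log/myapp]
# match both the live .log files and the rolled-over .txt files
whitelist = \.(log|txt)$
index = main
disabled = false

My understanding is that since a rolled file is the same content under a new name, the UF's file tracking should continue from where it left off rather than re-index it, but that is part of what we want to confirm.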

@isoutamo the block errors we saw were few and far between, but if this doesn't solve it, then we will look into increasing the throughput in limits.conf.

@Roy_9 Thanks for the answer. Is there any way you were able to determine the optimal value without overprovisioning maxKBps in limits.conf, or is trial and error the only way? Also, would any other logging be affected by this change?

@PickleRick no whitelist/blacklist rules are in play at the moment; we will probably change the throughput in limits.conf next.

Will keep you updated on this. Thanks again for the answers!


Roy_9
Motivator

@BuzzLights10 I faced a similar issue; increasing maxKBps to 2048 KBps in limits.conf resolved it.

 

  • You can work in increments, for example 1024 KBps, then 2048 KBps, and so on, until you no longer see a large delay in the indexing of the events (one way to measure that delay is sketched below).
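One way to check the effect of each increment is to compare index time with event time for the affected data, e.g. (index and sourcetype are placeholders):

index=main sourcetype=myapp_logs earliest=-1h
| eval lag_seconds = _indextime - _time
| stats avg(lag_seconds) AS avg_lag, max(lag_seconds) AS max_lag BY host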

PickleRick
SplunkTrust
SplunkTrust

Hard to say anything without a more detailed description (i.e. the relevant configuration parts, especially the inputs).

Wild guess: maybe you have whitelists/blacklists set so that the files only get ingested after getting rotated? (Illustration below.)
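Purely as an illustration of what I mean (not something I'm claiming is in your config), a whitelist like this would only match rotated files, so nothing gets picked up until rotation happens:

# inputs.conf
[monitor:///var/log/myapp]
# matches app.log.1, app.log.2, ... but never the live app.log
whitelist = \.log\.\d+$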

If there is a huge volume of events generated, you might look into throughput settings in limits.conf.


isoutamo
SplunkTrust
SplunkTrust

Hi

As @PickleRick suspects, you probably have so many logs on production that new ones are continuously being generated before the UF can send everything to the indexers. I suppose you have the MC (Monitoring Console) in use in your environment? From there you can check the situation on the UF side, e.g. whether it is continuously sending at 256 KBps (the default throughput limit). You can also get more information about blocking queues etc. by querying _internal (you can find suitable queries in the community or in some .conf presentations; a sketch is below). If that is the situation, you should increase the throughput in limits.conf. Also, if you have a huge number of different log files, you probably need to add more pipelines to the UF and make sure you have enough file handles reserved at the Linux level.
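E.g. something along these lines against _internal should show whether the UF's queues are blocking and whether it is sitting at the throughput cap (replace the host filter with your forwarders; field names are as they appear in metrics.log):

index=_internal source=*metrics.log* host=my-uf-host group=queue blocked=true
| stats count BY host, name

index=_internal source=*metrics.log* host=my-uf-host group=thruput name=thruput
| timechart span=5m avg(instantaneous_kbps) BY host

And if one pipeline is not enough, the setting I mean is this one:

# server.conf on the UF
[general]
parallelIngestionPipelines = 2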

r. Ismo
