Getting Data In

UF on Syslog Server - Enqueuing Large Files - Logs Delayed

ahmadgul21
Explorer

Hi,

I would like some advice on how best to resolve an issue with delayed logs being sent from our syslog server via a Universal Forwarder.

Situation Details:

We have a central syslog server (Linux-based) that handles all network device logs and stores them locally. The average daily log volume on the syslog server is approximately 800 GB to 1 TB. Retention policies are configured so that only 2 days of logs are kept locally on the syslog server.

To onboard the logs into Splunk, we have deployed a Universal Forwarder on the syslog server and applied the following configuration to optimize it:

server.conf

 

[general]
parallelIngestionPipelines = 2

[queue=parsingQueue]
maxSize = 10MB

 

limits.conf

 

[thruput]
maxKBps = 0

 

However, even with these settings, we are seeing many messages like the following in splunkd.log:

 

WARN Tail Reader - Enqueuing a very large file=/logs/.... in the batch reader, with bytes_to_read=xxxxxxx, reading of other large files could be delayed

 

The effect is that although logs are being forwarded to Splunk, there is a large delay between when the logs are received on the syslog server and when they are indexed/searchable in Splunk. When investigating, we see many dips in the logs received in Splunk; these dips usually recover once the delayed data from the UF arrives. In some cases, however, logs are missed entirely: by the time the UF reaches a log file, 2 days have passed and the file has already been removed by the retention policy.
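
For reference, one way to quantify the lag is to compare each event's index time with its timestamp. A minimal search sketch, assuming the network data lands in an index named "network" (placeholder):

index=network earliest=-24h
| eval lag_sec = _indextime - _time
| stats avg(lag_sec) AS avg_lag perc95(lag_sec) AS p95_lag max(lag_sec) AS max_lag by host
| sort - max_lag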

Would appreciate any helpful advice on how to go about addressing this issue. Thank you. 

1 Solution

richgalloway
SplunkTrust

I suspect you are overwhelming your Splunk infrastructure with 800GB/day.  How many indexers are ingesting all that data?  I hope it's at least 3.

Do you see any log messages saying the UF has paused output?  That would be a sign that the indexers can't keep up.
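
Something like this on the syslog server should surface them (assuming the default UF install path; the exact message text varies a bit by version):

grep -iE "paused the data flow|blocked for" /opt/splunkforwarder/var/log/splunk/splunkd.log | tail -20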

Consider running multiple syslog servers, each with a UF, to help distribute the load.  If that's not possible then consider running multiple UFs on the one syslog server, with each UF monitoring a separate file/directory.
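
As a rough sketch of the multiple-UF layout (the directory names and index here are placeholders; the important part is that the monitor stanzas do not overlap):

# UF #1 - /opt/splunkforwarder/etc/system/local/inputs.conf
[monitor:///logs/firewalls]
index = network

# UF #2 - /opt/splunkforwarder2/etc/system/local/inputs.conf
[monitor:///logs/switches]
index = network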

---
If this reply helps you, Karma would be appreciated.


ahmadgul21
Explorer

There are 5 indexers that the syslog server UF is forwarding to. 

I do see messages in splunkd.log like this:

TailingProcessor - Could not send data to output queue (parsingQueue), retrying...

We are looking at the multiple-syslog-servers option, but could you tell me more about running multiple UFs on a single syslog server? Each additional splunkd process would consume CPU resources, so if we increase the server's specs, could we proceed with the multiple-UF option?

 


richgalloway
SplunkTrust

Have you looked at the indexer queues to see if they are blocked?
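
A quick way to check is the standard metrics.log data in _internal (run from a search head that can see the indexers):

index=_internal source=*metrics.log* group=queue blocked=true
| stats count by host, name
| sort - count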

Have you set maxKBps=0 on the UFs?

Yes, when running multiple UFs on a single server each UF consumes additional resources so that's only an option if sufficient resources are available.

One also must be careful to configure each UF so they are independent: separate installation directories, separate ports, and (most important) monitor separate files/directories.
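
For the port part, this is roughly what would differ on the second UF (8090 is just an example of a free port; the default management port is 8089):

# /opt/splunkforwarder2/etc/system/local/web.conf
[settings]
mgmtHostPort = 127.0.0.1:8090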

---
If this reply helps you, Karma would be appreciated.

ahmadgul21
Explorer

Hi @richgalloway ,

An update: we have been trying different options, such as increasing the parsingQueue maxSize from 10 MB to about 100 MB per our conversations with support. However, the delay is still there.

Coming to your two options: (a) running multiple syslog servers, each with a UF, to distribute the load is being considered, but it will take time; so I'm thinking of first trying your second option, (b) running two UFs on the single syslog server with each UF monitoring separate directories.

What I'm most curious and apprehensive about with option (b) is this: if a new UF is installed on the syslog server and starts monitoring a directory that was previously monitored by the existing UF, how do we avoid re-indexing/duplicating events? Some events from that directory were already onboarded by the existing UF, and because of the delay the rest had not yet been onboarded at the point where we disable monitoring on the existing UF and enable it on the new one. Is there a way to avoid re-indexing/duplication of events (if my understanding is correct and this scenario can occur)? To add context, the directory contains multiple log files, one for each hour of the day, written as events arrive from the network devices.

Thank you for your time and advice, much appreciated. 


richgalloway
SplunkTrust

Good question.  The UF keeps track of its place in each file it monitors.  Installing a brand-new UF of course would not have that information and so is likely to re-ingest data.  Avoid that by cloning the existing UF into a separate directory and then modifying the inputs.conf files as desired.

# copy the existing UF installation, preserving ownership and timestamps
cp -pr /opt/splunkforwarder /opt/splunkforwarder2
# then, from the clone's bin directory, clear the instance-specific settings
splunk clone-prep-clear-config
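
The copy also brings along the fishbucket (under var/lib/splunk/fishbucket), which is where the UF records how far it has read into each file, so the clone should pick up roughly where the original left off. As I understand it, clone-prep-clear-config resets the instance identity (GUID, server name) rather than that read state, but it's worth verifying on a small test directory first.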
---
If this reply helps you, Karma would be appreciated.

ahmadgul21
Explorer

Hi @richgalloway ,

Yes, the maxKBps value has been set to zero.

Further, a case was opened with support. We shared that we were getting lots of "Enqueuing a very large file ... with bytes_to_read" warnings, and they suggested increasing the relevant setting to a value larger than the file size as a workaround for the delays.

Response by support
By default, when forwarding large files, Splunk stops using the tail reader for ingestion and passes the file to the batch reader for forwarding. The batch reader's threshold defaults to 20,971,520 bytes; files above that size are handled by the batch reader, which produces the enqueuing warnings.

Workaround

To overcome this, we increased min_batch_size_bytes in limits.conf to a value larger than the file size so that the files are handled by the tailing processor instead of the batch reader (which was causing the hold-up); this resolved the situation.

Current contents of the limits.conf file:

[thruput]
maxKBps = 0

[default]
min_batch_size_bytes = 1000000000

Even now we are still getting the "Enqueuing a very large file" warnings in splunkd.log, because some of the log files on the syslog server are almost 17 GB in size.

I'm thinking of increasing min_batch_size_bytes to maybe 10 GB; do you have any suggestions regarding this approach?


richgalloway
SplunkTrust

If you have files up to 17GB in size then I'd set min_batch_size_bytes to 17GB or larger.  This is outside my experience, however, so I can't say if it will help or not.
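
If you go that route, the change would look something like this (20000000000 bytes is roughly 18.6 GiB, i.e. just above the largest file you mentioned; I've kept the [default] stanza from your support workaround):

[thruput]
maxKBps = 0

# must exceed the largest monitored file (~17 GB here)
[default]
min_batch_size_bytes = 20000000000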

---
If this reply helps you, Karma would be appreciated.


ahmadgul21
Explorer

I've accepted this as the solution since distributing the load does resolve the issue. I was originally looking for a workaround to avoid that, so I kept the thread open and then forgot about it. For anyone else having the same issue: we ended up dividing the load across two syslog servers, and the problem has been greatly reduced. There is still some delay at times, but that is probably because the load during peak hours is sometimes too much even for two syslog servers.
