Monitoring Splunk

Any Advice on Monitoring Universal Forwarder performance?

Skeer-Jamf
Path Finder

So we have roughly a dozen UF hosts across on-prem and cloud, all forwarding data directly to SplunkCloud. I've had reports from other teams about sizable gaps in the data when they run searches.

For example, a query like index=it_site1_network over the last 2 hours currently shows two large gaps of about 25 minutes each.

Before you ask about the activity level on this index, it's very high; there should be a few thousand events every minute.
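For reference, a per-minute timechart of that index shows both the event rate and the gaps at a glance:

index=it_site1_network earliest=-2h
| timechart span=1m count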

I've checked $SPLUNK_HOME/var/log/splunk/splunkd.log to confirm that the monitored files are indeed being monitored, and overall system resource utilization (CPU, memory, disk, network) is very low.

My question: is metrics.log the only place to look for issues that could cause something like this?

0 Karma

PickleRick
SplunkTrust
SplunkTrust

A few thousand events every minute is not that high 🙂 I've seen worse.

But seriously: you mention gaps in your events. Do those gaps fill in eventually or not? Because data loss is something completely different from delay.

Also - what kinds of inputs do you have? And what limits do you have set on your UFs?

0 Karma

Skeer-Jamf
Path Finder

I was able to narrow down the primary input, which accounts for 42M of the roughly 47.6M events this host is trying to process.

Rsyslog is set up to receive data on 12 custom ports. Rulesets send that data to individual log files using individual queues. Then I have a custom 'app' with an inputs.conf monitoring each rsyslog file. The inputs file is pretty generic: index name, sourcetype, file/directory path, and that's it.
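Roughly what one port's worth of that looks like - the port, file path, queue size, and sourcetype below are placeholders rather than the real values:

# rsyslog: one listener per device group, each bound to its own ruleset and queue
module(load="imtcp")
input(type="imtcp" port="5141" ruleset="site1_network")
ruleset(name="site1_network" queue.type="linkedList" queue.size="100000") {
    action(type="omfile" file="/var/log/remote/site1_network.log")
}

# inputs.conf in the custom app: one monitor stanza per rsyslog output file
[monitor:///var/log/remote/site1_network.log]
index = it_site1_network
sourcetype = site1:network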

To be honest, I don't know enough to know what settings would help a high-traffic server, which is why I was looking for troubleshooting help.

0 Karma

PickleRick
SplunkTrust
SplunkTrust

Again - is it delay or holes? Do they fill in eventually? Maybe you're losing events even before they reach your rsyslog.

0 Karma

Skeer-Jamf
Path Finder

(Sorry, didn't mean to ignore that part of the question. I was also looking at a different Splunk issue with a co-worker.)

 

Performing a query: index=<problem_index> over 'Last 24 hours' returned 4,248,841 events.

There are two entire hours in the graph that are empty. The hour before the first empty one contains 770,044 events; then comes the empty hour, the next hour has 236,000, then another empty hour, and the following 12 hours continue more or less as usual, averaging around 200k. Finally, the last three hours (9, 10, and 11 PM) are all empty.

If I grep through the log file this stanza is set to monitor, there are plenty of events with a timestamp of `06:xx:xx` (for example).

I tried querying the same index for the specific timestamp: 23-07-20T06:10:00Z

No results. On a whim I changed the timeframe to 'Last 7 days' - still nothing. So there are definitely events in the log file that are not making it into SplunkCloud.
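For reference, the search was essentially this, with a narrow window bracketing that timestamp (earliest/latest use Splunk's %m/%d/%Y:%H:%M:%S format and are evaluated in the search time zone):

index=<problem_index> earliest="07/20/2023:06:09:00" latest="07/20/2023:06:11:00"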

 

0 Karma

PickleRick
SplunkTrust
SplunkTrust

OK. So as I understand it, there are - at least judging by time alone - events which should have been ingested but weren't. But that might not be the whole picture.

I'd sample some events from the files that supposedly weren't ingested and check whether they were indexed after all, just with a different timestamp. That would point to a timestamp recognition/parsing problem, with the events indexed under the wrong _time.
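Something along these lines would show that - the index name and the quoted text are placeholders to replace with a real snippet from one of the 'missing' events:

index=<problem_index> "unique text copied from one of the missing raw events" earliest=0
| eval event_time=strftime(_time, "%F %T"), indexed_at=strftime(_indextime, "%F %T")
| table event_time indexed_at source host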

Also check output of

splunk list monitor

and

splunk list inputstatus

Also check the _internal index for mentions of the files that seem not to have been ingested.
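For example, something like this - the host and filename are placeholders, and the component names are the usual file-monitoring ones (they can vary a bit between versions):

index=_internal host=<uf_host> sourcetype=splunkd
(component=TailingProcessor OR component=TailReader OR component=WatchedFile)
"site1_network.log"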

If this were simply a performance problem, the gaps should have eventually filled in, so I suspect it's something else.

Skeer-Jamf
Path Finder

Alright, an admission if you will: I had been ignoring the fact that all the timestamps are in Zulu time, so some of what I've claimed is incorrect. Taking the time conversion into account in the response below:

 

./splunk list monitor

 

Shows that all the files that should be monitored are in fact being monitored.

 

./splunk list inputstatus

 

Shows all the files as before. The small, less active ones are type = finished reading; the larger, problematic ones are type = open file. All are at 100%.

Re-running the query, again for the past 24 hours, I now have 4 empty hours (Z -> CDT: 2 AM, 3 AM, 4 AM, and 7 AM), so I grepped the log file as before for timestamps that are missing from SC.

So far, the first three missing hours (2, 3, and 4 AM) are indeed empty in the log file as well.

7 AM, however, does have events - searching randomly for 07:10, 07:34, 07:45 turns up many of them. So I have a question out to the networking team to verify whether there are events/alerts in the local log storage on a device from the group that sourced the missing data.

 

Still waiting for a response from networking.

0 Karma

isoutamo
SplunkTrust
SplunkTrust

One old .conf presentation: How to Troubleshoot Blocked Ingestion Pipeline Queues, https://conf.splunk.com/files/2019/slides/FN1570.pdf. This should help you check whether there are blocked queues that could cause this situation.
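As a quick first check along those lines, blocked queues are flagged with blocked=true in metrics.log, so something like this should show whether any host or queue is backing up:

index=_internal source=*metrics.log* group=queue blocked=true
| stats count by host, name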

Skeer-Jamf
Path Finder

Actually stumbled across that earlier this morning. :thumbsup:

0 Karma

isoutamo
SplunkTrust
SplunkTrust

Hi

when you have some "high volume" files that are read by a UF with the default limits (see @gcusello's answer above), it reads just one file until it reaches the end of it. You should increase that limit, e.g. to 1024 or even 0 (unlimited). Another thing you could try is adding additional pipelines so that several files are read in parallel.
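For example, something like this on the UF (example values only; the forwarder needs a restart afterwards):

# limits.conf
[thruput]
maxKBps = 1024

# server.conf
[general]
parallelIngestionPipelines = 2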

r. Ismo

0 Karma

Skeer-Jamf
Path Finder

Hey, I was unaware of pipelines; I'll look those up. You mention the default limits - the suggestion to set maxKBps to 0 is about network throughput, yes? Or did I misinterpret that?

0 Karma

Skeer-Jamf
Path Finder

Thanks @gcusello. Yes, I'm experiencing delays in SplunkCloud's indexing. I did find a mention of possibly shortening the ingestion interval on the UF, but I have not yet found information on how to do this. One of my forwarders in particular sees pretty heavy log traffic: rsyslog on this host is processing upwards of 50M events in a 24-hour period.

CPU utilization is below 5%, memory below 20%, and NIC usage is around 45 KBps (average) with 0 transmit errors.

I also found a document explaining the parts of metrics.log; to be fair, I'm not familiar enough with it yet to identify potential problems.
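From that document, something like this should at least chart how much this UF actually forwards per minute (the host value is a placeholder):

index=_internal source=*metrics.log* host=<uf_host> group=thruput name=thruput
| timechart span=1m sum(kb) AS kb_forwarded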

0 Karma

gcusello
SplunkTrust
SplunkTrust

Hi @Skeer-Jamf 

UFs have a limited bandwidth allowance by default, which you can raise using the above parameter.

Check if you have blocked queues.
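To see which limit is currently in effect on the forwarder, btool prints the merged configuration and the file each value comes from:

$SPLUNK_HOME/bin/splunk btool limits list thruput --debug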

Ciao.

Giuseppe

0 Karma

Skeer-Jamf
Path Finder

Ingestion queues? Yes, that's what I'm trying to find information on - the 'how-to' is escaping me so far. Splunk's Troubleshooting Universal Forwarder page mentions exactly that, but it offers no links to further info, and my searches so far have turned up nada.

0 Karma

gcusello
SplunkTrust
SplunkTrust

Hi @Skeer-Jamf ,

in my first answer I suggested a search to highlight any blocked queues, along with the solution.

Ciao.

Giuseppe

0 Karma

Skeer-Jamf
Path Finder

Oh, I don't doubt the query - it just returned no matching data for the host in question. Querying the past 4 hours, it did find 37M events for one of our UF hosts in Japan, but still nothing for the one I'm working on.

But aside from that, a query is not a 'how-to' for configuring an inputs.conf file's ingest rates. That's what I'm trying to work out.

0 Karma

gcusello
SplunkTrust
SplunkTrust

hi @Skeer-Jamf ,

if you have a large delay, there's probably a queue backing up, so try applying the suggested update and check whether the delay continues.

Ciao.

Giuseppe

0 Karma

Skeer-Jamf
Path Finder

So just to confirm: no queues appear in the search query you suggested, but I should still apply the limits.conf change that appears to remove all throughput limits on the host in question, right?

To be fair, and since you know more than I do, I will give it a shot. But this completely ignores the idea of monitoring/troubleshooting to determine an actual cause. In essence, you're suggesting I release the Kraken on the network when I can't confirm that network throughput limits are actually the issue.

 

0 Karma

gcusello
SplunkTrust
SplunkTrust

Hi @Skeer-Jamf ,

let me make sure I understand: do you have a delay in your data?

if that is your issue, there may be some blocked queues; I'd try to analyze them using the Monitoring Console or by running a search like this:

index=_internal  source=*metrics.log sourcetype=splunkd group=queue 
| eval name=case(name=="aggqueue","2 - Aggregation Queue",
 name=="indexqueue", "4 - Indexing Queue",
 name=="parsingqueue", "1 - Parsing Queue",
 name=="typingqueue", "3 - Typing Queue",
 name=="splunktcpin", "0 - TCP In Queue",
 name=="tcpin_cooked_pqueue", "0 - TCP In Queue") 
| eval max=if(isnotnull(max_size_kb),max_size_kb,max_size) 
| eval curr=if(isnotnull(current_size_kb),current_size_kb,current_size) 
| eval fill_perc=round((curr/max)*100,2) 
| bin _time span=1m
| stats Median(fill_perc) AS "fill_percentage" max(max) AS max max(curr) AS curr by host, _time, name 
| where (fill_percentage>70 AND name!="4 - Indexing Queue") OR (fill_percentage>70 AND name="4 - Indexing Queue")
| sort -_time

which you can solve by adding the following on each forwarder that has the issue:

In limits.conf of Universal Forwarders
[thruput]
maxKBps = 0

Ciao.

Giuseppe

0 Karma