We have a Universal Forwarder on our Linux rSyslog server. It was working fine until two weeks ago. The problem was that it would stop sending data to the indexer but showed no errors in splunkd.log. When we restarted it, it would send a burst of information over the course of 4-5 minutes, then stop sending data again.
Over the past two weeks we have replaced the rSyslog server with a new server. The new server has 8 cores, plenty of memory, and a 10Gb network connection to the Splunk indexer. Once we installed the forwarder it ran non-stop for two days, catching up on the data that had been missed over the two-week period. At 6pm last night it stopped forwarding data again, so we're back to the same problem we started with: we get a burst of log data on restart, but then it just stops. No errors, nothing to suggest we've hit any limits, and the splunkforwarder service is still running. What we DO notice is that splunkd holds the files open, and the number of open files continues to climb after it stops forwarding data. Some of these files are large, but we don't get any error messages about batch reading either.
In limits.conf we have this set:
maxKBps = 0
max_fd = 10240
The ulimits on the server are set to 100000 - we're averaging about 4500-5000 open files before the forwarder stops running.
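One thing worth confirming is the limit actually applied to the running splunkd process, since a daemon does not necessarily inherit the login shell's ulimit. A minimal sketch, assuming a Linux /proc filesystem; the function name is made up:

```shell
# Print the soft "open files" limit that a given PID is actually
# running with, as reported by /proc/<pid>/limits.
open_file_limit() {
  awk '/Max open files/ {print $4}' "/proc/$1/limits"
}

# Example: check the running splunkd (pgrep -o picks the oldest PID,
# i.e. the parent process)
# open_file_limit "$(pgrep -o splunkd)"
```

If this number is lower than what `ulimit -n` shows in your shell, the limit is being set per-service rather than system-wide.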
Universal Forwarder 7.3.1
We have about 35 windows forwarders working on servers with no issues at all. It's this one Linux forwarder that's not working correctly.
Any help you can give would be appreciated. Let me know if there is any additional information needed.
Still working with support on this. Starting to get pressure since it's been going on so long.
Made multiple changes to ulimits (now 32000 soft / 64000 hard), maxQueueSize (256MB), initCrcLength (500, then 700), and a host of other settings.
rSyslog is now rotating logs more frequently so the file size is much lower.
A crude workaround we are using now is a cron job that restarts splunkforwarder.service when the open-file count hits a certain threshold. We did this because, when the forwarder stops sending data, it is still opening and holding files open. Between 5k-6k open files is about when the forwarder stops sending data.
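For anyone wanting to replicate that stopgap, here is a minimal sketch of such a watchdog, assuming Linux /proc, systemd, and a splunkforwarder.service unit; the threshold and function names are invented:

```shell
#!/bin/sh
# Restart the forwarder when splunkd's open-file count crosses a threshold.
# Intended to run from cron, e.g.: */5 * * * * /usr/local/bin/uf-watchdog.sh
THRESHOLD=5000

# Count the file descriptors currently held by a PID via /proc.
count_fds() {
  ls "/proc/$1/fd" 2>/dev/null | wc -l
}

watchdog() {
  pid=$(pgrep -o splunkd) || return 0   # forwarder not running; nothing to do
  if [ "$(count_fds "$pid")" -ge "$THRESHOLD" ]; then
    systemctl restart splunkforwarder.service
  fi
}

# watchdog   # uncomment when deploying via cron
```

This only papers over the underlying problem, but it keeps data flowing while the root cause is investigated.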
I am having the exact same issue with folder monitoring on a few of my UFs (but oddly not the others), and searching through the splunkd and metrics logs there are no errors or warnings. It's so odd, but as soon as I restarted the forwarders I got all the logs and the backlog of logs.
My issue ended up being a problem with my inputs.conf file. The Universal Forwarder is on a logserver and has a rather large inputs.conf file. After working with Splunk support and my Linux team for about two weeks I ended up going my own route one night.
I commented out every entry in my inputs.conf file, then uncommented them one at a time, restarting the forwarder each time. The first 3-4 that were "turned back on" worked fine and continuously sent data with no issues. I believe it was the 5th entry that, when uncommented, started causing the issue again. It turned out to be a change I needed to make to a wildcard in one of my inputs.conf lines: it was a "..." instead of a "*". Not sure why this made a difference, or what changed to cause the problem, since it had been working for months. Regardless, after I made this change I still went one by one uncommenting each line of the inputs file, and ran into no other issues.
This one wildcard was the cause of my forwarder not working. It ran through the first 3-4 stanzas without issue, then would hit that specific line and not move beyond that point in the inputs file. This explained my burst of data and then nothing. Coming up on a year with no issues since the correction was made.
Mine is similar but different. I'm monitoring a log server too but it's only one monitoring stanza and there are no subdirectories in this folder, only the rolling log files. If Splunk Support can't help me I'll try the ... but I don't think it will work in the same way.
"..." means recurse into all subdirectories.
"*" matches files (or directories) at a single path level, but does not recurse into subdirectories.
Maybe someone has created some recursive subdirectories with a lot of files?
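To make the difference concrete, here is a hypothetical pair of monitor stanzas (the paths are invented for illustration):

```
# "..." recurses: matches messages.log in every subdirectory
# under /var/log/hosts, however deep
[monitor:///var/log/hosts/.../messages.log]

# "*" does not recurse: matches only .log files directly
# inside /var/log/hosts
[monitor:///var/log/hosts/*.log]
```

On a log server with deep directory trees, the recursive form can pull in vastly more files than intended.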
Do you find any issues in the metrics logs?
Please look for any WARN or ERROR level messages in metrics.log. There might be some blocked queues preventing data from being forwarded.
You can find the blocked queues using the grep pattern below on metrics.log:
tail -f $SPLUNK_HOME/var/log/splunk/metrics.log | grep "blocked=true"
If you find any queues blocked, that is where the fix is needed.
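Building on that grep, a small helper can tally which queues are blocking most often. This is a sketch; the function name is made up, and it assumes the usual "group=queue, name=..., blocked=true" line format in metrics.log:

```shell
# List queue names that logged blocked=true, most frequent first.
blocked_queues() {
  grep 'blocked=true' "$1" \
    | sed -n 's/.*name=\([^,]*\).*/\1/p' \
    | sort | uniq -c | sort -rn
}

# Usage: blocked_queues "$SPLUNK_HOME/var/log/splunk/metrics.log"
```

The queue at the top of the list (e.g. parsingqueue vs tcpout) hints at whether the bottleneck is local processing or the network path to the indexer.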