We have a Splunk server that is receiving data from more than 10 forwarders. It also receives data directly via UDP and monitors files on network shares.
We have scheduled searches to monitor and alert if a host stops sending data.
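Roughly, the alert search is based on the metadata command, something like this (the 30-minute threshold here is just an illustration, not our exact search):
| metadata type=hosts | eval minutesSinceLast=round((now()-recentTime)/60) | where minutesSinceLast > 30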
Occasionally, Splunk reports that all the forwarders have stopped sending data. When we diagnose it, we find that:
- Splunk is running fine and all UDP inputs and local file monitors are still working and receiving data
- All forwarders are up and running but Splunk is not indexing any data from them
- Restarting Splunk does not solve the issue
- Restarting the server solves the problem
We can't find any server logs that indicate a network problem.
Any ideas on how to diagnose this?
We're running 4.3.3 on Windows Server 2008 64bit.
I would start by looking at splunkd.log on the forwarder, in the $SPLUNK_HOME/var/log/splunk folder, for messages from 'TcpOutputProc'; they should give you an indication of what is happening when the forwarder tries to connect to the indexer.
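For example, something along these lines on the forwarder itself (the Windows path assumes a default Universal Forwarder install; adjust it to yours):
grep TcpOutputProc $SPLUNK_HOME/var/log/splunk/splunkd.log
findstr TcpOutputProc "C:\Program Files\SplunkUniversalForwarder\var\log\splunk\splunkd.log"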
I spent too much of a day trying to figure out why 2 of 5 servers were not showing up in my Indexer. I tried removing and re-adding the forward-server information, restarting the forwarder over and over, and even reinstalling the forwarder on each, but they just didn't show up. Using telnet I confirmed the connection was open and the Indexer was listening, and in the forwarders' splunkd.log files I confirmed the connection was being made. Finally I happened to change my search string to "index=_internal host=*" and there they were, but there was only one source from each and it was $SPLUNK_HOME/var/log/splunk/splunkd.log. The other 3 servers that were working had many more sources. A little more searching and I found this command:
$SPLUNK_HOME/bin/splunk list monitor
Sure enough, only splunkd.log was being forwarded. So I ran:
$SPLUNK_HOME/bin/splunk add monitor /var/log
That matched what the "working" servers showed in their "list monitor" results. After that, running "host=*" in the search app showed the 2 servers that wouldn't show up before.
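To double-check a forwarder now, I compare what it is configured to monitor against what is actually arriving on the indexer, roughly like this (the stats breakdown is just one way to look at it):
$SPLUNK_HOME/bin/splunk list monitor
index=_internal host=* | stats count by host, source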
Our problem seems to be performance related, since we are using network storage for the index.
After a little help from Splunk Support we found a few messages indicating the queues were blocked, probably because of network performance issues. The message is "Stopping all listening ports. Queues blocked for more than 300 seconds".
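For anyone else hitting this, a search along these lines against the internal metrics should show which queues are blocking (assuming the standard metrics.log fields):
index=_internal source=*metrics.log group=queue blocked=true | stats count by host, name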
We are still testing everything to make sure that's the correct diagnosis. If we find anything more, I will post it here!
I too am facing a similar issue. Did it get resolved for you (and how)?
Thanks.
We run a scheduled search that monitors whether data has stopped coming from the forwarders and captures the most recent time entry (when the data stopped).
I was able to see that, at that time, all the forwarders were reporting Connection Failed (in TcpOutputProc).
It seems to be a problem on the receiving indexer... The indexer is still running, but TCP data just isn't getting through (although UDP and local monitoring still work).
Maybe it's some kind of firewall issue. Any idea on how to diagnose this further?
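Next time it happens I plan to check whether the receiving port is still listening on the indexer and is reachable from a forwarder, something like this (assuming the default receiving port 9997):
netstat -ano | findstr 9997
telnet <indexer> 9997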