I have the universal forwarder installed on one of our Windows servers, and it had been forwarding logs with no issues for some time, until this morning. One of our dashboards didn't report getting anything from this server this morning, which is unusual, as it reports on a specific process that runs on that server every morning.
I did some digging. A basic search in Splunk of "host=servernamehere" with the time range set to "All time" returned all the events from that server up until about 8:30 last night, but nothing since then. I remoted into the server and verified the Universal Forwarder service is running. I also checked the event logs to rule out the odd possibility that no new events had been written since last night; everything checked out fine, as the service was running and, of course, there were tons of new events being logged. I then went back to Splunk > Settings > Forwarder Management and checked the last time that server had phoned home, and it showed a few seconds ago.
So that is where I am at: everything I know to check looks fine, and yet no new logs are coming from that server into Splunk. I have a feeling a simple reboot will resolve this, but I am afraid it may happen again, or may already be happening on other servers.
A. Is there any way to troubleshoot further to find the cause of this issue?
B. Can I set up a way to detect this happening on other servers going forward? (See the search sketch just below these questions.)
C. Is there a way to prevent this from happening again?
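For question B, one rough approach is a scheduled search that flags hosts whose most recent event is older than some threshold, saved as an alert. This is only a sketch; the index scope and the 60-minute threshold are placeholders to adjust for your environment:
| tstats latest(_time) AS last_seen WHERE index=* BY host
| eval minutes_since_last_event = round((now() - last_seen) / 60, 0)
| where minutes_since_last_event > 60
| convert ctime(last_seen)
| sort - minutes_since_last_event
Saved as an alert that runs every 15 minutes or so, this would flag any host that goes quiet, regardless of whether its forwarder service died or something else broke.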
Okay, found the issue; it was a PEBKAC problem (at least mostly). When I remoted into the server to troubleshoot, I actually remoted into a different server with an almost identical name and did the troubleshooting there instead. After some backtracking, it turned out the actual server with the issue rebooted around the time the logs stopped, and for some reason the forwarder service just never came back up after the reboot.
I went in and turned the service back on and things started back up just fine.
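For question C, since the root cause was the service not coming back after a reboot, one hedged mitigation is to make sure the Windows service (typically named SplunkForwarder for a universal forwarder install; verify with sc.exe query state= all) is set to start automatically and to restart itself on failure. A rough sketch from an elevated prompt:
rem make sure the forwarder service starts automatically at boot
sc.exe config SplunkForwarder start= auto
rem restart the service automatically if it crashes (retry after 60 seconds, failure counter resets daily)
sc.exe failure SplunkForwarder reset= 86400 actions= restart/60000/restart/60000/restart/60000
The start= auto line covers the did-not-start-at-boot case seen here; the failure actions only help if the service starts and later dies.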
Hi @kpers, I know this question was asked a long time ago. I am currently encountering the same issue and was wondering which forwarder service you turned back on; could you please elaborate? The UF is installed on a Windows server and I'm not that familiar with the services Splunk uses.
It sounds like you need to open a case. Did you run a real-time (All time) search to see if events are coming in, but with a delay? We had a problem with delay and had to turn off AD name resolution with:
[WinEventLog:Security]
evt_resolve_ad_obj = 0
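For reference, a quick way to tell whether events from a host are arriving late rather than not at all is to compare index time with event time. A minimal sketch, with the host name and time window as placeholders:
host=servernamehere earliest=-4h
| eval lag_seconds = _indextime - _time
| stats count avg(lag_seconds) AS avg_lag max(lag_seconds) AS max_lag
A large avg_lag/max_lag with a healthy count points at delay (for example from AD object resolution), while a count of zero points at the data not arriving at all.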
I did, I even waited like a half hour and came back and checked again before I responded back.
I hate to hear that AD name resolution may add a delay. I am new to Splunk and have been looking into how to push that config change out to all the Windows servers I have already put the forwarder on, to get better name reporting in my dashboard panels. I hope that once I figure it out and turn it on, I don't run into the same issue.
I will log a ticket with support and see where it goes.
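In case it helps anyone following the same path: one common way to push an inputs.conf change like this to many Windows forwarders is a deployment server app. This is only a sketch; it assumes the forwarders already phone home to the deployment server (as Forwarder Management showed above), and the app and server class names are made up:
# on the deployment server: $SPLUNK_HOME/etc/deployment-apps/winevt_tuning/local/inputs.conf
[WinEventLog:Security]
# 1 enables AD name resolution for better name reporting, 0 avoids the delay mentioned above
evt_resolve_ad_obj = 1

# on the deployment server: $SPLUNK_HOME/etc/system/local/serverclass.conf
[serverClass:windows_servers]
whitelist.0 = *

[serverClass:windows_servers:app:winevt_tuning]
restartSplunkd = true
The whitelist here matches every client; in practice you would narrow it so only the Windows hosts pick up the app.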
A good place to start is the _internal logs:
index=_internal host=servernamehere
See if you can find anything in those logs that explains why the data is not being forwarded.
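If that search returns a lot of routine noise, a hedged refinement is to narrow it to warnings and errors from splunkd and group them by component:
index=_internal host=servernamehere source=*splunkd.log* (log_level=ERROR OR log_level=WARN)
| stats count BY component, log_level
Keep in mind that once a forwarder stops forwarding, its own _internal events stop arriving too, so an empty result after a certain time is itself a clue.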
Okay, I ran it. The majority of it looked like routine communication/reporting logs going back as far as I could see, until right at the end, where you do see some references to shutting down. I am not sure what caused the shutdown, but it happened right around the time it stopped logging. Here is part of the last bit of output it logged:
07-19-2015 08:31:43.761 -0500 INFO TcpInputProc - Setting up input quiesce timeout for : 15 secs
07-19-2015 08:31:43.761 -0500 INFO TcpInputProc - Shutting down listening ports
07-19-2015 08:31:43.761 -0500 INFO TcpInputProc - Running shutdown level 1. Closing listening ports.
07-19-2015 08:31:43.761 -0500 INFO ShutdownHandler - shutting down level "ShutdownLevel_TcpInput1"
07-19-2015 08:31:43.761 -0500 INFO ShutdownHandler - shutting down level "ShutdownLevel_Thruput"
07-19-2015 08:31:43.761 -0500 INFO ShutdownHandler - shutting down level "ShutdownLevel_KVStore"
07-19-2015 08:31:43.761 -0500 INFO ShutdownHandler - shutting down level "ShutdownLevel_Begin"
07-19-2015 08:31:43.761 -0500 INFO ShutdownHandler - Shutting down splunkd
07-19-2015 08:31:43.761 -0500 INFO loader - Shutdown HTTPDispatchThread
07-19-2015 08:31:43.761 -0500 INFO PipelineComponent - Performing early shutdown tasks
07-19-2015 08:31:13.497 -0500 INFO Metrics - group=tpool, name=bundlereplthreadpool, qsize=0, workers=0, qwork_units=0
07-19-2015 08:31:13.497 -0500 INFO Metrics - group=tpool, name=batchreadertpool, qsize=0, workers=1, qwork_units=0
07-19-2015 08:31:13.497 -0500 INFO Metrics - group=tcpout_connections, name=default-autolb-group:172.16.51.161:9997:0, sourcePort=8089, destIp=172.16.51.161, destPort=9997, _tcp_Bps=217.47, _tcp_KBps=0.21, _tcp_avg_thruput=0.20, _tcp_Kprocessed=2939, _tcp_eps=0.30, kb=6.37
Okay, so the forwarder shut down and never resumed forwarding, which also means it didn't forward any subsequent _internal logs. On the forwarder, check $SPLUNK_HOME/var/log/splunk/splunkd.log and see what it was reporting after the shutdown. That might give you some detail.
Also check that the process is actually running. Restarting it could clear this up (although it won't tell you why it ended up in a problem state to begin with).
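A hedged way to do both checks directly on the Windows box; the service name and install path below are the usual defaults for a universal forwarder, so adjust if yours differ:
rem Windows service status for the universal forwarder
sc.exe query SplunkForwarder
rem Splunk's own status and restart commands, run against the forwarder install
"C:\Program Files\SplunkUniversalForwarder\bin\splunk.exe" status
"C:\Program Files\SplunkUniversalForwarder\bin\splunk.exe" restart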
I checked the splunkd.log on that server, and there seems to be zero change between the events before and after it stopped reporting, as they all look like this:
07-20-2015 08:39:01.520 -0500 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_ServerIPHere_Port_ServerNameHere
I double-checked the Universal Forwarder service on that box, and it was still running, so I restarted it, thinking you were going to be right and it would just start back up as if nothing had happened. To my surprise it did nothing of the sort; I still see no new events when I search for that server.
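When a restart doesn't bring events back, one more hedged thing to check is whether the forwarder can actually reach the indexer's receiving port (the 172.16.51.161:9997 destination shown in the metrics output above). If _internal events from the host start arriving again after the restart, output problems should show up with something like:
index=_internal host=servernamehere source=*splunkd.log* component=TcpOutputProc (log_level=ERROR OR log_level=WARN)
If nothing at all arrives, check splunkd.log on the box itself for TcpOutputProc messages and test connectivity from the forwarder, for example (Test-NetConnection needs a reasonably recent PowerShell; a telnet to the port works too):
powershell -Command "Test-NetConnection 172.16.51.161 -Port 9997"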