I need to find all the servers where the splunkd service is no longer running but was running before. I have more than 9000 forwarders and three scenarios, which are listed below:
Because of the above limitations, I am finding it difficult to use queries based on phone-home events or internal logs received in Splunk, as they return an incorrect server list.
Also, I'm not allowed to use a script to monitor the splunkd service on each host, as that requires remote login.
Currently I'm using internal logs to work out which forwarders are up and which are down, but I'm looking for a better solution.
You can use this to get the number of seconds since each host last sent internal logs, and set a threshold of 60 seconds or more based on your configuration.
index=_internal | bucket _time span=1m | eval timenow=now() | convert timeformat="%b %d, %Y %H:%M:%S" mktime(_time) AS LastInfoEvent mktime(timenow) AS currentTime | eval secondsSinceLastKeepAlive=(currentTime-_time) | stats min(secondsSinceLastKeepAlive) AS secondsDead by host | sort secondsDead DESC
Adding to @dantimola
You can use the three ideas below and tweak them to your requirements:
1. If no internal logs are being received from a UF, schedule an hourly search that calculates how long it has been since each host last forwarded data. You can use the tstats command to reduce search processing.
2. Check Splunk's internal logs and correlate them with the TCPOutput messages to see whether forwarding is failing.
3. Check Splunk's internal logs and correlate them with the connections phoning home to the DS. A UF should communicate with the DS every time the DS is restarted (this is the default behaviour).
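As a rough sketch of the first idea, the search below uses tstats to find the most recent event per host and lists hosts that have been silent for more than an hour. The field names lastSeen and secondsSinceLastEvent, the index=* scope, and the 3600-second threshold are assumptions for illustration; adjust them to your environment.

```
| tstats max(_time) as lastSeen where index=* by host
| eval secondsSinceLastEvent = now() - lastSeen
| where secondsSinceLastEvent > 3600
| eval lastSeen = strftime(lastSeen, "%Y-%m-%d %H:%M:%S")
| sort - secondsSinceLastEvent
```

Note that tstats only returns hosts with at least one event inside the search time range, so run it over a window longer than the threshold (for example, the last 24 hours). Hosts that have never sent data will not appear at all and would need to come from an asset list.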
Hopefully you also have an asset database, which would make it easier to correlate the results and reach out to the server admins.
Have you tried checking via the Deployment Server? You can check the status of your universal forwarders in the Deployment Server's Forwarder Management. Have you also tried the query below?
| metadata type=hosts index=<index name> | convert ctime(*Time)
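Building on the metadata search above, a minimal sketch that keeps only the hosts whose most recent event was indexed more than 15 minutes ago might look like this. The 900-second threshold and the secondsSinceLast field name are assumptions; recentTime is one of the fields the metadata command returns, alongside firstTime and lastTime.

```
| metadata type=hosts index=<index name>
| eval secondsSinceLast = now() - recentTime
| where secondsSinceLast > 900
| convert ctime(*Time)
| sort - secondsSinceLast
```

The eval has to come before convert ctime(*Time), because convert turns the epoch values into display strings that can no longer be subtracted.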