We need to provide a report that captures how long a Splunk instance was down in the past.
Is it possible to capture this using the internal logs? What Splunk query can we use to get the duration?
Note: currently the Splunk instances are up and running.
By default, Splunk only stores the _* internal logs for 30 days, so if you need to go farther back than that, you can infer an outage by looking for a large jump in indexing latency, defined as _time subtracted from _indextime.
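A minimal sketch of that approach, assuming your events live in index=main and that index retains data over the window you care about (a sustained spike in latency marks the period when the instance was not indexing):

index=main earliest=-90d
| eval latency = _indextime - _time
| timechart span=1h max(latency) AS max_latency_seconds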
Splunk writes to resource_usage.log every 10 seconds. By finding gaps in that logging you can work out when the splunkd process was down. This is only an estimate; you have to add/subtract the time Splunk needs to shut down and start up. A query along these lines is sketched below.
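A sketch of such a gap search, assuming the default 10-second collection interval; the 60-second threshold is an arbitrary cutoff you can tune:

index=_introspection sourcetype=splunk_resource_usage
| streamstats current=f last(_time) AS next_time by host
| eval gap_seconds = next_time - _time
| where gap_seconds > 60
| eval downtime_minutes = round(gap_seconds / 60, 1)
| table host, _time, downtime_minutes

Because Splunk returns events newest first, next_time is the timestamp of the chronologically next sample, so gap_seconds measures the silence between consecutive samples.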
If other Splunk instances send their internal logs to the indexing layer (always the best practice), then you can find the downtime for those instances by restricting the same search to a specific host:
index=_introspection sourcetype=splunk_resource_usage host=XXX
You can also search the splunkd.log files for "Shutdown complete" and "Splunkd starting", then calculate the difference between those events, for example as sketched below.
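One way to pair those events is the transaction command, which groups each shutdown with the following start and computes the duration between them; a sketch, noting that the exact message strings may vary slightly between Splunk versions:

index=_internal sourcetype=splunkd ("Shutdown complete" OR "Splunkd starting")
| transaction host startswith="Shutdown complete" endswith="Splunkd starting" maxevents=2
| eval downtime_minutes = round(duration / 60, 1)
| table host, _time, downtime_minutes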
One caveat I've seen: when Splunk is restarted from the GUI, there is no such indication of the restart in the logs. I don't have a solution for that case.