Hello, I'm looking for help showing the uptime/downtime percentage for my Universal Forwarders over the past 7 days:
I've seen many people trying to solve a similar use case on Answers but haven't quite found what I'm looking for yet. I've been testing the query below. My thinking was to calculate the difference in minutes between a host's timestamps for the eval field Action = "Splunkd Shutdown" and Action = "Splunkd Starting", then sum that downtime and divide it by the total minutes in one week (10080) to get the downtime percentage, with uptime being the remainder. There is a problem with this logic, though: if a host's last shutdown is not within the search window, the metric won't be accurate. I'm open to a discussion on how this can be monitored most accurately.
This query returns the host and timestamp for when splunkd shut down and another event with timestamp when Splunkd started.
index=_internal source="*SplunkUniversalForwarder*\\splunkd.log" (event_message="*Splunkd starting*" OR event_message="*Shutting down splunkd*") | eval Action = case(like(event_message, "%Splunkd starting%"), "Splunkd Starting", like(event_message, "%Shutting down splunkd%"), "Splunkd Shutdown")
| stats count by host, _time, Action
@johnward4 you are possibly looking for the /deployment/server/clients rest endpoint. (Refer to Splunk Documentation for details: https://docs.splunk.com/Documentation/Splunk/latest/RESTREF/RESTdeploy#deployment.2Fserver.2Fclients)
| rest splunk_server=local /services/deployment/server/clients | fieldformat lastPhoneHomeTime=strftime(lastPhoneHomeTime,"%Y/%m/%d %H:%M:%S.%3N")
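As a rough sketch of how the same endpoint could flag forwarders that look down right now, the search below computes the minutes since each client last phoned home. The 10-minute threshold is an arbitrary assumption, not something from this thread, and should be tuned to your phone-home interval. The search only works where a deployment server is reachable via REST.

```spl
| rest splunk_server=local /services/deployment/server/clients
| eval minutes_since_phone_home=round((now() - lastPhoneHomeTime) / 60, 1)
| where minutes_since_phone_home > 10
| sort - minutes_since_phone_home
| table hostname ip lastPhoneHomeTime minutes_since_phone_home
```

lastPhoneHomeTime is used here as an epoch value, so it is compared with now() before any display formatting is applied.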
@niketn The rest command you recommended looks like it's meant for a deployment server. I'm using Splunk Cloud and don't have an on-prem deployment server, so I've tried using index=_internal source=*splunkd.log to monitor whether my UFs are online.
I'm looking to show a % of uptime for the past 7 days. I'd like help on how to subtract the timestamps of two different Action values to show how long a host was down, then sum that downtime and divide it by 7 days. I'm also open to suggestions for a better way to calculate this.
index=_internal source="*SplunkUniversalForwarder*\\splunkd.log" (event_message="*Splunkd starting*" OR event_message="*Shutting down splunkd*") | eval Action = case(like(event_message, "%Splunkd starting%"), "Splunkd Starting", like(event_message, "%Shutting down splunkd%"), "Splunkd Shutdown")
| stats count by host, _time, Action
This query returns the host and timestamp for when splunkd shut down and another event with timestamp when Splunkd started.
| stats values(Action) as Action by host, _time
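One possible way to turn these start/shutdown events into a percentage is to pair each "Splunkd Starting" event with the preceding "Splunkd Shutdown" event on the same host and sum the gaps. This is only a sketch under assumptions: the field names prev_time, prev_action, downtime_sec and uptime_pct are made up for illustration, and the 604800 is simply 7 days in seconds (the 10080 minutes mentioned above).

```spl
index=_internal source="*SplunkUniversalForwarder*\\splunkd.log" (event_message="*Splunkd starting*" OR event_message="*Shutting down splunkd*") earliest=-7d latest=now
| eval Action = case(like(event_message, "%Splunkd starting%"), "Splunkd Starting", like(event_message, "%Shutting down splunkd%"), "Splunkd Shutdown")
| sort 0 host _time
| streamstats current=f last(_time) as prev_time last(Action) as prev_action by host
| eval downtime_sec=if(Action="Splunkd Starting" AND prev_action="Splunkd Shutdown", _time - prev_time, 0)
| stats sum(downtime_sec) as downtime_sec by host
| eval uptime_pct=round(100 * (604800 - downtime_sec) / 604800, 2)
```

This sketch does not solve the window-edge problem: a shutdown that happened before the 7-day window, or one with no matching start yet, contributes no downtime, so uptime_pct can be overstated. It also sees only clean shutdowns that were logged. A crash or a powered-off host leaves no "Shutting down splunkd" event.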
Since my hosts are Windows-based, I found this query helpful for showing uptime:
index=wineventlog host=* source="WinEventLog:System" EventCode=6013
| rex field=Message "The system uptime is (?<SystemUpTime>\d+) seconds."
| dedup host
| eval DaysUp=round(SystemUpTime/86400,2)
| eval Years=round(DaysUp/365,2)
| eval Months=round(DaysUp/30,2)
| table host DaysUp Years Months SystemUpTime
| sort host
| search DaysUp > 0
| strcat DaysUp " Days" UpTime
| sort - DaysUp
| table host UpTime
| fields - Years, Months, SystemUpTime
@johnward4 from the question I assumed you wanted real-time monitoring. If you want historical data, that might be the right approach. However, the REST API would be fastest if you want to know what is down right now.
Also, I think the following does the same as the SPL you are using and would perform better:
index=wineventlog host=* source="WinEventLog:System" EventCode=6013
| fields host Message
| rex field=Message "The system uptime is (?<SystemUpTime>\d+) seconds."
| dedup host
| search SystemUpTime>86400
| eval UpTime=round(SystemUpTime/86400,2)
| sort - UpTime
| table host UpTime
| eval UpTime=UpTime." Days"
Also, consider moving the rex extraction into a search-time field extraction so that SystemUpTime is available without the rex command.