Splunk Search

How can we write a real-time or relative search to check server uptime for the last 30 seconds?

Path Finder

Hi,

We have the search below, which gives us server uptime. To get results, we have to select ALL TIME or a time range that goes back to the last time the server recorded an uptime event.

For example:
If the server has been up and running since Jan 9 2016, we can select ALL TIME or 30 days as the time range to get results. If we select a time range shorter than 30 days, the search does not return any results.

Requirement (what we are looking for): we need to write a real-time or relative search with a 30-second window to check whether the server is up and running:

 index=F5 "monitor status" F5_Server_Infor="Server1" | eval latest_time=now() | eval Server_Availability=case(F5_Server_Infor="Server1", "Server") | stats latest(F5_TCPStatus) as latest_status by Server_Availability, latest_time | convert ctime(latest_time) as Last_Run_Time | fields - latest_time, Server_Availability, latest_status

Just our thoughts: whenever the search runs, it should check the last recorded status and its _time. If the server has been up since that last status event, the search should report it as currently up, even though the search window is only 30 seconds.

Can someone please help us and advise how we can build a real-time search for this?

Thanks
Sarath


SplunkTrust

It appears from your sample data that servers report only when their status changes, so looking only at the last 30 seconds may not be enough. Try this. It returns the most recent status for all hosts.

index=F5 "monitor status" | dedup host | table _time host F5_TCPStatus
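
If the time picker is the sticking point, one option is to put the lookback in the search itself with earliest, which takes precedence over the picker, and to group by pool member rather than by the reporting F5 host. This is only a rough sketch: the 30-day lookback and the field names pool_member, member_status, current_status and last_change are assumptions for illustration, and the rex pattern is based solely on the sample events posted in this thread.

```spl
index=F5 "monitor status" earliest=-30d
| rex "member (?<pool_member>\S+) monitor status (?<member_status>up|down)"
| stats latest(member_status) as current_status max(_time) as last_change by pool_member
| convert ctime(last_change) as Last_Change
```

The lookback has to reach at least as far as the last status change; a member that has not changed state within that window will not appear in the results.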
---
If this reply helps you, an upvote would be appreciated.

SplunkTrust

Can you provide a sample of the events that are recording the status of Server1? (Are they from an F5 device?) I think if you could post a couple of those (preferably two or three in a row), we'll probably be able to work you up an answer.

Path Finder

Hi,

Below are a few sample events:

1/17/16 
6:26:21.000 PM
Jan 17 18:26:21 10.38.34.91 Jan 17 18:26:21 CTPLTM1BP notice mcpd[7009]: 01070727:5: Pool /Common/F_443_pool member /Common/server1:443 monitor status up. [ /Common/F_443: up ]  [ was down for 0hr:3mins:56sec ]
F5_TCPStatus = up host = 10.38.34.91 index = f5 source = udp:10500 sourcetype = F5 splunk_server = host1

1/17/16 
6:26:21.000 PM  
Jan 17 18:26:21 10.38.34.91 Jan 17 18:26:21 CTPLTM1AP notice mcpd[7341]: 01070727:5: Pool /Common/F_443_pool member /Common/server1:443 monitor status up. [ /Common/F_443: up ]  [ was down for 0hr:3mins:57sec ]
F5_TCPStatus = up host = 10.38.34.91 index = f5 source = udp:10500 sourcetype = F5 splunk_server = host1

1/17/16 
6:22:25.000 PM  
Jan 17 18:22:25 10.38.34.92 Jan 17 18:22:25 CTPLTM1BP notice mcpd[7009]: 01070638:5: Pool /Common/F_443_pool member /Common/server1:443 monitor status down. [ /Common/F_443: down ]  [ was up for 0hr:11mins:4sec ]
F5_TCPStatus = down host = 10.38.34.92 index = f5 source = udp:10500 sourcetype = F5 splunk_server = host1

1/17/16 
6:22:24.000 PM  
Jan 17 18:22:24 10.38.34.91 Jan 17 18:22:24 CTPLTM1AP notice mcpd[7341]: 01070638:5: Pool /Common/F_443_pool member /Common/server1:443 monitor status down. [ /Common/F_443: down ]  [ was up for 0hr:11mins:4sec ]
F5_TCPStatus = down host = 10.38.34.91 index = f5 source = udp:10500 sourcetype = F5 splunk_server = host1

1/17/16 
6:11:21.000 PM  
Jan 17 18:11:21 10.38.34.92 Jan 17 18:11:21 CTPLTM1BP notice mcpd[7009]: 01070727:5: Pool /Common/F_443_pool member /Common/server1:443 monitor status up. [ /Common/F_443: up ]  [ was down for 0hr:5mins:6sec ]

Thanks
Sarath


SplunkTrust

Thanks!

A few questions and comments to clarify:

If the F5 reports "monitor status down" when something's down, why not just search for "monitor status down" and make a real-time alert on that? A very focused real-time search like that is not nearly as bad as a rather vague and broad one.
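
A minimal sketch of such a focused search, assuming the event text looks like the samples posted above (pool_member is a hypothetical field name, and "server1" is just the member from those samples):

```spl
index=F5 "monitor status down" "server1"
| rex "member (?<pool_member>\S+) monitor status down"
| table _time host pool_member
```

Saved as an alert that triggers when the number of results is greater than zero, this would fire only when a down event arrives, rather than re-evaluating every status event on each run.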

Do the "monitor status down" events get logged continually (or periodically) by the F5 for as long as the server isn't responding? For example, does that log entry repeat once an hour (or on some regular basis) until it's up again? Or does it get logged just once, with no further messages until it's back up? (I'm wondering how far back we'll have to go to find the current status of something. From what you mention above, I wonder if you wouldn't be better served by a summary index keeping a state table of some sort.)
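
To make the state table idea concrete, here is a rough sketch that swaps the summary index for a CSV lookup file, since that is simpler to show in a few lines. Everything named here is an assumption: the lookup f5_member_state.csv is hypothetical and would need to exist (even if empty apart from its header) before the first run, the 5-minute window assumes the search is scheduled every 5 minutes, and the rex pattern is based only on the sample events above.

```spl
index=F5 "monitor status" earliest=-5m
| rex "member (?<pool_member>\S+) monitor status (?<member_status>up|down)"
| stats latest(member_status) as member_status max(_time) as last_change by pool_member
| inputlookup append=true f5_member_state.csv
| sort 0 -last_change
| dedup pool_member
| outputlookup f5_member_state.csv
```

Each run merges any new status changes with the previously saved rows and keeps only the most recent row per pool member. A report can then read the current state with | inputlookup f5_member_state.csv, whatever time range is selected in the picker.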

May I ask why 30 seconds? Is this actually useful (as opposed to a one-minute scheduled search)? Or is it just that management requires it? (That last can be legitimate; I'm just wondering how much flexibility there is.)

Meanwhile, while I've been faffing about writing this set of questions, someone else has probably answered this for you. 🙂


SplunkTrust

Are your servers reporting every 30 seconds (or faster)? If not, searching every 30 seconds is just wasting resources.

If you want to know the last time a server reported in, why are you using now() in your query? I wouldn't expect any server to have the current time in their event data.
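
As a sketch of the alternative, the timestamp of the last status event can be taken from _time, with now() used only to work out how long the server has been in that state. The 30-day lookback and the field names member_status, current_status, last_change, seconds_in_state and time_in_state are assumptions for illustration; the rex pattern follows the sample events posted earlier.

```spl
index=F5 "monitor status" "server1" earliest=-30d
| rex "monitor status (?<member_status>up|down)"
| stats latest(member_status) as current_status max(_time) as last_change
| eval seconds_in_state = now() - last_change
| eval time_in_state = tostring(seconds_in_state, "duration")
| convert ctime(last_change) as Last_Change
```

If current_status is up, time_in_state is effectively the uptime since the last recorded change.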

---
If this reply helps you, an upvote would be appreciated.

Path Finder

Yes, all of these are production servers. Management wants this report every 30 seconds.

We used now() just so that the search runs as of now and retrieves the current status.
But we have to select either ALL TIME or a range of days reaching back to when the server came up (let's say the server has been running since Jan 9; we also get results if we select 30 days). If I set the earliest time to 1 day or yesterday, the search does not retrieve any results.

What we want is for the search to return results when we select 15 minutes (or less) in the time picker.
