How to detect that a component has been down for some time when no "up" event has appeared in that period?

alexeyglukhov
Path Finder

Hello all!
The task is to alert if a component (pool) is down for more than 10 minutes.

Some details:
There are down and up events for many pools (example entries are below).
I started by using a transaction to find those pairs of events and compare the duration to the threshold. This works perfectly (the search string is below).

| transaction myvip pool_name_and_port mcpd_code
startswith=("component status down")
endswith=("component status up")
| where duration>600

The problem is that for the transaction to appear in the alert, I always need the second "up" message (which can be delayed by hours or days), so this approach can't generate an alert as soon as the pool has been down for more than 10 minutes.

So I need your help detecting that a component has been down for longer than some period (10 minutes in my case) with no "component status up" event in that period.

Thank you in advance for any help!

Entry examples:
Feb 26 21:11:23 log_file_location_hostname_1 notice mcpd[9999]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10003 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:15mins:39sec ]
Feb 26 21:11:22 log_file_location_hostname_2 notice mcpd[8888]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10003 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:15mins:35sec ]
Feb 26 21:11:22 log_file_location_hostname_2 notice mcpd[8888]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10003 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:15mins:35sec ]
Feb 26 21:11:21 log_file_location_hostname_3 notice mcpd[7777]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10003 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:15mins:34sec ]
Feb 26 21:11:21 log_file_location_hostname_3 notice mcpd[7777]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10003 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:15mins:34sec ]
Feb 26 21:00:37 log_file_location_hostname_1 notice mcpd[9999]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10004 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:17mins:45sec ]
Feb 26 21:00:37 log_file_location_hostname_4 notice mcpd[6666]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:45sec ]
Feb 26 21:00:37 log_file_location_hostname_4 notice mcpd[6666]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:45sec ]
Feb 26 21:00:36 log_file_location_hostname_1 notice mcpd[9999]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10004 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:17mins:40sec ]
Feb 26 21:00:36 log_file_location_hostname_1 notice mcpd[9999]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10004 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:17mins:40sec ]
Feb 26 21:00:35 log_file_location_hostname_3 notice mcpd[7777]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10004 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:17mins:39sec ]
Feb 26 21:00:35 log_file_location_hostname_2 notice mcpd[8888]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:39sec ]
Feb 26 21:00:35 log_file_location_hostname_2 notice mcpd[8888]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:39sec ]
Feb 26 21:00:34 log_file_location_hostname_3 notice mcpd[7777]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10004 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:17mins:39sec ]
Feb 26 21:00:34 log_file_location_hostname_4 notice mcpd[6666]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:40sec ]
Feb 26 21:00:33 log_file_location_hostname_2 notice mcpd[8888]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:39sec ]
Feb 26 21:00:33 log_file_location_hostname_2 notice mcpd[8888]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:39sec ]
Feb 26 20:55:47 log_file_location_hostname_2 notice mcpd[8888]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:26sec ]
Feb 26 20:55:47 log_file_location_hostname_3 notice mcpd[7777]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:26sec ]
Feb 26 20:55:47 log_file_location_hostname_2 notice mcpd[8888]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:26sec ]
Feb 26 20:55:47 log_file_location_hostname_3 notice mcpd[7777]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:26sec ]
Feb 26 20:55:46 log_file_location_hostname_4 notice mcpd[6666]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:26sec ]
Feb 26 20:55:46 log_file_location_hostname_4 notice mcpd[6666]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:26sec ]
Feb 26 20:55:45 log_file_location_hostname_1 notice mcpd[9999]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:22sec ]
Feb 26 20:55:45 log_file_location_hostname_4 notice mcpd[6666]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:21sec ]
Feb 26 20:55:45 log_file_location_hostname_2 notice mcpd[8888]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:21sec ]
Feb 26 20:55:45 log_file_location_hostname_4 notice mcpd[6666]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:21sec ]
Feb 26 20:55:45 log_file_location_hostname_2 notice mcpd[8888]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:21sec ]
Feb 26 20:55:45 log_file_location_hostname_1 notice mcpd[9999]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:22sec ]
Feb 26 20:55:44 log_file_location_hostname_3 notice mcpd[7777]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:21sec ]
Feb 26 20:55:44 log_file_location_hostname_1 notice mcpd[9999]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:22sec ]
Feb 26 20:55:44 log_file_location_hostname_3 notice mcpd[7777]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:21sec ]
Feb 26 20:55:44 log_file_location_hostname_1 notice mcpd[9999]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:22sec ]

FrankVl
Ultra Champion

What happens when you simply remove the endswith=("component status up") part from your original query and adjust your where clause to either look for durations >600, or transactions without an "UP" event that are of a certain age?

alexeyglukhov
Path Finder

Hi Frank,

Thank you very much for your suggestion, it works.
I only recently started using Splunk and didn't know that "endswith=" is not a mandatory parameter for transaction.

So I modified the search string as you suggested:

| transaction myvip pool_name_and_port mcpd_code
startswith=("component status down")
| eval is_up_event_found=if(like(_raw, "%component status up%"), "yes", "no")
| where duration>600 OR is_up_event_found="no"
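
One possible refinement, offered as a sketch rather than a tested search: as written, the where clause matches any transaction without an "up" event, even one that started seconds ago. To alert only when a still-open outage is older than 10 minutes, the age of the transaction could be compared against now(). The field name seconds_since_down is hypothetical, and this assumes that _time on a transaction is the timestamp of its earliest event (the first "down" message).

```spl
| transaction myvip pool_name_and_port mcpd_code
    startswith=("component status down")
| eval is_up_event_found=if(like(_raw, "%component status up%"), "yes", "no")
| eval seconds_since_down=now() - _time
| where (is_up_event_found="yes" AND duration>600)
    OR (is_up_event_found="no" AND seconds_since_down>600)
```

The first branch keeps outages that have already recovered after more than 600 seconds; the second keeps outages that are still open and have lasted more than 600 seconds so far. If only ongoing outages should trigger the alert, the first branch could be dropped.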
