Splunk Search

How to search for a component that has been down for some time when the "up" event hasn't appeared within that period?

alexeyglukhov
Path Finder

Hello all!
The task is to alert if a component (pool) is down for more than 10 minutes.

Some details:
There are down and up events for many pools (entry examples are below).
So at first I used a transaction to find those pairs of events and compared the duration to the threshold, which works perfectly (the search string is below).

| transaction myvip pool_name_and_port mcpd_code
startswith=("component status down")
endswith=("component status up")
| where duration>600
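
(Note: this assumes that myvip, pool_name_and_port and mcpd_code are already extracted as fields. If they are not, a rex roughly like the one below could pull the pool and member out of the raw message; the field mapping here is only a guess based on the sample entries further down and would need to be adjusted to the real extractions.)

| rex field=_raw "Pool /Common/(?<myvip>[^/]+)/\S+ member /Common/(?<pool_name_and_port>\S+) component status (?<component_status>up|down)"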

But the problem is that for a transaction to show up in the alert I always need the second, "up" message (and that can be delayed by hours or days), so this doesn't help me generate an alert as soon as a pool has been down for more than 10 minutes.

So I need your help detecting that a component has been down for more than some period of time (10 minutes in my case) and no "component status up" event has appeared within that period.

Thank you in advance for any help!

Entry examples:
Feb 26 21:11:23 log_file_location_hostname_1 notice mcpd[9999]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10003 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:15mins:39sec ]
Feb 26 21:11:22 log_file_location_hostname_2 notice mcpd[8888]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10003 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:15mins:35sec ]
Feb 26 21:11:22 log_file_location_hostname_2 notice mcpd[8888]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10003 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:15mins:35sec ]
Feb 26 21:11:21 log_file_location_hostname_3 notice mcpd[7777]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10003 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:15mins:34sec ]
Feb 26 21:11:21 log_file_location_hostname_3 notice mcpd[7777]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10003 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:15mins:34sec ]
Feb 26 21:00:37 log_file_location_hostname_1 notice mcpd[9999]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10004 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:17mins:45sec ]
Feb 26 21:00:37 log_file_location_hostname_4 notice mcpd[6666]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:45sec ]
Feb 26 21:00:37 log_file_location_hostname_4 notice mcpd[6666]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:45sec ]
Feb 26 21:00:36 log_file_location_hostname_1 notice mcpd[9999]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10004 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:17mins:40sec ]
Feb 26 21:00:36 log_file_location_hostname_1 notice mcpd[9999]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10004 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:17mins:40sec ]
Feb 26 21:00:35 log_file_location_hostname_3 notice mcpd[7777]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10004 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:17mins:39sec ]
Feb 26 21:00:35 log_file_location_hostname_2 notice mcpd[8888]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:39sec ]
Feb 26 21:00:35 log_file_location_hostname_2 notice mcpd[8888]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:39sec ]
Feb 26 21:00:34 log_file_location_hostname_3 notice mcpd[7777]: 01070727:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10004 component status up. [ /Common/myvip1/appname_https_component: up ] [ was down for 0hr:17mins:39sec ]
Feb 26 21:00:34 log_file_location_hostname_4 notice mcpd[6666]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:40sec ]
Feb 26 21:00:33 log_file_location_hostname_2 notice mcpd[8888]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:39sec ]
Feb 26 21:00:33 log_file_location_hostname_2 notice mcpd[8888]: 01070727:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10004 component status up. [ /Common/myvip2/appname_https_component: up ] [ was down for 0hr:17mins:39sec ]
Feb 26 20:55:47 log_file_location_hostname_2 notice mcpd[8888]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:26sec ]
Feb 26 20:55:47 log_file_location_hostname_3 notice mcpd[7777]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:26sec ]
Feb 26 20:55:47 log_file_location_hostname_2 notice mcpd[8888]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:26sec ]
Feb 26 20:55:47 log_file_location_hostname_3 notice mcpd[7777]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:26sec ]
Feb 26 20:55:46 log_file_location_hostname_4 notice mcpd[6666]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:26sec ]
Feb 26 20:55:46 log_file_location_hostname_4 notice mcpd[6666]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost2:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:26sec ]
Feb 26 20:55:45 log_file_location_hostname_1 notice mcpd[9999]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:22sec ]
Feb 26 20:55:45 log_file_location_hostname_4 notice mcpd[6666]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:21sec ]
Feb 26 20:55:45 log_file_location_hostname_2 notice mcpd[8888]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:21sec ]
Feb 26 20:55:45 log_file_location_hostname_4 notice mcpd[6666]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:21sec ]
Feb 26 20:55:45 log_file_location_hostname_2 notice mcpd[8888]: 01070638:5: Pool /Common/myvip2/appname_pool member /Common/myhost4:10003 component status down. [ /Common/myvip2/appname_https_component: down ] [ was up for 0hr:1min:21sec ]
Feb 26 20:55:45 log_file_location_hostname_1 notice mcpd[9999]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost3:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:22sec ]
Feb 26 20:55:44 log_file_location_hostname_3 notice mcpd[7777]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:21sec ]
Feb 26 20:55:44 log_file_location_hostname_1 notice mcpd[9999]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:22sec ]
Feb 26 20:55:44 log_file_location_hostname_3 notice mcpd[7777]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:21sec ]
Feb 26 20:55:44 log_file_location_hostname_1 notice mcpd[9999]: 01070638:5: Pool /Common/myvip1/appname_pool member /Common/myhost1:10003 component status down. [ /Common/myvip1/appname_https_component: down ] [ was up for 0hr:1min:22sec ]

1 Solution

FrankVl
Ultra Champion

What happens if you simply remove the endswith=("component status up") part from your original query and adjust your where clause to look either for durations > 600 seconds, or for transactions without an "up" event that are older than a certain age?
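
A sketch of what that could look like (keepevicted=true keeps transactions that never saw the closing "up" event and marks them with closed_txn=0; the field names follow the original search and 600 seconds is just the 10-minute example):

| transaction myvip pool_name_and_port mcpd_code
startswith=("component status down")
keepevicted=true
| where duration>600 OR (closed_txn=0 AND now()-_time>600)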

alexeyglukhov
Path Finder

Hi Frank,

Thank you very much for your suggestion - it works.
I only recently started playing with Splunk and didn't know that "endswith=" is not a mandatory parameter for transaction.

So I modified the search as you suggested:

| transaction myvip pool_name_and_port mcpd_code
startswith=("component status down")
| eval is_up_event_found=if(like(_raw, "%component status up%"), "yes", "no")
| where duration>600 OR is_up_event_found="no"
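
For reference, a transaction-free way to do the same check (a sketch, reusing the same field names and the like() test from above) is to take the most recent status event per pool member and alert when it is a "down" older than 10 minutes:

| eval status=if(like(_raw, "%component status up%"), "up", "down")
| stats latest(status) as last_status, latest(_time) as last_seen by myvip, pool_name_and_port
| where last_status="down" AND now()-last_seen>600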
