I am currently sending all Cisco ACE load balancer syslogs to my Splunk server.
Within Splunk, I have two separate real-time alerts - one alert notifies me via email when a certain server goes down and a separate alert notifies me when the server comes back up.
Is it possible to create a custom alert where I will only be notified if the server does not come back up after being down for more than X hours? Receiving separate up and down alerts is very annoying, and sometimes there are so many emails that I can't tell whether an UP alert matches a DOWN alert.
If this is possible, how would I go about implementing it? Thanks
To provide a little more detail, here is exactly what my real-time alerts look like:
Alert 1 - "Particular Server Name" Changed State to DOWN - send email
Alert 2 - "Particular Server Name" Changed State to UP - send email
Where the server name is an arbitrary name of a server that wouldn't mean anything to anybody even if I did copy it directly from my alert.
Sometimes the patching team fails to bring up a server properly and we find out the hard way when somebody complains. I actually have dozens of alerts just like this but for different servers. However, one solution would apply for all of my alerts.
Try this search:
"Health Probe" "changed state to" | rex "Health\sProbe\s(?<probe_name>[^_]+)_" | rex "changed\sstate\sto\s(?<state>[^\$]+)$" | transaction fields="probe_name,state" startswith=UP endswith=DOWN keepevicted=t | search duration > 10800
Each rex pattern needs a closing quote just before the pipe: one after +)_ and one after +)$. Sorry, I left them out of my original post (or else Answers ate them).
Error in 'SearchParser': Missing a search command before '^'.
I assume that there are events that show a down message? And that they're pretty much the same text as the up messages you posted (only with a "DOWN" at the end)?
So given an up message of this:
[Date] [Time] [Server IP] : [Tag]: Health Probe NY_HTTP:80_PROBE detected Server Name in serverfarm NY_Serverfarm_01 changed state to UP
and a down message of this:
[Date] [Time] [Server IP] : [Tag]: Health Probe NY_HTTP:80_PROBE detected Server Name in serverfarm NY_Serverfarm_01 changed state to DOWN
And you want to create a transaction based on an up message followed by a down message for the probe name (i.e. "NY")? Is that correct? If so, you'd want something like this:
"Health Probe" "changed state to" | rex "Health\sProbe\s(?<probe_name>[^_]+)_" | rex "changed\sstate\sto\s(?<state>[^\$]+)$" | transaction fields="probe_name,state" maxspan=180m startswith=UP endswith=DOWN keepevicted=t
Hope that helps!
Your assumptions are correct: there is an UP message for every DOWN... at least there should be.
Specifically, I want an email sent out if an UP message is not received within 3 hours of seeing a DOWN message. This way admins can take action and bring it back up properly.
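One way to express this requirement without transaction is to keep only the most recent state per probe and flag it when that state is DOWN and more than three hours old. This is a minimal sketch, not a tested search. It assumes the events look like the samples in this thread and that the search runs on a schedule (say, every 15 minutes over the last 24 hours) rather than in real time. The field names probe_name, state, last_state and last_change are chosen here for illustration.

```spl
"Health Probe" "changed state to"
| rex "Health\sProbe\s(?<probe_name>[^_]+)_"
| rex "changed\sstate\sto\s(?<state>UP|DOWN)"
| stats latest(state) AS last_state latest(_time) AS last_change BY probe_name
| where last_state="DOWN" AND last_change < relative_time(now(), "-3h")
| convert ctime(last_change)
```

Saved as an alert that triggers when the number of results is greater than zero, this would send one email per run while something is still down. Because several servers share the same probe prefix, grouping by probe_name alone lets an UP from one server mask a DOWN from another. Adding a per-server field to the BY clause would avoid that, if such a field can be extracted.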
Mike, per our discussion, here is what an actual log in Splunk looks like:
[Date] [Time] [Server IP] : [Tag]: Health Probe NY_HTTP:80_PROBE detected Server Name in serverfarm NY_Serverfarm_01 changed state to UP
Another example would be this:
[Date] [Time] [Server IP] : [Tag]: Health Probe NY_HTTP:8080_PROBE detected Server Name in serverfarm NY_Serverfarm_02 changed state to UP
As you can see, we need "NY" and UP or DOWN to be extracted so they can be referenced in your transaction fields expression. We can't use server farms or server names because there are too many, but the beginning of the probe name is always the same: NY in this case.
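Before using these fields in a transaction or an alert, it may help to confirm the extraction on its own. This is a sketch that assumes the log layout shown above. The field names probe_name and state are picked for illustration.

```spl
"Health Probe" "changed state to"
| rex "Health\sProbe\s(?<probe_name>[^_]+)_"
| rex "changed\sstate\sto\s(?<state>UP|DOWN)"
| table _time probe_name state
```

For the two sample events above, this should show probe_name as NY and state as UP. If either column comes back empty, the rex pattern does not match the raw event and needs adjusting before anything else will work.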
"Particular Server Name" "Changed State to" (DOWN OR UP) | transaction fields="dvc,state" maxspan=180m startswith=Down endswith=Up keepevicted=t
This assumes a few things:
If that search works correctly, save it and set up an alert.
Hope that helps.
By the way, if dvc and state aren't being extracted, you can do that within your search.
"Particular Server Name" "Changed State to" (DOWN OR UP) | rex "^(
The Cisco ACE module has probes configured on the device to check the status of any particular server. That probe information is written to the syslogs. My alerts are based on the probe messages that I see in Splunk.
How do you know the server is down? I mean, is there anything you do to check its status?