Splunk Search

How to detect a flapping load balancer - up to down and back 5x times?

amoshos
Loves-to-Learn

Hi all,

First time posting here, so please be patient; I am relatively new to the Splunk environment and I am struggling to figure out this search.

My manager has asked me to create an alert for Load Balancers flapping on our server.

Criteria:
- Runs every 15 mins (I assume this can be set in the "alert" settings)
- Fires if a load balancer switches from Up to Down and back more than 5 times

This second point is what I am struggling to work out; this is what I have so far:

index=xxx  sourcetype="xxx" host="xxx" (State=UP OR State=DOWN) State="*"
| stats count by State
| eval state_status = if(DOWN+UP == 5, "Problem", "OK")
| stats count by state_status

Note: "State" is the field in question; it stores the UP/DOWN values for each event.


Based on this, I can get an individual count of how many times the load balancer showed UP and how many times it showed DOWN; however, I need to turn this into a threshold search that only displays a count when it has changed from UP to DOWN 5 consecutive times.

Any and all help will be much appreciated.


bowesmana
SplunkTrust

@amoshos 

If you are looking to count transitions, then use streamstats. 

Note that examples like this, which use | makeresults, are generally designed to show you how you can achieve something; when adapting them, replace the makeresults portion with your own base search.

This is a simple example you can run in the search window, which will create alternating events over 15 minutes. It achieves this by

  • sorting the events in time order, so the earliest one comes first
  • adding the two adjacent event states into a new field called states, which contains the previous event's state and the current event's state
  • checking if the two states are in the order UP->DOWN, indicating the previous state was up and the new state is down (value 1), or 0 if not
  • summing all the value-1 transitions from above
| makeresults count=15
| streamstats c
| eval _time=now() - (c * 60)
| eval state=if(c%2=1, "UP", "DOWN")
| sort _time
| streamstats window=2 list(state) as states
| eval transition=if(mvjoin(states,":")="UP:DOWN", 1, 0)
| stats sum(transition) as flaps

Note this is done so that if you run this example

| makeresults count=15
| streamstats c
| eval _time=now() - (c * 60)
| eval state=if(c%3!=0, "UP", "DOWN")
| sort _time
| streamstats window=2 list(state) as states
| eval transition=if(mvjoin(states,":")="UP:DOWN", 1, 0)
| stats sum(transition) as flaps

which makes an UP, UP, DOWN, UP, UP, DOWN sequence, it will only treat the UP->DOWN change as a transition and ignore the additional UP messages.
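
To apply that pattern to your own data, a rough sketch might look like the following (this is only an illustration built from the placeholder index/sourcetype/host and the State field in your original search, with the "more than 5 times" threshold from your criteria; the 15 minute window would come from the alert's schedule and time range):

index=xxx sourcetype="xxx" host="xxx" (State=UP OR State=DOWN)
| sort 0 _time
| streamstats window=2 list(State) as states
| eval transition=if(mvjoin(states,":")="UP:DOWN", 1, 0)
| stats sum(transition) as flaps
| where flaps > 5

The final where line is what would make the alert fire only when the threshold is exceeded.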

Note also if you want to start doing this by host, then your streamstats would look like this

| streamstats window=2 global=f list(state) as states by host

and you would also add by host to the final stats.
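
For example, the tail end of the sketch above, done by host, might become something like this (again an assumption on my part, adjust to your own field names):

| streamstats window=2 global=f list(State) as states by host
| eval transition=if(mvjoin(states,":")="UP:DOWN", 1, 0)
| stats sum(transition) as flaps by host
| where flaps > 5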


amoshos
Loves-to-Learn

Hi @bowesmana 

I appreciate the long and in-depth response; however, I'm not sure how to apply that to my scenario (I'm a relatively new Splunk user).

My manager has advised that this process is too complicated and that a simple count of up and down events for the load balancer / VIP in question, with a threshold search, is all that is needed. I'm not too sure how to apply your examples to my scenario.


bowesmana
SplunkTrust

If you just want to count the number of up and the number of down messages regardless of sequence, and to alert when the total is more than 5 (whether that is 4 up and 1 down or vice-versa), then

index=xxx  sourcetype="xxx" host="xxx" (State=UP OR State=DOWN) 
| stats count 

will just give you a count, but I assume you need some logic in there that determines whether DOWN is > 0

so there are lots of ways to do this, but simply

index=xxx  sourcetype="xxx" host="xxx" (State=UP OR State=DOWN) 
| stats count by State
| transpose header_field=State
| where DOWN>2 AND UP>2

or

index=xxx  sourcetype="xxx" host="xxx" (State=UP OR State=DOWN) 
| chart count over host by State
| where DOWN>2 AND UP>2
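
If you also need the "more than 5 times" part of your original criteria, a sketch of that threshold on top of the per-host counts could look like this (the UP/DOWN columns come from the chart; the total field and the cut-off of 5 are my assumptions from your description):

index=xxx  sourcetype="xxx" host="xxx" (State=UP OR State=DOWN) 
| chart count over host by State
| eval total=UP+DOWN
| where DOWN>0 AND UP>0 AND total>5

Adjust the threshold values to whatever your manager settles on.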

Depending on your data these may be OK, but hopefully they will give you a way to make it work for you.
