Splunk Search

How to calculate limits for agents based on their activity?

exmuzzy
Explorer

I want to receive notifications if agents fall below or exceed their normal activity for the current day of the week and hour.
I plan to achieve this by creating a daily lookup table containing the 90th-percentile (perc90) number of events per day of the week / hour.

The search that creates the daily lookup table looks like this:

index=api*
| stats count(eval(result=="OK")) as success, count(eval(result!="OK")) as fail by agent, method, date_month, date_mday, date_wday, date_hour, date_minute
| stats min(success) as min_success, min(fail) as min_fail, perc90(success) as perc90_success, perc90(fail) as perc90_fail, max(success) as max_success, max(fail) as max_fail by agent, method, date_wday, date_hour, date_minute
| strcat agent "@" method "@" date_wday date_hour ":" date_minute key
| fields key, min_success, min_fail, perc90_success, perc90_fail, max_success, max_fail
| outputlookup agent_limits append=false key_field=key
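Something like this is what I have in mind for the comparison side (just a sketch; it assumes the agent_limits lookup populated above, queried at the same per-minute granularity, and the thresholds in the where clause are only illustrative):

index=api* earliest=-60m@m
| stats count(eval(result=="OK")) as success, count(eval(result!="OK")) as fail by agent, method, date_wday, date_hour, date_minute
| strcat agent "@" method "@" date_wday date_hour ":" date_minute key
| lookup agent_limits key OUTPUT min_success perc90_success min_fail perc90_fail
| where success < min_success OR success > perc90_success OR fail > perc90_fail

Scheduled hourly, any rows this returns would be the agent/method combinations outside their stored range.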

But now I'm not sure how to move on.
How do I get an overview showing whether agent activity is within its normal range?
Should it be an alert, a dashboard, or a pivot?

And how do I manually correct the limits if an agent has gone outside its normal activity range, but the new level is actually the new norm, or simply growth over time?


DalJeanis
Legend

You're a bit deep into the weeds. The first thing you need to do is understand your data.

What you have there is the minimum, maximum, and 90th percentile of activity for that particular hour and minute of that day of the week. Obviously, I don't know your data, but is that really a reasonable representation of how your data changes? Is 4 AM on Monday that much different from 4 AM on Tuesday? Is 4:03 AM on Wednesday really that different from 4:05 AM on Wednesday?

First, run this and look at the results. What is the shape of your data?

index = api*  agent=* method=*
| bin _time span=1m
| stats count(eval(result=="OK")) as success, count(eval(result!="OK")) as fail by agent, method, date_wday, date_hour, date_minute, _time
| eval total= success+fail
| eval good = round(100*success/total,0)
| eval bad = round(100*fail/total,0)
| stats min(success) as min_success, min(fail) as min_fail, min(total) as min_total, 
    perc90(success) as perc90_success, perc90(fail) as perc90_fail, perc90(total) as perc90_total, 
    max(success) as max_success, max(fail) as max_fail, max(total) as max_total, 
    avg(success) as avg_success, avg(fail) as avg_fail, avg(total) as avg_total, 
    stdev(success) as stdev_success, stdev(fail) as stdev_fail, stdev(total) as stdev_total, 
    perc90(good) as perc90_good, perc90(bad) as perc90_bad,
    min(good) as min_good, min(bad) as min_bad,
    max(good) as max_good, max(bad) as max_bad,
    avg(good) as avg_good, avg(bad) as avg_bad,
    stdev(good) as stdev_good, stdev(bad) as stdev_bad,
    max(_time) as _time
    by agent, method, date_wday, date_hour, date_minute 
| table _time agent method min_* perc* max_* avg* stdev* date*
| outputcsv mytest.csv 

That gives you a csv file of your results as of this moment in time, covering one full week back. Then you can use that csv to explore the data, and it will give you a picture of your average week. From there, there are literally hundreds of ways to go, depending on what you find.

You are trying to understand the features of your data, and find out what is most important. How does the total number of transactions ebb and flow? Are there times of day that are different in character? Are there agents or methods that are more volatile than others? More predictable?
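For example, one hypothetical way to start exploring the saved csv (field names are the ones produced by the search above) is to chart how the per-minute volume moves by hour across the days of the week:

| inputcsv mytest.csv
| eval hour = tonumber(date_hour)
| chart avg(perc90_total) over hour by date_wday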


DalJeanis
Legend

Is the number of transactions that are good most important, or the percentage?

How does the 90th percentile relate to the avg and the stdev? Basically, this is the z score of the 90th percentile:

| eval z_x = round( (perc90_x - avg_x) / stdev_x , 2)

That's usually going to be about 1.28, but the amount by which it differs might tell you something about the distribution you are looking at.
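Applied to the csv from the earlier search, that check might look something like this (a sketch only; _total is one of the field suffixes that search produces):

| inputcsv mytest.csv
| where stdev_total > 0
| eval z_total = round( (perc90_total - avg_total) / stdev_total , 2)
| stats avg(z_total) as avg_z min(z_total) as min_z max(z_total) as max_z by agent, method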
