I want to receive notifications if agents lower or exceed their normal activity for the current day of the week and hour.
I'm going to achieve this by creating a daily lookup table with the counted perc90 number of events per day of the week / hour
The search to create daily lookup table looks like this
index=api* | stats count(eval(result=="OK")) as success, count(eval(result!="OK")) as fail by agent, method, date_month, date_mday, date_wday, date_hour, date_minute | stats min(success) as min_success, min(fail) as min_fail, perc90(success) as perc90_success, perc90(fail) as perc90_fail, max(success) as max_success, max(fail) as max_fail by agent, method, date_wday, date_hour, date_minute | strcat agent "@" method "@" date_wday date_hour ":" date_minute key | fields key, min_success, min_fail, perc90_success, perc90_fail, max_success, min_fail | outputlookup agent_limits append=False key_field=key
But now I'm in doubt how to move on.
How to get overview filling that agents activity in normal ranges?
should it be alert or dashboard or pivot?
how to make a correction manually, if the agent has gone beyond the ranges of his normal activity, but this is the same norm or just growth over time?
You're a bit deep into the weeds. The first thing you need to do is understand your data.
What you have there is the minimum, maximum, and 90th percentile of activity for that particular hour and minute of that day of the week. Obviously, I don't know your data, but is that really a reasonable representation of how your data changes? Is 4 AM on Monday that much different from 4 AM on Tuesday? Is 4:03 AM on Wednesday really that different from 4:05 AM on Wednesday?
First, run this and look at the results. What is the shape of your data?
index = api* agent=* method=*
| bin _time span=1m
| stats count(eval(result=="OK")) as success, count(eval(result!="OK")) as fail by agent, method, date_wday, date_hour, date_minute, _time
| eval total= success+fail
| eval good = round(100*success/total,0)
| eval bad = round(100*fail/total,0)
| stats min(success) as min_success, min(fail) as min_fail, min(total) as min_total,
perc90(success) as perc90_success, perc90(fail) as perc90_fail, perc90(total) as perc90_total,
max(success) as max_success, max(fail) as max_fail, max(total) as max_total,
avg(success) as avg_success, avg(fail) as avg_fail, avg(total) as avg_total,
stdev(success) as stdev_success, stdev(fail) as stdev_fail, stdev(total) as stdev_total,
perc90(good) as perc90_good, perc90(bad) as perc90_bad,
min(good) as min_good, min(bad) as min_bad,
max(good) as max_good, max(bad) as max_bad,
avg(good) as avg_good, avg(bad) as avg_bad,
stdev(good) as max_good, stdev(bad) as stdev_bad,
max(_time) as _time
by agent, method, date_wday, date_hour, date_minute
| table _time agent method min_* perc* max_* avg* stdev* date*
| outputcsv mytest.csv
That gives you a csv file of your results for this particular moment in time, back one full week. Then you can use the csv of results to explore the data. That will give you a picture of your average week. From there, there are literally hundreds of ways to go, depending on what you find.
You are trying to understand the features of your data, and find out what is most important. How does the total number of transactions ebb and flow? Are there times of day that are different in character? Are there agents or methods that are more volatile than others? More predictable?
Is the number of transactions that are good most important, or the percentage?
How does the 90th percentile relate to the avg and the stdev? Basically, this is the z score of the 90th percentile | eval z_x=round( (perc90_x - avg_x)/stdev_x , 2)
That's usually going to be about 1.28, but the amount by which it is different might tell you something about the distribution you are looking at.