Splunk Search

simple moving average and dynamic fields

jlvix1
Communicator

This will be very interesting or boring, it can only be one!

I have an extracted field: CFErrorCodeMessagesCode

This can contain one of many possible values, e.g. "Code (216)" "Code (9999)" e.g. "Code (xxxx)"

Normally, I have a spreadsheet that creates me a large query to run, for an alert on a cron schedule, that holds a row for each code. e.g with 3 codes the string it generates looks like this:

earliest=-10d index="cbeprodvteclogs" CFErrorCodeMessagesCode="Code (1)" | timechart count(CFErrorCodeMessagesCode) as cnt | trendline sma5(cnt) as cnt2 | eval spike=if(if(cnt <= 10, 0, cnt) > cnt2 * 3 , 1, 0) | eval CFECM="Code (1)" | FIELDS CFECM, cnt, cnt2, spike | append [ search 
earliest=-10d index="cbeprodvteclogs" CFErrorCodeMessagesCode="Code (106)" | timechart count(CFErrorCodeMessagesCode) as cnt | trendline sma5(cnt) as cnt2 | eval spike=if(if(cnt <= 10, 0, cnt) > cnt2 * 3 , 1, 0) | eval CFECM="Code (106)" | FIELDS CFECM, cnt, cnt2, spike ] | append [ search 
earliest=-10d index="cbeprodvteclogs" CFErrorCodeMessagesCode="Code (9999)" | timechart count(CFErrorCodeMessagesCode) as cnt | trendline sma5(cnt) as cnt2 | eval spike=if(if(cnt <= 10, 0, cnt) > cnt2 * 3 , 1, 0) | eval CFECM="Code (9999)" | FIELDS CFECM, cnt, cnt2, spike ] | where strftime(_time, "%Y-%m-%d") = strftime(now(), "%Y-%m-%d") and spike=1

The idea is to alert us of obscure spikes using sma5 over a 10 day data window, it's very effective and even better when sourced from a spreadsheet as I can tune every aspect of it via the columns.

Another alert I have is for codes the application pushes out that aren't in the SMA query (above).

I have 34 codes on one sheet/alert, this runs ok, on another sheet I have almost 60, and the performance has degraded since building up the list, and also since ingesting more data in to the index, but I don't see how this matters because I am explicitly specifying the code I need to pull.

I was considering putting the data to its own index, but this would complicate all sorts of stuff I've developed.

What I was wondering was: Is it possible to just have a single line that dynamically builds me a list? Something like the below query (which doesn't work, nor other variations I've tried)

earliest=-10d index="cbeprodswtlogs" CFErrorCodeMessagesCode=* | timechart count as cnt by CFErrorCodeMessagesCode | trendline sma5(cnt) as cnt2 | eval spike=if(if(cnt <= 50, 0, cnt) > cnt2 * 2 , 1, 0) | FIELDS CFErrorCodeMessagesCode, cnt, cnt2, spike

This would allow me to not have to maintain a spreadsheet and a separate alert for codes that aren't included already, it may also speed it up.

Thanks in advance.

1 Solution

jlvix1
Communicator

I finally figured this out with the help of somerford associates:

earliest=-10d index="cbeprodvteclogs" CFErrorCodeMessagesCode=* |
timechart limit=0 count as cnt by CFErrorCodeMessagesCode |
untable _time CFErrorCodeMessagesCode cnt |
sort 0 CFErrorCodeMessagesCode _time
| autoregress cnt P=1-4 | autoregress CFErrorCodeMessagesCode P=1-4
| eval cnt2=(round(cnt,1) + if(CFErrorCodeMessagesCode_p1==CFErrorCodeMessagesCode,cnt_p1,null()) + if(CFErrorCodeMessagesCode_p2==CFErrorCodeMessagesCode,cnt_p2,null()) +if(CFErrorCodeMessagesCode_p3==CFErrorCodeMessagesCode,cnt_p3,null()) +if(CFErrorCodeMessagesCode_p4==CFErrorCodeMessagesCode,cnt_p4,null())  ) / 5
  | eval spike=if(if(cnt <= 10, 0, cnt) > cnt2 * 2 , 1, 0)
  | FIELDS CFErrorCodeMessagesCode, cnt, cnt2, spike

The query is modifiable as well and can work with any field (host, sourcetype etc..), feel free to use on your own projects...

The removal of bin and use of timechart with untable fixes the issue above.

Thanks all.

View solution in original post

0 Karma

jlvix1
Communicator

I finally figured this out with the help of somerford associates:

earliest=-10d index="cbeprodvteclogs" CFErrorCodeMessagesCode=* |
timechart limit=0 count as cnt by CFErrorCodeMessagesCode |
untable _time CFErrorCodeMessagesCode cnt |
sort 0 CFErrorCodeMessagesCode _time
| autoregress cnt P=1-4 | autoregress CFErrorCodeMessagesCode P=1-4
| eval cnt2=(round(cnt,1) + if(CFErrorCodeMessagesCode_p1==CFErrorCodeMessagesCode,cnt_p1,null()) + if(CFErrorCodeMessagesCode_p2==CFErrorCodeMessagesCode,cnt_p2,null()) +if(CFErrorCodeMessagesCode_p3==CFErrorCodeMessagesCode,cnt_p3,null()) +if(CFErrorCodeMessagesCode_p4==CFErrorCodeMessagesCode,cnt_p4,null())  ) / 5
  | eval spike=if(if(cnt <= 10, 0, cnt) > cnt2 * 2 , 1, 0)
  | FIELDS CFErrorCodeMessagesCode, cnt, cnt2, spike

The query is modifiable as well and can work with any field (host, sourcetype etc..), feel free to use on your own projects...

The removal of bin and use of timechart with untable fixes the issue above.

Thanks all.

0 Karma

DalJeanis
Legend

"how this matters" would depend on whether the CFErrorCodeMessagesCode was extracted at index time or discovered at search time.

0 Karma

jlvix1
Communicator

Extracted at index time 🙂

0 Karma

DalJeanis
Legend

The more tests you do-- that is, the more values from the index that you are going to pull--then the less effiective the index will be. Every non-matching index value could be tested against more individual tests. Or might not, depending on how smart the search optimizer is.

At some point, it becomes more efficient for the system to just calculate ALL the values, and then select the sums you want.

0 Karma

jlvix1
Communicator

There is no other way as I need to get the SMA of these values. The query doesn't halt, I think if I need to stick to the original method then Iwill cut down the codes/lines in the query.

0 Karma

DalJeanis
Legend

Try this -

 earliest=-10d index="cbeprodswtlogs" CFErrorCodeMessagesCode=* 
| timechart count as cnt by CFErrorCodeMessagesCode 
| sort 0 CFECM _time 
| autoregress cnt P=1-5 | autoregress CFECM P=1-5 
| eval cnt2=(round(cnt,1) + if(CFECM_p1==CFECM,cnt_p1,null()) + if(CFECM_p2==CFECM,cnt_p2,null()) +if(CFECM_p3==CFECM,cnt_p3,null()) +if(CFECM_p4==CFECM,cnt_p4,null())  )  /5
| eval spike=if(if(cnt <= 50, 0, cnt) > cnt2 * 2 , 1, 0) 
| FIELDS CFErrorCodeMessagesCode, cnt, cnt2, spike

Lines 3-5 are a manual method to simulate the sma5. First, we make sure the records are sorted into CFECM and time order. Second, Autoregress puts the values from the previous 4 records onto this record. Finally, we laboriously make sure we are dealing with the same CFECM. The following test code shows that it achieves the same values as sma5.

| makeresults| eval CFECM="Code (1)" | eval cnt="10 20 30 40 50 60 70 80 90 100" 
| makemv cnt | mvexpand cnt 
| sort 0 CFECM _time 
| autoregress cnt P=1-4 | autoregress CFECM P=1-4
| eval cnt2=(round(cnt,1) + if(CFECM_p1==CFECM,cnt_p1,null()) + if(CFECM_p2==CFECM,cnt_p2,null()) +if(CFECM_p3==CFECM,cnt_p3,null()) +if(CFECM_p4==CFECM,cnt_p4,null())  )  /5
| trendline sma5(cnt) as cnt2sma
| table CFECM cnt cnt2 cnt2sma
0 Karma

jlvix1
Communicator

I've made a slight adjustment to the autoregress part, the splunk docs make out that we should have P=1-4 with the additions that work out cnt2, otherwise we need to add CFErrorCodeMessagesCode_p5 to it?

See example 3 here: https://docs.splunk.com/Documentation/Splunk/6.5.2/SearchReference/Autoregress

0 Karma

jlvix1
Communicator

This is almost working but there is quite a snag which I think I may be able to work around by some tweaks and maybe interpreting the results differently

There is a key problem I've tried to work around but failing on 2 different attempts due to there being zero of a specific code on certain days:
1) tried using fillnull
2) tried using stats sum(eval(if(isnull(_time),0,1))) as cnt by _time CFErrorCodeMessagesCode

The reason is because in my original query I'm forcing the code in hence splunk forcefully gives a row for each code on each day, e.g:

eval CFECM="Code (1)" | FIELDS CFECM

This results in a line being created for a particular day which is when empty is something like:

Code (1) 0 null/empty 0 2017-02-28 (for first 4) **or** Code (1) 0 0 0 2017-02-28 (for the rest)

With the new query, I'm unable to work out Code (x) because Code (x) only comes out on 21, 22, 25 & 28 of Feb, there are no zero days in between for the SMA to be viable.

If I could post a screenshot I would, it's quite obvious when looking.

For the ones where the code happens at least once on each day for 10 consecutive days, this is pefect and works brilliantly.

The autoregress function is having some issues with addition of null in a prior day where a skipped day is missing:

Code (203) 2 N 0 2017-02-20
Code (203) 22 N 0 2017-02-21
Code (203) 6 N 0 2017-02-23
Code (203) 8 N 0 2017-02-24
Code (203) 4 8.40 0 2017-02-25
Code (203) 2 8.40 0 2017-02-28 < Wrong Result due to null p1 & p2? Trying to refer to 26th and 27th and not the previous row, as seen? Bug? day bin issue?

When replacing null() with 0 I get lots of spike=1 and the first 4 cnt2 rows are calculated, I now understand the autoregress null addition is there to cause the first 4 cnt2 results to be null themselves.

One (crazy) workaround for this would be to artifically feed one of each code in to the system after midnight, lol although I'm not inclined to go so rogue and the boss would not allow it.

0 Karma

jlvix1
Communicator
earliest=-10d index="cbeprodswtlogs" CFErrorCodeMessagesCode=*
 | bin _time span=1d
 | stats count as cnt by _time CFErrorCodeMessagesCode
 | sort 0 CFErrorCodeMessagesCode _time
 | autoregress cnt P=1-4 | autoregress CFErrorCodeMessagesCode P=1-4
 | eval cnt2=(round(cnt,1) + if(CFErrorCodeMessagesCode_p1==CFErrorCodeMessagesCode,cnt_p1,null()) + if(CFErrorCodeMessagesCode_p2==CFErrorCodeMessagesCode,cnt_p2,null()) +if(CFErrorCodeMessagesCode_p3==CFErrorCodeMessagesCode,cnt_p3,null()) +if(CFErrorCodeMessagesCode_p4==CFErrorCodeMessagesCode,cnt_p4,null())  ) / 5
 | eval spike=if(if(cnt <= 10, 0, cnt) > cnt2 * 2 , 1, 0)
 | FIELDS CFErrorCodeMessagesCode, cnt, cnt2, spike 
0 Karma

jlvix1
Communicator

Hi, the first query gives me the same results as the one I posted, it populates with 10 blank rows.

The second query works to demonstrate the method, let me have a play with this and see what I can achieve.

I will have a play, I noticed you used CFECM which is a shortened alias, will replace that with the field name...

thanks, this all looks very good!

0 Karma

DalJeanis
Legend

Yep, I like to hand people everything they need to understand the method and test with it.

Populates with ten blank rows? Hmmm. If that's not right, then replace this command:

 | timechart count as cnt by CFErrorCodeMessagesCode 

with something like this (pick your own time factor day, week, month, whatever):

 | bin _time span=1d
 | stats count as cnt by _time CFErrorCodeMessagesCode  
0 Karma

jlvix1
Communicator

I'm just toying about with this now, seems that where cnt=0 it's not giving results for that day, so I will tune this a bit as it needs the first 5 days to work out the SMA.

0 Karma

jlvix1
Communicator

I tried replacing null() with 0, results look wrong, lots of spike=1 and days are still missing. How can you perform addition on null()? I think there are no codes for that day so think I need to default/substitute where cnt for that day is zero.

Most languages I know of don't consider null in math ops and cause the result to be null via all math operators.

The sma5 is certainly not working because of not having 5 days worth of data, as days where cnt=0 is missing. I will continue to work on this in the meantime. 🙂

0 Karma

jlvix1
Communicator

I have just tried and got a good result but using this query:)))))

earliest=-10d index="cbeprodswtlogs" CFErrorCodeMessagesCode=*
| bin _time span=1d
| stats count as cnt by _time CFErrorCodeMessagesCode

| sort 0 CFErrorCodeMessagesCode _time
| autoregress cnt P=1-5 | autoregress CFErrorCodeMessagesCode P=1-5
| eval cnt2=(round(cnt,1) + if(CFErrorCodeMessagesCode_p1==CFErrorCodeMessagesCode,cnt_p1,null()) + if(CFErrorCodeMessagesCode_p2==CFErrorCodeMessagesCode,cnt_p2,null()) +if(CFErrorCodeMessagesCode_p3==CFErrorCodeMessagesCode,cnt_p3,null()) +if(CFErrorCodeMessagesCode_p4==CFErrorCodeMessagesCode,cnt_p4,null()) ) /5
| eval spike=if(if(cnt <= 50, 0, cnt) > cnt2 * 2 , 1, 0)
| FIELDS CFErrorCodeMessagesCode, cnt, cnt2, spike

Just a note on this to others, the CFErrorCodeMessagesCode field could easily be replaced with any field in splunk to act on that field, such as sourcetype to monitor for specific things lighting up or host... Try it.

The beauty of solving this problem is that the code can be re-used to work out the moving average of any field in an intelligent way with minimal overhead, potentially opening the gates for lots of people to make good use of this example for their alerting and spike monitoring. The SMA is very reliable especially with a floor value (nested if's) to suppress false positives. 🙂

Thanks.

0 Karma

jlvix1
Communicator

I mean to say, dynamically analyzes the data - one line instead of lots of explicit lines

0 Karma
Get Updates on the Splunk Community!

Earn a $35 Gift Card for Answering our Splunk Admins & App Developer Survey

Survey for Splunk Admins and App Developers is open now! | Earn a $35 gift card!      Hello there,  Splunk ...

Continuing Innovation & New Integrations Unlock Full Stack Observability For Your ...

You’ve probably heard the latest about AppDynamics joining the Splunk Observability portfolio, deepening our ...

Monitoring Amazon Elastic Kubernetes Service (EKS)

As we’ve seen, integrating Kubernetes environments with Splunk Observability Cloud is a quick and easy way to ...