Trendline to work grouped by field

Esperteyu · ‎10-14-2018

Hi,

My intention is to measure the 2 hour moving average of the events with X201 reason code ratio compared to the total of info events logged per country:

index="logger" "Log Info"
  | bin _time span=1h
  | eval _raw=replace(_raw,"\\\\\"","\"")
  | rex "\"RawRequest\":\"(?<raw_request>.+)\"}$"
  | eval json= raw_request
  | spath input=json output=country_code path=customer.billingAddress.countryCode
  | spath input=json output=reason_code path=history{}.reasonCode
  | where country_code = "USA" 
  | eval error=if(like(reason_code,"%X201%"),1,0) 
  | stats count as total sum(error) as errors by _time, country_code
  | trendline sma2(errors) as 2_hours_error_sma sma2(total) as 2_hours_total_sma
  | table _time, country_code, total, errors, 2_hours_error_sma, 2_hours_total_sma

In there I added the where country_code="USA" to double check that the results were ok and they are, but if I remove that clause the results get messed up and I'm struggling to understand why as I would expect the data to be grouped properly.

N.B The table clause is just for purposing of checking the data, eventually my idea would be to use it for detecting spikes of the ratio of errors/total but that would be a further step

EDIT: Turns out streamstats honours the split field (and in conjuction to global=f I get what I expect), I leave it here for reference and just in case someone proves me wrong but I think this works. I found it in the answer to 417618/how-to-search-and-alert-on-anomaliesspikes-in-mult (can't paste links, but that is a great answer)

 index="logger" "Log Info"
   | bin _time span=1h
   | eval _raw=replace(_raw,"\\\\\"","\"")
   | rex "\"RawRequest\":\"(?<raw_request>.+)\"}$"
   | eval json= raw_request
   | spath input=json output=country_code path=customer.billingAddress.countryCode
   | spath input=json output=reason_code path=history{}.reasonCode
   | eval error=if(like(reason_code,"%X201%"),1,0) 
   | stats count as total sum(error) as errors by _time, country_code
   | streamstats window=2 mean(total) as sma2_total mean(errors) as sma2_errors by country_code global=false
   | table _time, country_code, total, errors, sma2_total, sma2_errors

DalJeanis · ‎10-16-2018

Your solution can work. I have some concerns about using the binned counts as your simple moving average... but that's going to depend on the scale of what you are looking at. It should be fine if you are looking over multiple days. On the other hand, if you want a more granular count, you might try something more like this...

index="logger" "Log Info"
    | eval _raw=replace(_raw,"\\\\\"","\"")
    | rex "\"RawRequest\":\"(?<raw_request>.+)\"}$"
    | eval json= raw_request
    | spath input=json output=country_code path=customer.billingAddress.countryCode
    | spath input=json output=reason_code path=history{}.reasonCode
    | eval error=if(like(reason_code,"%X201%"),1,0) 
    | streamstats time_window=2h count as RunningCount sum(error) as RunningError by country_code
    | bin _time span=1h
    | stats count as total  sum(error) as errors  avg(RunningCount) as sma2_total avg(RunningError) as sma2_error  by _time, country_code
    | table _time, country_code, total, errors, sma2_total, sma2_errors

Esperteyu · ‎10-16-2018

@DalJeanis your query doesn't seem to work. For instance gives these results (I omit country_code and time for space)

total   errors  sma2_total                 sma2_error
61        0     78.78688524590164          1
   38         1     49.421052631578945         0.39473684210526316
   19         0     29.57894736842105          0
   16         0     23.9375                                0
   9          0     18.77777777777778          0
  12          0     10.5                                           0
 4                0     2.5                                        0

In any case, and to give more context, this is a way of getting my head around the problem I'm trying to solve. I will need to alert on spikes on the error/total ratio per country compared to the 2 hour moving average of that measure, so I don't think binning them is the solution as I will be running the query every 10 minutes. Your approach seems more promising as, please correct me if I'm wrong, I will be "storing" the number events and error for the previous 2 hours so then I could just use the ratio on those ten minutes to compare against the one coming from those two numbers

Trendline to work grouped by field

Enter the Agentic Era with Splunk AI Assistant for SPL 1.4

Stronger Security with Federated Search for S3, GCP SQL & Australian Threat ...

Accelerating Observability as Code with the Splunk AI Assistant

Join the Conversation

Trendline to work grouped by field

Enter the Agentic Era with Splunk AI Assistant for SPL 1.4

Stronger Security with Federated Search for S3, GCP SQL & Australian Threat ...

Accelerating Observability as Code with the Splunk AI Assistant