Hi All,
I am trying to tabulate an error ratio per unique log event, but using regex to split out the error codes filters events away, which makes the overall hit count wrong in the % calculation: the initial number of unique events is not preserved through eventstats.
A sample log event is below, in JSON format, and there are multiple error codes in the same log event that need an error-wise split:
log:<<field1>>,<<field2>>,<<field3>>,error=60KOANEWLH=500.EBS.SYSTEM.100:67MPW4X79FOJ=500.IMS.SERVEROUT.100:3534U6ZIZY39=500.EBS.SERVERIN.100;404.IMS.SERVEROUT.105:3M8TEWEKVIJK=500.IVS.XXXXX.100;404.IMS.XXXX.105:2ILTH9G0UMG1=500.IMS.XXXXXXXX.100:0UAQL48U2KWF=500.EBS.XXXXXXX.100;404.IMS.XXXXXXXXX.105, missingFulfillmentItems,<<field4>>,<<field5>>,<<field6>>
I would like to get each error code's %, essentially (500.XX.XXXX.100 count / total hits).
Below is the Splunk search filter being used, but it is not getting total events. Please correct me if anything is missed, or could someone assist with an alternate way to compute the error trend? Thanks in advance.
index=<indexname> "Search String" "Type"=prod | eventstats sum(index) as total_hits
| rex field="log.log" ", error=*(?<errorMap2>.+), missingFulfillmentItems"
| eval errors0=replace(errorMap2, "=", ";")
| eval errors1=split(errors0,":")
| rex field=errors1 "(?<errorCodes>.*)"
| mvexpand errorCodes
| eval errorCodes1=split(errorCodes, ";")
| mvexpand errorCodes1
| where like(errorCodes1,"%500.IMS.%")
| stats count by errorCodes1,total_hits
Note: each log event is unique and has multiple error codes within the event, or no error codes if it is a success.
Phew. That first sentence is hard to follow. To be honest, I don't understand what you're trying to do.
And your event is definitely not in JSON format. Maybe that's one of the JSON fields, but that's not a JSON event.
I don't understand the "eventstats sum(index)" part. You're trying to calculate a sum of a string field; that's most probably not what you want. Maybe you simply need "eventstats count"?
BTW, you can probably match your error codes much more easily. After getting your errorMap2, just use rex to find all the error codes.
| rex max_match=0 field=errorMap2 "=(?<errorcode>\d{3}[A-Za-z0-9.]+)"
Now you can do with those matches whatever you want.
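For instance, a quick sketch from there (errorcode is multivalue because of max_match=0):
``` count how often each extracted code occurs across all events ```
| stats count by errorcode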
Thanks @PickleRick for your inputs
Yeah, the total event is in JSON, and the field I'm trying to extract is part of a "log" attribute in the JSON event.
My intention is to compute each error code's percentage out of the total number of requests logged. In this use case an event has either multiple error codes or none, so I am trying to extract each error code while also preserving the total number of requests as "eventstats count" in the initial phase, because the later 5xx filtering (with the like function) yields fewer events. Hope I made the requirement clear.
Thanks, I can use that rex for the error extraction, which is helpful, but I am still unable to compute the error % over a larger duration: the job gets auto-finalized for exceeding the user quota. It seems I'm missing something.
Apart from the obvious techniques which @ITWhisperer already showed - removing fields to reduce the memory footprint - you can do something else. Move the calculations elsewhere 😉
Drop the eventstats completely. Add a fake fallback value for those events that have no errors at all. Then add a fake "count" containing a value of 1/(number of errors) per event. This way you can do sum(fake_count) by error with your stats, which gives you the proper overall count split among the different error values. Now you can do an eventstats sum on this fake sum, which should give you a relatively good approximation (it might be a tiny bit off due to rounding errors) of the original count.
Edit: Alternatively, you could substitute the eventstats with a streamstats count, which will "number" your events. Then, if you get max(count) in your stats, you can take the max of those values as the sequence number of your last event, which is effectively the count of your events. And don't forget the fake error to account for events with no errors.
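For illustration, a minimal sketch of the fake-count idea, reusing the rex from above; the "NO_ERROR" and fake_count names are just placeholders:
index=<indexname> "Search String" "Type"=prod
| rex max_match=0 field="log.log" "=(?<errorcode>\d{3}[A-Za-z0-9.]+)"
``` fallback value so events with no errors still contribute weight ```
| eval errorcode=coalesce(errorcode, "NO_ERROR")
``` each event carries a total weight of 1, split evenly across its error codes ```
| eval fake_count=1/mvcount(errorcode)
``` stats by a multivalue field counts each code once per event, weighted ```
| stats sum(fake_count) as weighted_count by errorcode
``` the weighted counts sum back to (approximately) the original event count ```
| eventstats sum(weighted_count) as total_hits
| eval error_ratio=round(weighted_count/total_hits*100,2)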
Thanks @PickleRick for the inputs
Shouldn't
| eventstats sum(index) as total_hits
be
| eventstats count as total_hits
Thanks @ITWhisperer, you are right.
I modified the search as below to compute the error ratio, but the disk usage exceeds the configured limits and the job gets auto-finalized, showing 0 events when the time window is increased; it works only over a very short duration. Is there a way to optimize further? Can someone please help?
index=<indexname> "Search String XXXXXXXXX" "Type"=prod
| eventstats count as total_hits
| rex field="log.log" ", errorMap=*(?<errorMap2>.+),missingFulfillmentItems"
| eval errors0=replace(errorMap2, "=", ";")
| eval errors1=split(errors0,":")
| rex field=errors1 "(?<errorCodes>.*)"
| mvexpand errorCodes
| eval errorCodes1=split(errorCodes, ";")
| mvexpand errorCodes1
| where like(errorCodes1,"%500.IMS.%")
| stats count(errorCodes1) as error_count by errorCodes1,total_hits
| eval error_ratio=round((error_count/total_hits)*100,2)
| table errorCodes1,error_count,error_ratio,total_hits
| sort -error_ratio
By doing the eventstats count early (before the mvexpands) you are counting the events as total_hits. Your ratio is then comparing each error against the number of events, so, for example, you could have 4 events, one of which has four 500 errors and the other three have no errors. This would give a ratio of 100%. Is this really what you are intending to do?
Yes, right, the event counts will not match. As I mentioned above, my intention is to compute each error code's percentage out of the total number of requests logged; events carry either multiple error codes or none, so I need to preserve the total request count via "eventstats count" early on, because the later 5xx filtering (with the like function) yields fewer events.
Try something like this to reduce memory requirements as you go
index=<indexname> "Search String XXXXXXXXX" "Type"=prod
``` Possibly don't need _raw as you appear to have already extracted log.log? ```
| fields - _raw
``` Count events with streamstats (to keep the pipeline moving) ```
| streamstats count as row
| rex field="log.log" ", errorMap=*(?<errorMap2>.+),missingFulfillmentItems"
``` Remove fields when you are finished with them ```
| fields - 'log.log'
| eval errors0=replace(errorMap2, "=", ";")
``` Remove fields when you are finished with them ```
| fields - errorMap2
| eval errors1=split(errors0,":")
``` Remove fields when you are finished with them ```
| fields - errors0
| rex field=errors1 "(?<errorCodes>.*)"
``` Remove fields when you are finished with them ```
| fields - errors1
| mvexpand errorCodes
| eval errorCodes1=split(errorCodes, ";")
``` Remove fields when you are finished with them ```
| fields - errorCodes
| mvexpand errorCodes1
``` Use max(row) as your event count ```
| stats count as error_count max(row) as total_hits by errorCodes1
``` Only keep error counts for the codes you are interested in - moved to after stats so that correct max(row) is maintained ```
| where like(errorCodes1,"%500.IMS.%")
| eval error_ratio=round((error_count/total_hits)*100,2)
| table errorCodes1,error_count,error_ratio,total_hits
| sort -error_ratio
Thank you so much @ITWhisperer, this is really helpful. The only constraint is around the mvexpand, where the memory limits are reached without sampling, though it works if a sampling ratio is included.
I'd just like to understand: isn't max(row) the sequence number of each event after splitting the error codes? Will it not generate too many events? Sorry, I might be missing something:
stats count as error_count max(row) as total_hits by errorCodes1
Actually, it looks like there might be an extra line needed
``` Use max(row) as your event count ```
| stats count as error_count max(row) as total_hits by errorCodes1
``` Set total_hits to max before filtering out any events ```
| eventstats max(total_hits) as total_hits
``` Only keep error counts for the codes you are interested in - moved to after stats so that correct max(row) is maintained ```
| where like(errorCodes1,"%500.IMS.%")
Moving the eventstats to here (after the stats by errorCodes1) means it is only processing the stats events rather than all the events.
Also, you should bear in mind the hint from @PickleRick about having a fake error code so that the last row has an error code that appears in errorCodes1 i.e. non-null
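For example, a one-line sketch of that fallback, inserted just before the stats; the "NO_ERROR" placeholder name is an assumption:
``` give error-free events a non-null code so their row number reaches stats ```
| eval errorCodes1=coalesce(errorCodes1, "NO_ERROR")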
In fact, the percentages can sum to much more than 100%
Remember that if you're statsing by multivalue fields (or if you count all events early and mvexpand your fields later), the same original event can be counted multiple times.
A percentage counted this way does make sense, but one has to understand what sense it is.
It's most probably because eventstats is a dataset processing command, which means that it needs to gather all the events from your search before it can process them. That's not good for big searches.
You could try to "cheat" by using streamstats and then choosing the max value of your streamstats.
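A minimal sketch of that cheat, again with an assumed "NO_ERROR" placeholder for error-free events:
index=<indexname> "Search String" "Type"=prod
``` streamstats numbers events as they pass, without buffering them all ```
| streamstats count as row
| rex max_match=0 field="log.log" "=(?<errorcode>\d{3}[A-Za-z0-9.]+)"
``` placeholder so the last event survives the group-by even if it has no errors ```
| eval errorcode=coalesce(errorcode, "NO_ERROR")
| stats count as error_count max(row) as total_hits by errorcode
``` the highest row number seen by any group is the total event count ```
| eventstats max(total_hits) as total_hits
| eval error_ratio=round(error_count/total_hits*100,2)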