How to write a query to capture consecutive failed transactions?

Hemnaath
Motivator
Problem statement: Monitor the event sequence and trigger an alert when transactions fail with a technical failure (error code http_5xx) within a 10m interval.

Use-case 1:
10 transactions --> 5 consecutive successes and 5 consecutive technical failures with errorCode (http_5xx) --> should be considered for alert

Use-case 2:
5 transactions --> 3 consecutive technical failures with errorCode (http_5xx) --> 1 validation failure with errorCode (http_4xx) --> 1 technical failure with errorCode (http_5xx) --> should be considered for alert

Use-case 3:
10 transactions --> 2 consecutive successes --> 4 consecutive technical failures --> 1 success --> 2 technical failures --> should be considered for alert

Use-case 4:
Technical failure or no response
========================================================
Application event flow sequences for different scenarios

For a particular transactionid: 517923784

Success-Sequence

MOSRequest --> Validation Passed --> ITAM Request --> ITAM Response(code202) --> SLF Request --> SLF Response(202) --> SIPN Request --> SIPN Response(202) --> MOSResponse(202)

Validation Failure Sequence for SLF System & Technical Failure for MOS Response Sequence

MOSRequest --> Validation Passed --> ITAM Request --> ITAM Response(code202) --> SLF Request --> SLF Error(404) --> SLF Request --> SLF Error(400) --> MOSError(500)

Validation Failure Sequence for SIPN System & MOS System

MOSRequest --> Validation Passed --> ITAM Request --> ITAM Response(code202) --> SLF Request --> SLF Response(202) --> SIPN Request --> SIPN Error(400) --> SIPN Request --> SIPN Error(400) --> SIPN Request --> SIPN Error(400) --> MOSError(400)

Validation Failure Sequence for MOS System

MOSRequest --> MOSError(400)

Technical Failure Sequence for SIPN System

MOSRequest --> Validation Passed --> ITAM Request --> ITAM Response(code202) --> SLF Request --> SLF Response(202) --> SIPN Request --> SIPN Error(500) --> SIPN Request --> SIPN Error(500)

Technical Failure Sequence for SLF System

MOSRequest --> Validation Passed --> ITAM Request --> ITAM Response(code202) --> SLF Request --> SLF Error(500) --> SLF Request --> SLF Error(500)

Technical Failure for MOS Response Sequence

MOSRequest --> Validation Passed --> ITAM Request --> ITAM Response(code202) --> SLF Request --> SLF Error(404) --> SLF Request --> SLF Error(400) --> MOSError(500)
======================================================================

Use case: No response from MOS/SLF/SIPN in the application flow above should be treated as a technical failure.

No-response sequence for the MOS system

MOSRequest --> Validation Passed --> ITAM Request --> ITAM Response(code202) --> SLF Request --> SLF Response(202) --> SIPN Request --> SIPN Response(202) --> MOSResponse(202) --> A successful transaction has MOSResponse as its last event.

MOSRequest --> Validation Passed --> ITAM Request --> ITAM Response(code202) --> SLF Request --> SLF Error(404) --> SLF Request --> SLF Error(400) --> MOSError(500) --> A failed transaction has MOSError(500) as its last event.

However, when a network glitch occurs during a transaction, there is a high chance that no response or error event is captured in Splunk. Such a transaction should be treated as a technical error.

Similarly for SIPN and SLF:

If an SLF Request is not followed by an SLF Response or SLF Error event, the transaction qualifies as a technical failure. The same applies to SIPN events. An alert should be raised for more than 5 consecutive such events.
 
Query details:

index=X sourcetype=x source="mos_api"     --> Filter MOS-specific data

| rex field=_raw "^(?:[^ \n]* ){3}\[(?P<EventSeq>[^\]]+)"     --> Field extraction to understand the flow of the event sequence

| rex field=_raw "\[(?P<errorCode1>[\d]+)\]"     --> Extract the errorCode details

| eval EventSeq_Code=EventSeq."_".errorCode1     --> Concatenate EventSeq and errorCode to see the event flow together with the error code of each event in the sequence

| eval time=strftime(_time,"%Y-%m-%d %H:%M:%S")     --> Convert epoch time to a human-readable format

| stats values(time) as time values(EventSeq) as EventSeq values(errorCode1) as errorCode values(EventSeq_Code) as EventSeq_Code values(API) as API by transactionid
     --> Use stats to get the unique values of the fields used in the final result
| eval alert_mos=case(EventSeq="MOSRequest" AND EventSeq="Error" AND like(EventSeq_Code,"Error_500%"),"MOSTechnicalFailure", EventSeq="MOSRequest" AND EventSeq="Error" AND like(EventSeq_Code,"%Error_%400%"),"MOSValidationFailure", EventSeq="MOSRequest" AND EventSeq!="MOSResponse" AND EventSeq!="MOSError","No Response from MOS")

| eval alert_ITAM=case(EventSeq="ITAMRequest" AND EventSeq="ITAMResponse" AND like(EventSeq_Code,"%200%"),"NA")

| eval alert_SLF=case(EventSeq="SLFRequest" AND EventSeq="Error" AND like(EventSeq_Code,"Error_500%"),"SLFTechnicalFailure", EventSeq="SLFRequest" AND EventSeq="Error" AND like(EventSeq_Code,"%Error_%400%"),"SLFValidationFailure", EventSeq="SLFRequest" AND EventSeq!="SLFResponse" AND EventSeq!="SLFError","No Response from SLF")

| eval alert_SIPN=case(EventSeq="SIPNRequest" AND EventSeq="Error" AND like(EventSeq_Code,"Error_500%"),"SIPNTechnicalFailure", EventSeq="SIPNRequest" AND EventSeq="Error" AND like(EventSeq_Code,"%Error_%400%"),"SIPNValidationFailure", EventSeq="SIPNRequest" AND EventSeq!="SIPNResponse" AND EventSeq!="SIPNError","No Response from SIPN")
| sort _time
| streamstats reset_on_change=true time_window=10m count by alert_mos,alert_ITAM,alert_SIPN,alert_SLF
 
The above query covers most of the scenarios and uses streamstats, but I am unable to get any output. Can anyone please advise whether this approach is right, or whether the query should be changed completely to get the result?

ITWhisperer
SplunkTrust

_time is no longer available after the first stats command - perhaps include 

latest(_time) as _time

to get the time of the last message for each transaction?
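
A minimal sketch of how that could fit into the posted search, not a tested answer. It reuses the index, source, rex patterns and field names from the question. The fields `status` and `consecutive`, the regular expressions passed to `mvfind`, and the threshold of 5 are assumptions made for illustration and would need adjusting to the real event names. `streamstats` with `time_window` needs the rows in time order, which is why `_time` has to survive the `stats`. The sketch also collapses the per-system alert fields into one `status` field with a `true()` default, because, as far as I know, rows where a by-field is null are left out of the `streamstats` grouping.

```
index=X sourcetype=x source="mos_api"
| rex field=_raw "^(?:[^ \n]* ){3}\[(?P<EventSeq>[^\]]+)"
| rex field=_raw "\[(?P<errorCode1>[\d]+)\]"
| eval EventSeq_Code=EventSeq."_".errorCode1
| stats latest(_time) as _time values(EventSeq) as EventSeq values(EventSeq_Code) as EventSeq_Code by transactionid
| eval status=case(isnotnull(mvfind(EventSeq_Code, "Error_5\d\d")) OR isnull(mvfind(EventSeq, "^MOS(Response|Error)$")), "TechnicalFailure", true(), "Other")
| sort 0 _time
| streamstats reset_on_change=true time_window=10m count as consecutive by status
| where status="TechnicalFailure" AND consecutive>=5
```

With `reset_on_change=true` the count restarts whenever `status` differs from the previous row, so `consecutive` is the length of the current run of technical failures inside the 10m window. Here a transaction with any 5xx error, or with no final MOSResponse/MOSError event, is treated as a technical failure. Validation failures fall into "Other" and break the run, which may or may not match what use-case 2 intends.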
