Alerting

How do I create an alert to trigger when a transaction with 2 events is not complete?

proylea
Contributor

I want to set up an alert to fire when a transaction with 2 events is not complete.
Here is my search:

("Submitted order" OR "Murex - Received ExecutionReport" ) sourcetype="enable-integration" | rex "(?i).*?[^=](?P<TAG198>198=[^\x01]*)" | transaction TAG198 startswith="Submitted order" | search eventcount=1

This needs to run in real time over the last minute. The trouble I am having is that the start and finish events are about 5 seconds apart, so the alert fires when the start event appears before the finish event has had time to come in.

Can someone point me in the right direction please?

Kind Regards
Peter

1 Solution

emiller42
Motivator

Try a non-transaction approach:

 sourcetype="enable-integration" ("Submitted order" OR "Murex - Received ExecutionReport") 
| rex "(?i).*?[^=](?P<TAG198>198=[^\x01]*)" 
| eval start=if(searchmatch("Submitted order"), 1, 0), end=if(searchmatch("Murex - Received ExecutionReport"), 1, 0)
| stats earliest(_time) as _time latest(_time) as last_seen sum(start) as started sum(end) as ended by TAG198
| eval time_since_last_seen=now()-last_seen
| search ended=0 AND time_since_last_seen>10

Explanation:

  1. Search for relevant log events.
  2. Extract your TAG198 field.
  3. Create fields to indicate whether the event is a starting or ending event. (Useful later.)
  4. Do a stats for each TAG198, grabbing the earliest and latest timestamps for events, and a sum of the start/end flags. Items which have ended will have sum(end) > 0, which gives you an easy way to filter for open items.
  5. Generate a field to show the time difference between when the search is run and the most recent timestamp of the item.
  6. Filter for items which have not ended and which have existed for more than 10 seconds. (Adjust as needed.)

This gives you a means to say "Show me open transactions which started more than X seconds ago", which sounds like what you're looking for.
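
If you save this as a scheduled alert, a small formatting step on the end can make the notification easier to read. For example, you could append something like this (tostring(X, "duration") renders a count of seconds as HH:MM:SS):

| eval open_for=tostring(time_since_last_seen, "duration") | table TAG198 _time open_for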

Worth noting: When trying to deal with near-realtime, you're going to start bumping into the potential of false positives caused by the fact that it takes time for data to end up in Splunk. There is a delay between log write > forwarder read > forwarder send > indexer receive > indexer write. Realtime searches insert themselves as far up the pipeline as possible (They're actually injected in the indexing pipeline before most of the parsing) but there's still a non-zero delta between log write > visible search result.
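
If you want to see how big that delta actually is for your data, you can compare the built-in _indextime field (when the indexer wrote the event) against _time (when the event was logged). Something along these lines gives a rough picture:

sourcetype="enable-integration" | eval index_lag=_indextime-_time | stats avg(index_lag) max(index_lag)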

When dealing with this stuff, think about how quickly you can realistically action a problem. Getting alerted in real time is awesome, but pointless if it takes you a couple of minutes to respond to the alert. At that point, being a minute or two behind on detection doesn't hurt your responsiveness, and it makes it much easier to omit false positives due to timing.

This all depends on your actual use case and expectations, so I won't tell you that you don't need realtime. But it is something to really consider.



proylea
Contributor

The working result, provided by emiller, is the accepted solution above.



lmonahan
Path Finder

This is awesome, thank you!


mmulani
New Member

Hi emiller42,

I was facing the same problem; using the non-transaction query resolved the issue.
But the query only works with time_since_last_seen>60; if I put 300 instead of 60, it does not work.

I want an alert if the ending message does not come within 5 minutes, but I'm not sure why it isn't working. Please help.

Thanks


proylea
Contributor

Hi Miller, Rich,
Thanks, guys, for your work on this; it's been a good learning process.
The emiller query, with one small correction on line 3 to put a pipe between the evals:

| eval start=if(searchmatch("Submitted order"), 1, 0) | eval end=if(searchmatch("Murex - Received ExecutionReport"), 1, 0)

Worked like a charm.
I am now getting alerts for incomplete transactions.
Kind Regards
Peter


emiller42
Motivator

Ahh, the chained eval thing is new in 6.3. If you're still on 6.2, you need to eval each separately. Apologies for the confusion there!

EDIT: Whoops, I had a typo too. It should be | eval start=if(searchmatch("Submitted order"), 1, 0), end=if(searchmatch("Murex - Received ExecutionReport"), 1, 0) and I've corrected it above.

If this did work, can you mark it as accepted so others can find this?

Thanks!

Richfez
SplunkTrust

There's going to be some difficulty in using transaction in real time to find transactions that aren't closed in real time. So can we make sure we're talking about precisely the same things?

By "real-time over the last minute" do you mean that once per minute the search will run and look at the past minute and alert on transactions that aren't complete? This is certainly possible. The solution here is to use a latest/earliest timeframe that shifts the "current" time backwards by 10 seconds to give transactions time to close (in your root search, put earliest=-10s latest=-70s then run that once per minute.)

Or do you really mean actual "real-time"... in which case "over the last minute" isn't really applicable, and this becomes a bit more difficult?

proylea
Contributor

Here is what I did in the end to get the alert result I was looking for

"Submitted order" sourcetype="enable-integration" earliest=-60s latest=-15s | rex "(?i).*?[^=](?P<TAG198>198=[^\x01]*)" | join TAG198 [ search "Murex - Received ExecutionReport" sourcetype="enable-integration" earliest=-60s | rex "(?i).*?[^=](?P<TAG198>198=[^\x01]*)" ] | transaction TAG198 startswith="Submitted order" | search eventcount=1

The search looks for event 1 in a 45-second window ending 15 seconds ago, then the join looks for the end event over the last minute.


Richfez
SplunkTrust

I'd overlap those time frames. As you have it configured now, it will only connect the dots between events that have a start in the first part of the window and an end in the later part, so if a transaction starts at twelve seconds past and finishes at nineteen seconds past, you'll miss it. I'm fairly certain, anyway; test and you can be sure one way or the other. You shouldn't have a problem making the first be -60s to -15s and the second -60s to now. You can use the same start time because the transaction will fix that; it's the ending ones that matter.

Also, probably just a typo, but you have "now" in your subsearch. That will search only for events that contain the literal word "now". Since the subsearch already runs up to the current time by default, you can just leave "now" out entirely. Also test this to make sure, but I'm confident removing "now" will be fine. (If it works now, it's accidental, due to the word "now" appearing in the events; if you don't need it, don't rely on it. 🙂)


proylea
Contributor

Hi Rich
I'm wondering if you could assist again; the search is returning false alerts.
I thought what I was doing with that search was finding part 1 of the transaction between 15 and 60 seconds out from the current time, then doing a join to see if the transaction's part 2 is found in the last 60 seconds, thus giving me sufficient overlap to identify whether there is a completed transaction.

However, the search returned a positive result (as in, transaction not complete) for the transaction below, which actually did complete within 2 seconds.

2015-11-17 00:36:44.390 [NABClientServices-akka.actor.default-dispatcher-649] INFO a.c.n.s.s.u.l.Slf4jTradingLogger - [orderId=SYD171115-0000020811 | event=BOOKING] Submitted Order to Bank Services: 8=FIXT.1.19=57735=D49=Star Node (0)52=20151117-00:36:44.38956=BOOKING1128=911=SYD171115-0000020811198=SYD171115-0000020811-s110109=SX10110=ESP453=2448=1622447=D452=3448=dpolidan447=D452=1278=179=7431507736=JPY737=8759000080=1000000.00711=3311=AUD/JPY10114=TRADER1045=87.59311=AUD/USD10114=TRADER1045=0.71036907173252311=USD/JPY10114=TRADER1045=123.302102365432463=064=2015111975=2015111755=AUD/JPY461=RCSXXX60=20151117-00:36:44.34740=D54=238=1000000.0044=87.590631=87.596015=AUD117=43dac3f1-8cc3-11e5-bb99-d89d671a488011005=N11090=DEFAULT11006=N10=235 

2015-11-17 00:36:45.962 [NABClientServices-akka.actor.default-dispatcher-183] INFO a.c.n.s.s.u.l.Slf4jTradingLogger - [orderId=SYD171115-0000020811 | event=BOOKING] Murex - Received ExecutionReport [sxId=198=SYD171115-0000020811-s1]: 8=FIXT.1.19=46035=849=BOOKING52=20151117-00:36:45.66856=STAR1128=937=2757919011=SYD171115-0000020811198=SYD171115-0000020811-s110109=SX10110=ESP17=cc0f82cf-66b3-4d37-9fb9-db21e65f39c6150=F39=2453=1448=DPOLIDAN452=1278=179=7431507736=JPY737=8759000080=1000000.00711=1311=AUD/JPY1045=87.5963=064=2015111975=2015111710111=SP55=AUD/JPY461=RCSXXX60=20151117-00:36:44.00054=238=1000000.0014=1000000.00151=044=87.590631=87.596015=AUD11080=0.0010=164

Am I using the time frame for the searches correctly?

Kind Regards
Peter


Richfez
SplunkTrust

So, does it usually work correctly and you have a few oddballs? Or does it usually not work and the oddballs are ones where it does?

My guess is the former, and that it has to do with the timing of when things run. I think, given how you are doing it, that your time frames are sort of correct, but maybe they're in the wrong places. I don't do joins often, so I'm not as familiar with them as with other things, but reading the join docs I have some ideas.

I expanded your main search's time range: if you are running it once per minute, configured as you have it, you are only going back -60 seconds to -15 seconds and are therefore skipping 15 seconds each minute. Try these times; they keep you from missing data, and they might open up the window for the "closing" events enough to catch them all.

"Submitted order" sourcetype="enable-integration" earliest=-75s latest=-15s | rex "(?i).*?[^=](?P<TAG198>198=[^\x01]*)" | join TAG198 [ search "Murex - Received ExecutionReport" sourcetype="enable-integration" earliest=-90s | rex "(?i).*?[^=](?P<TAG198>198=[^\x01]*)" ] | transaction TAG198 startswith="Submitted order" endswith="Murex - Received ExecutionReport" | search eventcount=1

Now, I think that might be more work than it needs to be. (No, I'm not convinced this is the case, I just think it right now... 🙂 ) Perhaps... OK, if you rex your TAG198, you can then search where that exists and create your transaction on just those items.

sourcetype="enable-integration" earliest=-75s latest=-15s | rex "(?i).*?[^=](?P<TAG198>198=[^\x01]*)" | search TAG198=* | transaction TAG198 startswith="Submitted order" endswith="Murex - Received ExecutionReport" | search eventcount=1

Maybe. No, I think not. This will still leave transactions near the end points open when they shouldn't be.

But really from your description earlier, you need to find out where you have a "Submitted order" without a subsequent "Murex - Received ExecutionReport" within ... 15 seconds? If you are running this at one minute intervals, and again if we assume that anything that starts will end within 15 seconds (shifting the time to -75s to -15s, run once per minute), we could probably work something out with stats (not likely) or creating a lookup and using that (more likely).

With most of these I'm still trying to wrap my head around it, but I think you'll be hard-pressed to remove all the problems. You almost have to limit the "Submitted order" search down to a short time period (-75s to -15s), then let the check for the closing action run fairly wild, because you really NEED the match if a match can be found, even if it's a bit outside the time frame you thought it would be in.
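
If join keeps fighting you, one more option in the same spirit is to skip join entirely and exclude starts whose TAG198 shows up in a subsearch for the end events. Just a sketch, untested against your data:

"Submitted order" sourcetype="enable-integration" earliest=-75s latest=-15s | rex "(?i).*?[^=](?P<TAG198>198=[^\x01]*)" | search NOT [ search "Murex - Received ExecutionReport" sourcetype="enable-integration" earliest=-90s | rex "(?i).*?[^=](?P<TAG198>198=[^\x01]*)" | dedup TAG198 | fields TAG198 ]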

Anyway, give those a try, let me know.


proylea
Contributor

I get only false results with this.
When I break down the search, I can see the incoming start transaction:

"Submitted order" sourcetype="enable-integration" earliest=-75s latest=-15s | rex "(?i).*?[^=](?P<TAG198>198=[^\x01]*)"

As soon as I add the join search, there is no longer a result returned for the first part.
I really want to treat those 2 searches independently.
I imagine I would almost want to use the join search that is trying to locate the end event to populate a value that I can test against later, like an eval or something.
I'm stuck.


proylea
Contributor

I'm a bit confused as to why this seems so hard to do; I would have thought it would be a primary use case for a command like transaction.


Richfez
SplunkTrust

Well, sort of, but not quite.

Transaction is used to group events together. If you wanted to do this, I doubt you'd even have questions - it would probably just work.

But what you actually want aren't transactions. I think I mentioned this in a recent comment: what you really want to find are events which COULD start a transaction but don't, because they have no closing event. "Start" with no "finish". How do we know there's genuinely no closing event, rather than one that just took a few moments longer? How can we tell if one side of the transaction or the other just wasn't in our time frame? We can pretty easily do that if we deal with longer, older time frames, but because we're dealing with near-real-time stuff, this gets significantly harder.

The point is that we can separate "good transactions" from everything else easily, but what's left over isn't necessarily "bad transactions" by your criteria. What's left over is every event that didn't fit the precise definition of the transaction: some of those are your target events, and some may or may not be.

That's why, after a few only partially successful attempts, I started approaching it from the view of "what is it you are actually trying to do?" I hadn't quite gotten there yet, but there's no time like the present to step back and confirm what we're actually trying to accomplish with this:

What is it you are trying to accomplish, overall, with knowing within seconds that some "transaction" isn't complete? Can you get the information more easily elsewhere - is there a service that fails or is failing when this happens? What would you do differently, or what SLA would be significantly affected by knowing this within seconds instead of within minutes?

I'm also going to poke some other folks on IRC (who are as a rule far brighter than I am) to see if anyone has better thoughts on how to accomplish your objectives. If you aren't on there, you might want to pop on and ask around. It's another community-supported thing, but it's far more "interaction"-based than Answers. I bet we can help you better there.


Richfez
SplunkTrust

Proylea,

zerf from IRC suggested your join probably needs to be "join type=left ..." because, if you are using that TAG to join them on, join defaults to "inner", which means the tag has to exist on both sides of the join (main search and subsearch), and that's exactly what you are looking to not find. "left" will make it include results where there are no matches on the other side.

Not sure which way around the two searches (main and subsearch) need to be, but it should be easy to test each.
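
To make that concrete, a left-join version of the earlier search might look something like this (the ended flag is just a marker added in the subsearch so that unmatched starts can be filtered out with isnull):

"Submitted order" sourcetype="enable-integration" earliest=-75s latest=-15s | rex "(?i).*?[^=](?P<TAG198>198=[^\x01]*)" | join type=left TAG198 [ search "Murex - Received ExecutionReport" sourcetype="enable-integration" earliest=-90s | rex "(?i).*?[^=](?P<TAG198>198=[^\x01]*)" | eval ended=1 ] | where isnull(ended)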

emiller42 did some thinking and looking and is also posting an answer - if that works, feel totally free to unaccept my answer and accept his instead! You won't hurt my feelings - I'd rather have the "most right" answer flagged than mine.


proylea
Contributor

Thanks Rich, I've updated the search and all is well.
Cheers


proylea
Contributor

That's exactly what I needed, thanks.


proylea
Contributor

I wasn't able to get this query working in the end; the query provided by emiller, marked as the solution above, is the one that got the alert working perfectly.


Richfez
SplunkTrust

I've converted my comment to an answer so that you can accept it.

Glad we could help!
