We current log two different messages for the start and completion of a workflow we would like to create an alert that tells us when we see a start log but don't see a corresponding competition log within a threshold i.e. 1 hour:
2018-06-19 03:59:59.8592|INFO|Metrics|corId=37c15fccc6a54b38a5fbbb00095f30d5;df93bc8414a34e308ca26f56ea7e911b;orders-547400b38cff40a0aa25922f7a62cd18;547400b38cff40a0aa25922f7a62cd18|messageId=3b7a948448fe4ce89445c31ddd35deb4|CustomEvent: collection=GambitV2|group=JokerWorkflows|category=Latency|label=WorkflowCompleted|value=5867|correlationId=37c15fccc6a54b38a5fbbb00095f30d5|timestamp=1529380799|tag=JokerOrderPostDeal|host=p2nmdwin00001W|env=prod|service=workflowexecutor
2018-06-19 03:59:59.8592|INFO|Metrics|corId=37c15fccc6a54b38a5fbbb00095f30d5;df93bc8414a34e308ca26f56ea7e911b;orders-547400b38cff40a0aa25922f7a62cd18;547400b38cff40a0aa25922f7a62cd18|messageId=3b7a948448fe4ce89445c31ddd35deb4|CustomEvent: collection=GambitV2|group=JokerWorkflows|category=Latency|label=WorkflowCompleted|value=5867|correlationId=37c15fccc6a54b38a5fbbb00095f30d5|timestamp=1529380799|tag=JokerOrderPostDeal|host=p2nmdwin00001W|env=prod|service=workflowexecutor
We want to write a search that will tell us when we have a log for label=WorkflowStarted but don't have a corresponding label=WorkflowCompleted where the time between the WorkflowStarted and now is greater than a certain threshold. The two messages should always be able to be joined on the correlationId field. We have tried:
index=qa-gambit CustomEvent label=WorkflowStarted | rex field=_raw "\|tag=(?<tag>.*)\|host" | eval StartTime = _time | join correlationId [search index=qa-gambit CustomEvent label=WorkflowCompleted | eval CompletedTime = _time ] | eval duration = CompletedTime - StartTime | convert dur2sec(duration) | stats count by correlationId, group, tag, duration | where count < 2 AND duration > 60
But this doesn't seem to work. Also not sure why we need to regex the tag field as all the other fields are parsed automatically.
Try something like this...
index=qa-gambit CustomEvent (label="WorkflowStarted" OR label="WorkflowCompleted" )
| eval StartTime = case(label="WorkflowStarted",_time)
| eval CompletedTime = case(label="WorkflowCompleted",_time)
| stats min(StartTime) as StartTime
max(CompletedTime) as CompletedTime
range(_time) as duration
count as eventcount
by correlationID
| where isnull(CompletedTime) AND (now() - StartTime) > 3600