I have the following queries
query 1 : index1 .... | table _time uniqueID
query 2 : index2 .... | table _time uniqueID
And I am trying to find events where the uniqueID is found in both AND the difference between the times is greater than N milliseconds
Ideally, the output should be something like :
uniqueID | System 1 time | System 2 time | Difference in Millis |
123 | Time 1 | Time 2 | 500 |
1234 | Time 11 | Time 22 | 60 |
Could you please help?
Basic approach could be
(search1) OR (search2)
| sort _time
| streamstats global=f window=2 range(_time) as range min(_time) as min max(_time) as max by uniqueID
| where range > 50
| eval t1=strftime(min, "%F %T"), t2=strftime(max, "%F %T")
| table uniqueID t1 t2 range
but what you get will depend on how many events there are for each uniqueID per search/index and whether there are duplicates, so you may need to tweak this to get the results you want
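As an aside, since your question asked about milliseconds: range(_time) is in seconds (with subsecond precision if your timestamps carry it), so convert before comparing. A minimal sketch of that tweak, assuming a threshold of 500 milliseconds:

(search1) OR (search2)
| sort _time
| streamstats global=f window=2 range(_time) as range min(_time) as min max(_time) as max by uniqueID
| eval diff_ms=round(range*1000)
| where diff_ms > 500
| eval t1=strftime(min, "%F %T.%3N"), t2=strftime(max, "%F %T.%3N")
| table uniqueID t1 t2 diff_ms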
Could you please explain the query?
Query does
(search1) OR (search2)
| sort _time
searches both data sets and sorts by time - it may not be necessary to sort, as events should come out in time order
| streamstats global=f window=2 range(_time) as range min(_time) as min max(_time) as max by uniqueID
Looks over a window of 2 events with the same uniqueID and calculates the 'range' of time between the two, storing the result in a new field called 'range'. The value will be the range of the two _time values, so it is in seconds.
| where range > 50
keeps only the events where the range exceeds the threshold you are looking for, here 50 seconds.
then the table formatting. Note that this will just give you the gap between every 'pair' of events, so if you have 3 events in index1 and 2 in index2 for the same ID, you will get the difference between each consecutive pair, hence my comment about duplicates, i.e. you need to know what you expect to get given your data.
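If every uniqueID appears exactly once in each index, a simpler alternative to streamstats is to split the times with stats. This is a sketch assuming "index1" and "index2" are the literal values of the index field in your two searches:

(search1) OR (search2)
| stats min(eval(if(index=="index1",_time,null()))) as time1 min(eval(if(index=="index2",_time,null()))) as time2 by uniqueID
| where isnotnull(time1) AND isnotnull(time2)
| eval diff_ms=round((time2-time1)*1000)
| where diff_ms > 500
| table uniqueID time1 time2 diff_ms

The where after the stats discards IDs that only appeared in one index, which matches your "found in both" requirement.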
What do you recommend to handle duplicates?
We expect to have duplicates, but each duplicate will be present in both indexes
Can you give an example of how you would get multiple events for the same uniqueID across each data set, the order in which they might appear, and their significance?
i.e. if your data set had this sequence, what 'range' are you looking for?
10:00:01 index1 event_1 ID1
10:00:05 index1 event_2 ID1
10:00:09 index2 event_3 ID1
10:00:14 index2 event_4 ID1
that very much dictates the solution
Usually, the expected path is index1->index2
But occasionally it is possible to see index1 -> index1 -> index2 -> index1
For example, for the base duplicated case, I am expecting to see something like
10:00:01 index1 event_1 ID1
10:00:05 index2 event_1 ID1
10:00:09 index1 event_1 ID1
10:00:09 index1 event_1 ID2
10:00:14 index2 event_1 ID2
10:00:14 index2 event_1 ID1
All events are expected to land in both indexes in order: first in index1 and then in index2
For this scenario, I am expecting to see both results in the table in the order that they appeared
At the same time, occasionally and when the events are lost, it is possible to see this
10:00:01 index1 event_1 ID1
10:00:09 index1 event_1 ID1 (a retry of a request that took too long or was lost in transit)
10:00:14 index2 event_1 ID1
10:00:15 index1 event_1 ID1 (a delayed response)
This situation is possible but not as common
The expected output here is trickier. When this happens, we can try using the second request and the latest response, even if that is not as accurate.
Using the example data you've shown, here is some SPL that will hopefully point you in the right direction about how to handle these cases and what to do with the possible values of range that get calculated.
It's non-trivial to write SPL that will handle an unpredictable dataset, so you'll have to see what values you can get out of the data and what will make sense in each condition.
But the basis is that streamstats will do the job of calculating deltas between field values; here it uses a window of 2 for each ID, and from there you can probably work out which values are important.
| makeresults
| eval _raw="10:00:01 index1 event_1 ID1##10:00:05 index2 event_1 ID1##10:00:09 index1 event_1 ID1##10:00:09 index1 event_1 ID2##10:00:14 index2 event_1 ID2##10:00:14 index2 event_1 ID1"
| eval x=split(_raw, "##")
| mvexpand x
| rex field=x "(?<T>[^ ]*) (?<ind>[^ ]*) (?<EV>[^ ]*) (?<ID>.*)"
| eval _time=strptime(T, "%T")
| table _time ind EV ID x
| streamstats global=f window=2 range(_time) as range min(_time) as min max(_time) as max by ID
| stats values(x) as dataset min(min) as min max(max) as max list(ind) as ind list(range) as range values(EV) as EV by ID
| eval total_duration=max-min
| append [
| makeresults
| eval _raw="10:00:01 index1 event_1 ID3##10:00:09 index1 event_1 ID3##10:00:14 index2 event_1 ID3##10:00:15 index1 event_1 ID3"
| eval x=split(_raw, "##")
| mvexpand x
| rex field=x "(?<T>[^ ]*) (?<ind>[^ ]*) (?<EV>[^ ]*) (?<ID>.*)"
| eval _time=strptime(T, "%T")
| table _time ind EV ID x
| streamstats global=f window=2 range(_time) as range min(_time) as min max(_time) as max by ID
| stats values(x) as dataset min(min) as min max(max) as max list(ind) as ind list(range) as range values(EV) as EV by ID
| eval total_duration=max-min
]
Hope this helps
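For the base duplicated case, where each request/response pair lands in index1 first and index2 second, another approach is to number the occurrences per uniqueID and index with streamstats, then pair the Nth index1 event with the Nth index2 event. This is a sketch only, not tested against your real data, and again assumes "index1"/"index2" are the actual index field values:

(search1) OR (search2)
| sort 0 _time
| streamstats count as seq by uniqueID, index
| stats min(eval(if(index=="index1",_time,null()))) as t1 min(eval(if(index=="index2",_time,null()))) as t2 by uniqueID, seq
| eval diff_ms=round((t2-t1)*1000)
| where isnotnull(diff_ms)
| table uniqueID seq t1 t2 diff_ms

Note this drops unpaired events, so the retry scenario (index1 -> index1 -> index2 -> index1) would still need the kind of special handling shown in the streamstats example above.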