Alternatives to left outer join on large dataset a...

max_y0586 · ‎02-09-2016

I have two searches with the result as displayed below. Here I want to find the service related to each activity based on the event time i.e.
For a given activity, pick the Service event which occurred closest to the Activity's event time and join the event's Service name.
The common field between these searches is the SessionId. Also the "EventTime" is an actual field in the search result and not the standard SPLUNK time (_time).

Search 1 result

Id    SessionId     Activity     Search1Time
1     123           a            2016-02-06 14:26:00Z
2     123           a            2016-02-06 14:28:00Z
3     123          b            2016-02-06 15:06:00Z
4     123          a            2016-02-06 15:36:00Z
5     234           c            2016-02-06 16:36:00Z

Search 2 result

Id    SessionId     Service      Search2Time                       
1     123           alpha        2016-02-06 12:36:00Z
2     123           gamma        2016-02-06 13:00:00Z
3     123           gamma        2016-02-06 14:00:00Z
4     123           theta        2016-02-06 15:00:00Z

Expected Intermediate Result

Id  SessionId   Activity   Service    Search1Time
1   123         a          gamma      2016-02-06 14:26:00Z
2   123         a          gamma      2016-02-06 14:28:00Z
3   123      b          theta      2016-02-06 15:06:00Z
4   123      a          theta      2016-02-06 15:36:00Z
5   234         c          Missing    2016-02-06 16:36:00Z

Expected Final Result

gamma - 2
theta - 2
Missing - 1

The solution that I am trying to use is,
Perform a left join then calculate the delta between EventTimes of Search1 and Search2 and sort it in the ascending order.
Use the "Id" column to get the unique results, which should be the one with the lowest delta.

Search2|eval Search2Time=strptime(EventTime,"%Y-%m-%d %H:%M:%S.%NZ")| stats count by SessionId Service Search2Time| 
join type=left max=0 overwrite=false SessionId
[search Search1| eval Search1Time=strptime(EventTime,"%Y-%m-%d %H:%M:%S.%NZ")| fields Id SessionId Search1Time Activity] | 
eval Delta=abs(Search1Time-Search2Time)|sort by Delta| fields Id SessionId Activity Service Search1Time
| where Id!=" "| stats dc(Id) by Service

The Actual Result for smaller dataset using the above approach is

gamma - 2
theta - 2

Issues
1. The search performance is really very slow since there are millions of events and I am using a Join
2. The subsearch limit of 50000 causes inconsistencies in the result
3. Not sure if all the edge cases are covered
4. My query doesn't cover the "Missing" case because the subsearch limit forced me to swap the searches in the join. Also, I cannot use the .conf file due to privelege restrictions.
5. Is there a way I could get the desired result and also retain all the columns from Search1 i.e. Activity details.

Can you suggest an alternative to achieve the same for large datasets addressing the "Missing" case ? Appreciate your help !

gfreitas · ‎02-17-2016

Hi Max,

I would suggest not using a join, maybe a transaction can assist you better. You basicaly can search matching the events of source 1 and source2 with something like

(search1) OR (search2)

and then simply transaction by SessionId. The final search would be smth like this:

 (search1) OR (Search2) | eval Search1Time=strptime(EventTime,"%Y-%m-%d|eval Search2Time=strptime(EventTime,"%Y-%m-%d %H:%M:%S.%NZ")| transaction  SessionId | where Id!=" "| stats dc(Id) by Service

Hope this helps!

max_y0586 · ‎02-16-2016

Can someone help me with this query ?

Alternatives to left outer join on large dataset and processing inline logic ?

Enterprise Security Content Update (ESCU) | New Releases

Why am I not seeing the finding in Splunk Enterprise Security Analyst Queue?

Index This | What are the 12 Days of Splunk-mas?