Solved: How can I use joins (or an alternative) for a sear...

DrRich · ‎08-10-2017

Hi,

I've written a query (see original query below) which joins 3 different event types to display A_events started during the selected date range. The A_events are selected using the timepicker and these are left joined via the event_id to return the B_events, returning the B_time, along with the packet_size. Finally, these are left joined using the device_id to the C_events to return the contextual data.

I'm having a number of issues here:

Using joins is too slow hence why I have played around with stats (see first stats query).
There is a limit on the number of events returned by a sub-search (50,000 events) so my C_events are being truncated.
My first attempt at writing a stats based query is faster than joins but because the event selection criteria has been moved to the top level of the query removing the need for the first join I don't seem to be able to have different time periods for each of the event types e.g. (event="A" timepicker based) OR (event="B" timepicker value + 2 days)
The other issue about having all the event selection at the top level is the C_events can occur anytime during the last 2 years and there are likely to be multiple events per device_id of which I need to return the most recent, hence the dedup on the second subsearch.
I have tried to use an append to apply different timescales to the different event types but this too runs very slowly and there is a limit of 50,000 events which are being truncated.
Finally, I am struggling with stats because the fields used to join the sub-queries are different. The A and B events have device_id and event_id as common fields where as the C events only have device_id
I did try a second stats query but I ended up with listing multiple event_ids within the same row as they had the same device_id.

Question
How would you advise I tackle this query based upon having to join 3 different event types which require 3 different time scales and are at different levels of granularity?

Any help would be greatly received as I have been trying to do this for several weeks now and am completed stumped!!

Thanks

Original Query

index=50 (event="A")
    | fields device_id, event_id, _time
    | eval A_time=strftime(_time,"%Y/%m/%d %H:%M:%S")

    | join event_id type=left [search index=50 earliest=1 event="B"
        | fields device_id, event_id, _time, packet_size 
        | eval B_time=strftime(_time,"%Y/%m/%d %H:%M:%S")]

    | join device_id type=left [search index=50 earliest=1 event="C"
        | dedup device_id sortby -_time
        | fields device_id, contextual_data]
    | eval status=case(
                       isnull(A_time),"Delete",
                       isnull(B_time),"Open",
                       1=1,"Closed"
                      )
    | table device_id, event_id, A_time, B_time, packet_size, status, contextual_data

First Stats Query

index=50 (event="A") OR (event="B")

  | dedup event, device_id, event_id sortby -_time

  | eval A_time=if(event="A",_time,"")
  | eval B_time=if(event="B",_time,"")


  | eval packet_size=case(
                           event="A","",
                           event="B",packet_size
                           )                 

  | stats values(A_time) AS A_time, values(B_time) as B_time, values(packet_size) AS packet_size by device_id, event_id

  | convert num(A_time) AS A_time, num(B_time) AS B_time, num(packet_size) AS packet_size

  | eval A_time=strftime(A_time,"%Y/%m/%d %H:%M:%S")
  | eval B_time=strftime(B_time,"%Y/%m/%d %H:%M:%S")

  | eval status=case(
                    isnull(A_time),"Delete",
                    isnull(B_time),"Open",
                    1=1,"Closed"
                    )

Desired Output

DalJeanis · ‎08-10-2017

Please test this by hard coding the times as noted. the earliest time for event B only would need to be +2d if the same event and eventid might appear as a duplicate during the earlier 2d time frame, when NOT found in he later 2d time period.

index=50 (event="A" earliest=x latest=y) OR (event="B" earliest=x latest=y+2d) OR (event="C" )
| fields _time, event, device_id, event_id, packet_size, contextual_data

| rename COMMENT as "get only the most recent event A and B records, keeping all event C records"
| dedup event, device_id, event_id keepempty=t

| rename COMMENT as "get only the most recent event C record for each device, keeping any records that aren't event c..."
| eval killme=if(event="C","Y",null())
| dedup killme deviceid keepempty=t 

| rename COMMENT as "copy the contextual data to all records with that deviceID then delete the event C" 
| eventstats max(contextual_data) as contextual_data by device_id
| where isnull(killme)

| rename COMMENT as "now we are left with the most recent A and B records." 
| eval A_time = if(event="A",_time,null())
| eval B_time = if(event="B",_time,null())
| stats max(A_time) as A_time, 
    max(B_time) as B_time, 
    max(packet_size) as packet_size, 
    max(contextual_data) as contextual_data,
    values(event) as event  
    by device_id, event_id

| eval status=case(
                 isnull(A_time),"Delete",
                 isnull(B_time),"Open",
                 true(),"Closed"
                 )

| eval A_time=strftime(A_time,"%Y/%m/%d %H:%M:%S")
| eval B_time=strftime(B_time,"%Y/%m/%d %H:%M:%S")                      )
| table device_id, event_id, A_time, B_time, packet_size, status, contextual_data

View solution in original post

DalJeanis · ‎08-10-2017

Please test this by hard coding the times as noted. the earliest time for event B only would need to be +2d if the same event and eventid might appear as a duplicate during the earlier 2d time frame, when NOT found in he later 2d time period.

index=50 (event="A" earliest=x latest=y) OR (event="B" earliest=x latest=y+2d) OR (event="C" )
| fields _time, event, device_id, event_id, packet_size, contextual_data

| rename COMMENT as "get only the most recent event A and B records, keeping all event C records"
| dedup event, device_id, event_id keepempty=t

| rename COMMENT as "get only the most recent event C record for each device, keeping any records that aren't event c..."
| eval killme=if(event="C","Y",null())
| dedup killme deviceid keepempty=t 

| rename COMMENT as "copy the contextual data to all records with that deviceID then delete the event C" 
| eventstats max(contextual_data) as contextual_data by device_id
| where isnull(killme)

| rename COMMENT as "now we are left with the most recent A and B records." 
| eval A_time = if(event="A",_time,null())
| eval B_time = if(event="B",_time,null())
| stats max(A_time) as A_time, 
    max(B_time) as B_time, 
    max(packet_size) as packet_size, 
    max(contextual_data) as contextual_data,
    values(event) as event  
    by device_id, event_id

| eval status=case(
                 isnull(A_time),"Delete",
                 isnull(B_time),"Open",
                 true(),"Closed"
                 )

| eval A_time=strftime(A_time,"%Y/%m/%d %H:%M:%S")
| eval B_time=strftime(B_time,"%Y/%m/%d %H:%M:%S")                      )
| table device_id, event_id, A_time, B_time, packet_size, status, contextual_data

DrRich · ‎08-11-2017

Thanks for your response, very much appreciated and this has solved a number of the problems I was having, in particular the sub-search truncation issue. I've included the actual query below that I am now using based upon your answer.

I hope you don't mind but I have a couple of other points I'm now trying to clear up and wondered if you could point me in the right direction?

The initial search now has the 3 event types with some hard-coded date-time ranges. I need to get the following functionality probably using relative time modifiers.

selection of start events (type="Start") needs to be driven from the time picker which in-turn drives the selection of the remaining 2 event types. The stop events (type="Stop") require a time range with the same starting point the same as the start events and an end point finishing 2 days after the end point of the start events. The portal events are a tricky as they could be generated at any-point in the last 1-3 years hence why I'm using earliest=1.

This relates to the speed of the query which is being hindered by the fact that a Portal type event, which contains the contextual data, can be created at any-point in time. Some of the indexes that I am running this query against are small (<20,000 portal events) and so the query is pretty quick (circa 1-2 seconds) but some indexes are large (circa 0.75 million events) and consequently the the query is taking around 30-40 seconds. Essentially I would like to return the most recent Portal event for all device_mac_addresses where there is a start event. Is there a more efficient way of doing this that will correlate the events within the initial search?

Many thanks in advance

index=50 (type="Start" AND termination_cause!="Resumed" earliest=1502233200 latest=1502319600) 
  OR (type="Stop" AND termination_cause!="Suspect-Logout" earliest=1502233200 latest=1502492400) 
  OR (type="Portal" view="PortalView_Process_*" earliest=1 latest=1502492400)

          |rename COMMENT as "rename a select required fields"
  | rename user_agent.Browser as browser_type, user_agent.Device_Type as device_type
  | fields _time, type, device_mac_address, session_id, browser_type, device_type

          | rename COMMENT as "Get only the most recent event Start and Stop records, keeping all Portal events."
  | dedup type, device_mac_address, session_id keepempty=t

          | rename COMMENT as "Get only the most recent Portal events for each device, keeping any records that aren't Portal events."
  | eval killme=if(type="Portal","Y",null())
  | dedup killme device_mac_address keepempty=t

          | rename COMMENT as "copy the contextual data to all records with that deviceID then delete the event C" 
  | eventstats max(browser_type) as browser_type, max(device_type) as device_type, max(portal_event_data) as event_data by device_mac_address

          | rename COMMENT as "Remove potral events." 
  | where isnull(killme)

          | rename COMMENT as "Extract data from common fields."
  | eval session_start_time_unix=if(type="Start",_time,"")
  | eval session_stop_time_unix=if(type="Stop",_time,"")
  | eval session_start_time=strftime(session_start_time_unix,"%Y/%m/%d %H:%M:%S")
  | eval session_stop_time=strftime(session_stop_time_unix,"%Y/%m/%d %H:%M:%S")

          | rename COMMENT as "Join start and stop events based upon device_mac_address and session_id."
  | stats max(session_start_time) as session_start_time, 
      max(session_stop_time) as session_stop_time, 
      max(browser_type) as browser_type,
      max(device_type) as device_type,
      max(event_data) as event_data
      by device_mac_address, session_id

          | rename COMMENT as "Calculate session status and delete stop events where no start event exists."
  | eval session_status=case(
                            isnull(session_start_time) AND isnotnull(session_stop_time),"Delete",
                            isnull(session_stop_time),"Open",
                            1=1,"Closed"
                            )
  | where session_status!="Delete"

          | rename COMMENT as "Format data for presentation purposes."
  | sort device_mac_address, session_start_time, session_stop_time
  | streamstats count as row_no
  | table row_no, device_mac_address, session_id, session_start_time, session_stop_time, browser_type, device_type, session_status

DalJeanis · ‎08-14-2017

Hmmm. Efficiency of this kind of thing is going to be very data dependent.

We could build a pretty complicated thing in order to try to be "efficient", and end up shooting ourselves in the foot, leg, private parts and jaw.

Depending on how often the search was going to be run, and the relevant duration of searches, you could do a number of different things.

If the number of device mac addresses is always going to be extremely small relative to the total number of portal events, then it might MIGHT be efficient to pass the desired mac addresses to a subsearch, for example |appendpipe [-> table -> format -> map portal search]. However, I've never seen map run any faster than a regular search, and it usually seems slower.

An alternative would be to periodically generate a lookup table or summary index for your portal events, containing only the most recent context information. Except... that doesn't work for your use case, since you want the most recent prior context.... so it would have to be a time-based lookup table. These can be really fast, and a few million events shouldn't be too much of a problem.

DrRich · ‎08-15-2017

Thanks for this, I'll give it a try. I'm also looking at the time modifiers on my 3 event based searches to restrict it this way.

I'd like the time-modifier for the start events (date range A) to be driven from the time picker but I would like the date ranges for Stop &Portal events to start 1 day prior to and end 1 day after the time-picker date range. So for example if the 2nd Aug is picked on the time-picker to return start events I'd like to apply the date range 1st Aug - 3rd Aug for the Stop & Portal events. I've played around with relative time modifiers but haven't yet successfully managed to get them to be related to the time-picker. Any ideas?

Thanks again

 index=50 (type="Start" AND termination_cause!="Resumed"  {date range A}) 
   OR (type="Stop" AND termination_cause!="Suspect-Logout"  {date-range B}) 
   OR (type="Portal" view="PortalView_Process_*"  {date range C})

alemarzu · ‎08-11-2017

Did you try using multisearch ?

https://docs.splunk.com/Documentation/Splunk/6.6.1/SearchReference/Multisearch

DrRich · ‎08-14-2017

I've tried multi-search and this didn't seem to give anything over different to the above. Thanks

How can I use joins (or an alternative) for a search that joins 3 different event types each on different time scales?

Can’t make it to .conf25? Join us online!

Take Action Automatically on Splunk Alerts with Red Hat Ansible Automation Platform

Calling All Security Pros: Ready to Race Through Boston?

Beyond Detection: How Splunk and Cisco Integrated Security Platforms Transform ...

Are you a member of the Splunk Community?

How can I use joins (or an alternative) for a search that joins 3 different event types each on different time scales?

Can’t make it to .conf25? Join us online!

Take Action Automatically on Splunk Alerts with Red Hat Ansible Automation Platform

Calling All Security Pros: Ready to Race Through Boston?

Beyond Detection: How Splunk and Cisco Integrated Security Platforms Transform ...