Hello Spelunkers,
I have a Splunk query problem that I can't seem to solve.
index=prod-web-apps sourcetype=csv-emailevents (EventName=delivered OR EventName=processed)
| head 100
| table EmailID, EventName, DateScheduled, DateSent, DateIndexed, Time.SendGrid
| Sort -EmailID
Screenshot:

What I want to do, is join the delivered event to the processed event on thier EmailID and then take the difference between thier respective DateIndexed epoch time as a new field as follows:
| eval TimeInSendGrid=DeliveredEvent.DateIndexed-ProcessedEvent.DateIndexed
Thanks in advance!
 
					
				
		
I wouldn't use join if this will work
index=prod-web-apps sourcetype=csv-emailevents (EventName=delivered OR EventName=processed)
| sort 0 EmailID  -EventName
| stats range(DateIndexed) as TimeInSendGrid by EmailID
But the above solution will eliminate all the other fields. And it doesn't really check to make sure that you have both a delivered event and a processed event. The following shows how you can retain other fields and also does the check:
index=prod-web-apps sourcetype=csv-emailevents (EventName=delivered OR EventName=processed)
| sort 0 EmailID  -EventName
| stats count last(EventName) as DeliveredEventName first(EventName) as ProcessedEventName last(DateIndexed) as DeliveredDateIndexed first(DateIndexed) as ProcessedDateIndexed range(DateIndexed) as TimeInSendGrid
| where count=2 
Both of these solutions avoid the limitations of the join command. But if you would prefer to use the join command:
index=prod-web-apps sourcetype=csv-emailevents  EventName=processed
| rename DateSent as ProcessedDateSent, DateIndexed as ProcessedDateIndexed
| join EmailID type=outer [ search index=prod-web-apps sourcetype=csv-emailevents EventName=delivered  
    | rename DateSent asDeliveredDateSent, DateIndexed as DeliveredDateIndexed]
| eval TimeInSendGrid=DeliveredDateIndexed-ProcessedDateIndexed
This will take about approximately twice as long to run, as Splunk must pass through the data twice (one for the main search and once for the subsearch). Note that you need to rename the fields, otherwise the results from the subsearch will overwrite fields of the same name from the main search.
HTH!
 
					
				
		
I wouldn't use join if this will work
index=prod-web-apps sourcetype=csv-emailevents (EventName=delivered OR EventName=processed)
| sort 0 EmailID  -EventName
| stats range(DateIndexed) as TimeInSendGrid by EmailID
But the above solution will eliminate all the other fields. And it doesn't really check to make sure that you have both a delivered event and a processed event. The following shows how you can retain other fields and also does the check:
index=prod-web-apps sourcetype=csv-emailevents (EventName=delivered OR EventName=processed)
| sort 0 EmailID  -EventName
| stats count last(EventName) as DeliveredEventName first(EventName) as ProcessedEventName last(DateIndexed) as DeliveredDateIndexed first(DateIndexed) as ProcessedDateIndexed range(DateIndexed) as TimeInSendGrid
| where count=2 
Both of these solutions avoid the limitations of the join command. But if you would prefer to use the join command:
index=prod-web-apps sourcetype=csv-emailevents  EventName=processed
| rename DateSent as ProcessedDateSent, DateIndexed as ProcessedDateIndexed
| join EmailID type=outer [ search index=prod-web-apps sourcetype=csv-emailevents EventName=delivered  
    | rename DateSent asDeliveredDateSent, DateIndexed as DeliveredDateIndexed]
| eval TimeInSendGrid=DeliveredDateIndexed-ProcessedDateIndexed
This will take about approximately twice as long to run, as Splunk must pass through the data twice (one for the main search and once for the subsearch). Note that you need to rename the fields, otherwise the results from the subsearch will overwrite fields of the same name from the main search.
HTH!
Hey Lisa,
Thank you for the prompt reply, but there is something strange in the second query where it is not grouping the stats by EmailID. I tried to add a "by EmailID" to it but that caused all sorts of other problems.
I was able to get the join working, but it's very slow. What am I doing wrong with the stats solution? I would prefer that if it is much faster.
Thanks,
Evan
 
					
				
		
Oh yeah, that is a problem - it should say "by EmailID". but on further consideration, I would probably do this for the second query:
 index=prod-web-apps sourcetype=csv-emailevents (EventName=delivered OR EventName=processed)
 | stats count latest(EventName) as DeliveredEventName earliest(EventName) as ProcessedEventName latest(DateIndexed) as DeliveredDateIndexed earliest(DateIndexed) as ProcessedDateIndexed range(DateIndexed) as TimeInSendGrid by EmailID
 | where count=2 
This should be faster and more accurate, as it simply assumes that the "processed" events will occur at an earlier time than the "delivered" events.
