I have a set of events with the pattern that there's a single event A that pairs with many event Bs (based on a field, let's call it CorrelationId). The event A has a field that I want on all of the B events. The events can arrive in any order. We might have the following (event type followed by CorrelationId):
A (Correlation ID: 1, FieldToInject: 10)
B (Correlation ID: 1)
B (Correlation ID: 1)
B (Correlation ID: 1)
B (Correlation ID: 1)
B (Correlation ID: 2)
B (Correlation ID: 2)
B (Correlation ID: 3)
A (Correlation ID: 3, FieldToInject: 100)
A (Correlation ID: 2, FieldToInject: 50)
B (Correlation ID: 2)
B (Correlation ID: 2)
B (Correlation ID: 3)
B (Correlation ID: 3)
The new output should look like:
A (Correlation ID: 1, FieldToInject: 10)
B (Correlation ID: 1, FieldToInject: 10)
B (Correlation ID: 1, FieldToInject: 10)
B (Correlation ID: 1, FieldToInject: 10)
B (Correlation ID: 1, FieldToInject: 10)
B (Correlation ID: 2, FieldToInject: 50)
B (Correlation ID: 2, FieldToInject: 50)
B (Correlation ID: 3, FieldToInject: 100)
A (Correlation ID: 3, FieldToInject: 100)
A (Correlation ID: 2, FieldToInject: 50)
B (Correlation ID: 2, FieldToInject: 50)
B (Correlation ID: 2, FieldToInject: 50)
B (Correlation ID: 3, FieldToInject: 100)
B (Correlation ID: 3, FieldToInject: 100)
There are a couple of ways I can think of to do this. I could use an aggregation:
eventstats first(FieldToInject1) AS FieldToInject1, first(FieldToInject2) AS FieldToInject2 BY CorrelationId
That works, but I don't imagine it is very efficient - I know all of the events will arrive within a short window, but this call will keep looking for events with a given CorrelationId throughout the entire search.
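(For the single FieldToInject in the sample above, that boils down to:
... | eventstats first(FieldToInject) AS FieldToInject BY CorrelationId
My real data has two fields to carry over, hence the two first() calls.)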
The other obvious option is a transaction:
transaction CorrelationId maxspan=1m
The problem here is that, because we have more than one B event, I need to play games of zipping up multivalue fields and then mvexpanding them to make any sense of things.
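For concreteness, the zipping I mean looks something like this sketch, where SomeBField and OtherBField are stand-ins for the per-B fields that the transaction collapses into multivalue fields:
... | transaction CorrelationId maxspan=1m
| eval zipped=mvzip(SomeBField, OtherBField)
| mvexpand zipped
| eval SomeBField=mvindex(split(zipped, ","), 0), OtherBField=mvindex(split(zipped, ","), 1)
| fields - zipped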
Is there a more natural way folks would recommend attempting to do something like this?
Your eventstats solution is perfect and I would use that. If you'd like to experiment, and you are sure that you can isolate an appropriate time window, you could try something like this:
... | streamstats time_window=200s first(FieldToInject1) AS FieldToInject1, first(FieldToInject2) AS FieldToInject2 BY CorrelationId
Depending on how tightly you can tune it, this could be more efficient but could also be way less so.
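With just the one field from the sample, that reads:
... | streamstats time_window=200s first(FieldToInject) AS FieldToInject BY CorrelationId
One caveat worth noting: streamstats only sees events that have already passed through the stream, so a B that lands before its A in search order won't pick the value up, however the window is tuned.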
@doweaver, for the sample data provided in the question, can you please state the desired output after correlation?
Updated with desired output.
The command that you probably need is filldown, and possibly selfjoin. Take a very careful look at each one and play around.
I don't think filldown works - that would blindly insert the most recently seen value for "FieldToInject", which means it would fail whenever the stream isn't strictly an A followed by all of its matching Bs, repeated. In the sample above, for example, the first B with Correlation ID 3 arrives before its A, so it would inherit 10 from Correlation ID 1's A.
selfjoin doesn't seem like the answer here either, since I just want to join A to all Bs, and not Bs to Bs. I'm not seeing anything giving me that flexibility in the doc, but I'll keep looking.
@doweaver, I think filldown will work for you provided you have sorted your data by Correlation_ID and also sorted events so that event A comes before event B. I have tried the following run-anywhere search, which mocks the sample data provided in the question and produces the desired output.
| makeresults
| eval data="event=A,Correlation_ID=1,eventToInject=10;event=B,Correlation_ID=1;event=B,Correlation_ID=1;event=B,Correlation_ID=1;event=B,Correlation_ID=1;event=B,Correlation_ID=2;event=B,Correlation_ID=2;event=B,Correlation_ID=3;event=A,Correlation_ID=3,eventToInject=100;event=A,Correlation_ID=2,eventToInject=50;event=B,Correlation_ID=2;event=B,Correlation_ID=2;event=B,Correlation_ID=3;event=B,Correlation_ID=3;"
| makemv data delim=";"
| mvexpand data
| rename data as _raw
| kv
| table event Correlation_ID eventToInject
| sort Correlation_ID event
| filldown eventToInject
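For contrast, if you run the same mock without the | sort line, you reproduce the failure @doweaver described: the first Correlation_ID=3 B event inherits 10 from Correlation_ID 1's A. The sort is what makes filldown safe here.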
Your phrasing of the issue seems a bit odd - "keep looking for events throughout the search"? The entire file will be passed once and the values collected. first(), earliest(), last() and latest() will have slightly different effects, with first() being marginally more efficient, but the entire file gets passed once, and has to be in Splunk.
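For example, with the question's field names, the search-order and time-based flavors look identical in use:
... | eventstats first(FieldToInject) AS FieldToInject BY CorrelationId
... | eventstats earliest(FieldToInject) AS FieldToInject BY CorrelationId
first() grabs the first value seen in search order with no timestamp comparison (hence marginally more efficient), while earliest() and latest() compare _time. Since each CorrelationId has exactly one A carrying the field, they all return the same value here.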
The basic method for this, sketched below, is...
1) Select all events you might need.
2) Roll the A data onto the B records.
3) Drop the A records.
4) Process the B records.
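A minimal sketch of that shape, using the question's field names plus a hypothetical event field to distinguish As from Bs - the four pipeline stages map to the four steps above:
... (event=A OR event=B)
| eventstats first(FieldToInject) AS FieldToInject BY CorrelationId
| where event="B"
| stats count BY CorrelationId, FieldToInject
The closing stats is only a stand-in for whatever processing the B records actually need.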
Yeah, my phrasing here is a product of my lack of understanding of what's happening under the covers.
It sounds like you're suggesting the aggregation approach, which should work. The main reason I asked this question is a "eureka" moment I had a few months ago: I was doing a lot of gross "stats first(x), first(y) BY Z" on things when I only expected a single event per Z... and then I discovered that xyseries was the "right" way to do that. I was wondering if there's a similar "right" way to do what I wanted above, but it sounds like aggregating with "first" is the best there is.