Splunk Search

how to merge two datasets on a non 'primary key' field ?

pgoldweic
Communicator

I am trying to merge two datasets which are results of two different searches on a particular field value common to both. The field I want to merge on is not a 'primary key' of any of the datasets, and therefore there's multiple events in each of these datasets with a given value of this field. My expected result is that for each event in the first dataset with a particular value of that field, I will end up producing n events in the resulting dataset, where n is the number of events in the second dataset that have that particular value in the field. So for example, if I have 3 events with that field value in dataset A and 4 events with that particular field value in dataset B, then I expect to have 12 events in the result dataset (after the merge). What Splunk command/s would be useful to merge these datasets in this fashion? 

Labels (1)
Tags (1)
0 Karma

bowesmana
SplunkTrust
SplunkTrust

Note that when using stats values(*) as * the values aggregation will remove duplicates and you will end with a sorted set of unique values for each field. If you use stats list(*) as * then you will get all values of all fields with each value in sequence in each field from the second data set including duplicates.

Note that there is a 100 event limit when using list().

The differentiation is important if you need to keep some kind of correlation between the values of each joined result based on your common key.

Note that you can combined these on any number of common keys between the two data sets as there is no concept of primary key as @yuanliu says.

pgoldweic
Communicator

Thanks @yuanliu and @bowesmana for your replies. While I thought that my original goal was clear, I have to agree with @yuanliu that I need to give a more concrete example of it (although I disagree that I was really thinking in SQL terms; I simply used the concept of 'primary key' to more succinctly say what I was trying to). I also believe that @bowesmana is getting closer to what I need. BTW, I am not very familiar with the use of OR in this fashion (are there any restrictions on the shape of those datasets? BTW, I happen to get an empty result set when I 'OR' my result sets), and also, I am not using the word 'dataset' in the official way (I simply meant a set of results from a Splunk query). So, for clarification purposes, here's what I have:

Dataset A is [ search ...... | table a, b, c ], so for example here's a list of results: 

 a1 b1 c1

 a2 b2 c2

and dataset B is [search .... | table a, d, e, f]  such as:

a1 d1 e1 f1

a1 d2 e2 f2

a1 d3 e3 f3

a2 d4 e4 f4

a2 d5 e5 f5

What I need as a result of a new search is a new set of results C that look like this:

a1 b1 c1 d1 e1 f1

a1 b1 c1 d2 e2 f2

a1 b1 c1 d3 e3 f3

a2 b2 c2 d4 e4 f4

a2 b2 c2 d5 e5 f5

Thanks!

0 Karma

yuanliu
SplunkTrust
SplunkTrust

Now that we are thinking in Splunk terms 😉, note that the ...... part in your illustration can make a difference in how best to construct a "solution".  So, I assume that Dataset A and B are NOT from the same sources, e.g., fields b and d must come from different sources, different sourcetypes, from different periods of time, even different indices.  Without such information, volunteers have to make assumptions that may or may not be helpful.  Where such have biggest impact would be when two datasets come from differing indices and/or time periods.

For simplicity, I will assume a common scenario when both datasets come from the same index and same time period.  Further assume that the only differentiating factor is sourcetype, A and B.  An effective OR would be between these two.

 

index=common_index ((sourcetype = A) OR (sourcetype = B))

 

Now, the above is often expressed as

 

index=common_index sourcetype IN (A, B)

 

Meanwhile, you may often have additional, differing search terms for A and B.  So you may want to keep those parentheses.  For example, you may want to restrict events to only those with fully populated fields of interest,

 

index=common_index ((sourcetype = A a=* b=* c=*) OR (sourcetype = B a=* d=* e=* f=*))

 

Anyway, my previous post only demonstrated how to leverage any key as "primary key", but did not include the final step in an outer join.  Here it is for your scenario:

 

| stats values(*) as * by a
| fields a b c d e f
| foreach *
    [mvexpand <<FIELD>>]

 

Using your sample datasets, the output is

abcdef
a1b1c1d1e1f1
a1b1c1d1e1f2
a1b1c1d1e1f3
a1b1c1d1e2f1
a1b1c1d1e2f2
a1b1c1d1e2f3
a1b1c1d1e3f1
a1b1c1d1e3f2
a1b1c1d1e3f3
a1b1c1d2e1f1
a1b1c1d2e1f2
a1b1c1d2e1f3
a1b1c1d2e2f1
a1b1c1d2e2f2
a1b1c1d2e2f3
a1b1c1d2e3f1
a1b1c1d2e3f2
a1b1c1d2e3f3
a1b1c1d3e1f1
a1b1c1d3e1f2

Here is an emulation that you can play with and compare with real data

 

| makeresults
| eval _raw = "a,b,c
a1,b1,c1
a2,b2,c2"
| multikv forceheader=1
| fields - _* linecount
| eval sourcetype = "A"
| append
    [makeresults
    | eval _raw = "a,d,e,f
a1,d1,e1,f1
a1,d2,e2,f2
a1,d3,e3,f3
a2,d4,e4,f4
a2,d5,e5,f5"
    | multikv forceheader=1
    | fields - _* linecount
    | eval sourcetype = "B"]
``` data emulation above ```

 

pgoldweic
Communicator

Thanks again for your response @yuanliu . This certainly clarifies why the OR would not work for me (these datasets are really timeless, and so the OR was resulting in an empty set). It also gives me some ideas of how to use the stats method (possibly in combination with append or similar) to try to get what I need, and your simulation comes in very handy for experimentation. Your particular solution does not really produce the results that I'm looking for (note that the resulting dataset in my example is ALL that I want as a result of the merge - nothing more), but using 'list' instead of 'values' appears to do so (just like @bowesmana suggested). Turns out that my needs changed since I posted and I happen to not need to produce this 'merge' query anymore, but I appreciate the help here and I think the suggestions were a good lesson on how to use some of the Splunk commands. Thanks again!

0 Karma

isoutamo
SplunkTrust
SplunkTrust

yuanliu
SplunkTrust
SplunkTrust

You are still thinking in SQL terms.  Splunk's "schema on the fly" means there is no predefined "primary key" in any data.  Any key, or combination of keys, or evaluation expression with any key or combination of keys, can be used as "primary key".


So for example, if I have 3 events with that field value in dataset A and 4 events with that particular field value in dataset B, then I expect to have 12 events in the result dataset (after the merge). What Splunk command/s would be useful to merge these datasets in this fashion? 

It is much more profitable for you to share some sample data (anonymize as necessary, or use proper mock data, in text), then describe the desired output.  From your verbal description, it seems that you want an outer join of sorts.

The simplest, and perhaps the fastest command would be stats.  To illustrate, I assume that you want to preserve all other fields in both datasets.

(<dataset A>) OR (<dataset B>)
| stats values(*) as * by not_a_primary_key_but_is_common_in_both_A_and_B

Hope this helps

Tags (1)
0 Karma
Get Updates on the Splunk Community!

Join Us for Splunk University and Get Your Bootcamp Game On!

If you know, you know! Splunk University is the vibe this summer so register today for bootcamps galore ...

.conf24 | Learning Tracks for Security, Observability, Platform, and Developers!

.conf24 is taking place at The Venetian in Las Vegas from June 11 - 14. Continue reading to learn about the ...

Announcing Scheduled Export GA for Dashboard Studio

We're excited to announce the general availability of Scheduled Export for Dashboard Studio. Starting in ...