I am trying to merge two datasets which are results of two different searches on a particular field value common to both. The field I want to merge on is not a 'primary key' of any of the datasets, and therefore there's multiple events in each of these datasets with a given value of this field. My expected result is that for each event in the first dataset with a particular value of that field, I will end up producing n events in the resulting dataset, where n is the number of events in the second dataset that have that particular value in the field. So for example, if I have 3 events with that field value in dataset A and 4 events with that particular field value in dataset B, then I expect to have 12 events in the result dataset (after the merge). What Splunk command/s would be useful to merge these datasets in this fashion?
Note that when using stats values(*) as * the values aggregation will remove duplicates and you will end with a sorted set of unique values for each field. If you use stats list(*) as * then you will get all values of all fields with each value in sequence in each field from the second data set including duplicates.
Note that there is a 100 event limit when using list().
The differentiation is important if you need to keep some kind of correlation between the values of each joined result based on your common key.
Note that you can combined these on any number of common keys between the two data sets as there is no concept of primary key as @yuanliu says.
Thanks @yuanliu and @bowesmana for your replies. While I thought that my original goal was clear, I have to agree with @yuanliu that I need to give a more concrete example of it (although I disagree that I was really thinking in SQL terms; I simply used the concept of 'primary key' to more succinctly say what I was trying to). I also believe that @bowesmana is getting closer to what I need. BTW, I am not very familiar with the use of OR in this fashion (are there any restrictions on the shape of those datasets? BTW, I happen to get an empty result set when I 'OR' my result sets), and also, I am not using the word 'dataset' in the official way (I simply meant a set of results from a Splunk query). So, for clarification purposes, here's what I have:
Dataset A is [ search ...... | table a, b, c ], so for example here's a list of results:
a1 b1 c1
a2 b2 c2
and dataset B is [search .... | table a, d, e, f] such as:
a1 d1 e1 f1
a1 d2 e2 f2
a1 d3 e3 f3
a2 d4 e4 f4
a2 d5 e5 f5
What I need as a result of a new search is a new set of results C that look like this:
a1 b1 c1 d1 e1 f1
a1 b1 c1 d2 e2 f2
a1 b1 c1 d3 e3 f3
a2 b2 c2 d4 e4 f4
a2 b2 c2 d5 e5 f5
Thanks!
Now that we are thinking in Splunk terms 😉, note that the ...... part in your illustration can make a difference in how best to construct a "solution". So, I assume that Dataset A and B are NOT from the same sources, e.g., fields b and d must come from different sources, different sourcetypes, from different periods of time, even different indices. Without such information, volunteers have to make assumptions that may or may not be helpful. Where such have biggest impact would be when two datasets come from differing indices and/or time periods.
For simplicity, I will assume a common scenario when both datasets come from the same index and same time period. Further assume that the only differentiating factor is sourcetype, A and B. An effective OR would be between these two.
index=common_index ((sourcetype = A) OR (sourcetype = B))
Now, the above is often expressed as
index=common_index sourcetype IN (A, B)
Meanwhile, you may often have additional, differing search terms for A and B. So you may want to keep those parentheses. For example, you may want to restrict events to only those with fully populated fields of interest,
index=common_index ((sourcetype = A a=* b=* c=*) OR (sourcetype = B a=* d=* e=* f=*))
Anyway, my previous post only demonstrated how to leverage any key as "primary key", but did not include the final step in an outer join. Here it is for your scenario:
| stats values(*) as * by a
| fields a b c d e f
| foreach *
[mvexpand <<FIELD>>]
Using your sample datasets, the output is
a | b | c | d | e | f |
a1 | b1 | c1 | d1 | e1 | f1 |
a1 | b1 | c1 | d1 | e1 | f2 |
a1 | b1 | c1 | d1 | e1 | f3 |
a1 | b1 | c1 | d1 | e2 | f1 |
a1 | b1 | c1 | d1 | e2 | f2 |
a1 | b1 | c1 | d1 | e2 | f3 |
a1 | b1 | c1 | d1 | e3 | f1 |
a1 | b1 | c1 | d1 | e3 | f2 |
a1 | b1 | c1 | d1 | e3 | f3 |
a1 | b1 | c1 | d2 | e1 | f1 |
a1 | b1 | c1 | d2 | e1 | f2 |
a1 | b1 | c1 | d2 | e1 | f3 |
a1 | b1 | c1 | d2 | e2 | f1 |
a1 | b1 | c1 | d2 | e2 | f2 |
a1 | b1 | c1 | d2 | e2 | f3 |
a1 | b1 | c1 | d2 | e3 | f1 |
a1 | b1 | c1 | d2 | e3 | f2 |
a1 | b1 | c1 | d2 | e3 | f3 |
a1 | b1 | c1 | d3 | e1 | f1 |
a1 | b1 | c1 | d3 | e1 | f2 |
Here is an emulation that you can play with and compare with real data
| makeresults
| eval _raw = "a,b,c
a1,b1,c1
a2,b2,c2"
| multikv forceheader=1
| fields - _* linecount
| eval sourcetype = "A"
| append
[makeresults
| eval _raw = "a,d,e,f
a1,d1,e1,f1
a1,d2,e2,f2
a1,d3,e3,f3
a2,d4,e4,f4
a2,d5,e5,f5"
| multikv forceheader=1
| fields - _* linecount
| eval sourcetype = "B"]
``` data emulation above ```
Thanks again for your response @yuanliu . This certainly clarifies why the OR would not work for me (these datasets are really timeless, and so the OR was resulting in an empty set). It also gives me some ideas of how to use the stats method (possibly in combination with append or similar) to try to get what I need, and your simulation comes in very handy for experimentation. Your particular solution does not really produce the results that I'm looking for (note that the resulting dataset in my example is ALL that I want as a result of the merge - nothing more), but using 'list' instead of 'values' appears to do so (just like @bowesmana suggested). Turns out that my needs changed since I posted and I happen to not need to produce this 'merge' query anymore, but I appreciate the help here and I think the suggestions were a good lesson on how to use some of the Splunk commands. Thanks again!
Hi
here is old answer about SQL join equivalence in SPL. https://community.splunk.com/t5/Splunk-Search/What-is-the-relation-between-the-Splunk-inner-left-joi...
r. Ismo
You are still thinking in SQL terms. Splunk's "schema on the fly" means there is no predefined "primary key" in any data. Any key, or combination of keys, or evaluation expression with any key or combination of keys, can be used as "primary key".
So for example, if I have 3 events with that field value in dataset A and 4 events with that particular field value in dataset B, then I expect to have 12 events in the result dataset (after the merge). What Splunk command/s would be useful to merge these datasets in this fashion?
It is much more profitable for you to share some sample data (anonymize as necessary, or use proper mock data, in text), then describe the desired output. From your verbal description, it seems that you want an outer join of sorts.
The simplest, and perhaps the fastest command would be stats. To illustrate, I assume that you want to preserve all other fields in both datasets.
(<dataset A>) OR (<dataset B>)
| stats values(*) as * by not_a_primary_key_but_is_common_in_both_A_and_B
Hope this helps