Hi,
I have a single search that produces the following table where fieldA and fieldB are arbitrary strings that may be duplicate.
This is an exact representation of each event where each event may have a key "fieldA" or a key "fieldB" but not both but they always have an ID and Timestamp
Timestamp | ID | fieldA | fieldB |
11115 | 1 | "z" | |
11245 | 1 | "a" | |
11378 | 1 | "b" | |
11768 | 1 | "d" | |
11879 | 1 | "d" | |
12550 | 2 | "c" | |
13580 | 2 | "e" | |
15703 | 2 | "f" | |
18690 | 3 | "g" |
and I need help to transform the data as follows.
ID | fieldA | fieldB |
1 | "b" | "d" |
1 | "b" | "d" |
2 | "c" | "e" |
2 | "c" | "f" |
3 | "g" |
Thanks to suggestion below, I have tried `stats latest(fieldA) list(fieldB)` but I would prefer to not have any multivalued fields
For every distinct value for "fieldA", the latest record with that value would be kept and any records with that ID occuring before that record would be discard.
There is no requirement to have 2 searches. Hope that makes it more clear and easier.
Taking on your changes, please explain your logic for excluding "z" in your expected results. Also, your example does not have any duplicates so it is unclear, from the expected results, how you want duplicates treated. Having an accurate representation of your data might help clarify this.
Assuming "z" was supposed to be in the results, then my previous solution still works - the mvexpand expands the multivalue field created by list()
| makeresults format=csv data="Timestamp,ID,fieldA,fieldB
11115,1,,z
11245,1,a,
11378,1,b,
11768,1,,d
12550,2,c,
13580,2,,e
15703,2,,f
18690,3,,g"
| stats latest(fieldA) as fieldA list(fieldB) as fieldB by ID
| mvexpand fieldB
Where does 2-a2 come from?
Assuming values in fieldB are unique, you could try something like this
| makeresults format=csv data="ID,fieldA,fieldB
1,1-a1,
1,1-a2,
1,,1-b1
2,2-a1,
2,,2-b1
2,,2-b2
3,,1-b1"
``` The lines above emulate the data you shared ```
| stats latest(fieldA) as fieldA list(fieldB) as fieldB by ID
| mvexpand fieldB
Just as you say, Splunk is not SQL. So, forget join. Please let us know what is the nature of the two searches, how close are they? What are their search periods? Most of the time, you shouldn't run two separate searches, but instead, combine the two into one search, then try to get the result you need from that one search.
Criteria being if there are duplicate values in fieldA, only the row with the latest value is kept
and each row with fieldB joined to fieldA on same ID. or if there are no values for fieldA, just join with null/blank value
Ideally, we can also throw away all rows with col fieldB that have a timestamp earlier than fieldA but not a hard requirement if that adds too much complexity to the query
Here, you talk about "latest" and "earlier". But your mock data illustration contains no time information. How are volunteers supposed to help?
Now, if you MUST run the two searches separately, yes, there are ways to produce right join output in SPL without using join command which most Splunkers advise against. But let's start at the ABCs of asking answerable questions in a data analytics forum. (That's right, this is not a SQL forum.) Here are four golden rules that I call Four Commandments:
Start from here.
Thank you for the feedback, I have updated the post
.