Help with self join

tly22 · ‎08-13-2024

Hi,

I have a single search that produces the following table where fieldA and fieldB are arbitrary strings that may be duplicate.

This is an exact representation of each event where each event may have a key "fieldA" or a key "fieldB" but not both but they always have an ID and Timestamp

Timestamp	ID	fieldA	fieldB
11115	1		"z"
11245	1	"a"
11378	1	"b"
11768	1		"d"
11879	1		"d"
12550	2	"c"
13580	2		"e"
15703	2		"f"
18690	3		"g"

and I need help to transform the data as follows.

ID	fieldA	fieldB
1	"b"	"d"
1	"b"	"d"
2	"c"	"e"
2	"c"	"f"
3		"g"

Thanks to suggestion below, I have tried `stats latest(fieldA) list(fieldB)` but I would prefer to not have any multivalued fields

For every distinct value for "fieldA", the latest record with that value would be kept and any records with that ID occuring before that record would be discard.

There is no requirement to have 2 searches. Hope that makes it more clear and easier.

ITWhisperer · ‎08-14-2024

Taking on your changes, please explain your logic for excluding "z" in your expected results. Also, your example does not have any duplicates so it is unclear, from the expected results, how you want duplicates treated. Having an accurate representation of your data might help clarify this.

Assuming "z" was supposed to be in the results, then my previous solution still works - the mvexpand expands the multivalue field created by list()

| makeresults format=csv data="Timestamp,ID,fieldA,fieldB
11115,1,,z
11245,1,a,
11378,1,b,
11768,1,,d
12550,2,c,
13580,2,,e
15703,2,,f
18690,3,,g"
| stats latest(fieldA) as fieldA list(fieldB) as fieldB by ID
| mvexpand fieldB

ITWhisperer · ‎08-14-2024

Where does 2-a2 come from?

Assuming values in fieldB are unique, you could try something like this

| makeresults format=csv data="ID,fieldA,fieldB
1,1-a1,
1,1-a2,
1,,1-b1
2,2-a1,
2,,2-b1
2,,2-b2
3,,1-b1"
``` The lines above emulate the data you shared ```
| stats latest(fieldA) as fieldA list(fieldB) as fieldB by ID
| mvexpand fieldB

yuanliu · ‎08-13-2024

Just as you say, Splunk is not SQL. So, forget join. Please let us know what is the nature of the two searches, how close are they? What are their search periods? Most of the time, you shouldn't run two separate searches, but instead, combine the two into one search, then try to get the result you need from that one search.

Criteria being if there are duplicate values in fieldA, only the row with the latest value is kept
and each row with fieldB joined to fieldA on same ID. or if there are no values for fieldA, just join with null/blank value

Ideally, we can also throw away all rows with col fieldB that have a timestamp earlier than fieldA but not a hard requirement if that adds too much complexity to the query

Here, you talk about "latest" and "earlier". But your mock data illustration contains no time information. How are volunteers supposed to help?

Now, if you MUST run the two searches separately, yes, there are ways to produce right join output in SPL without using join command which most Splunkers advise against. But let's start at the ABCs of asking answerable questions in a data analytics forum. (That's right, this is not a SQL forum.) Here are four golden rules that I call Four Commandments:

Illustrate data input (in raw text, anonymize as needed), whether they are raw events or output from a search that volunteers here do not have to look at.
Illustrate the desired output from illustrated data.
Explain the logic between illustrated data and desired output without SPL.
If you also illustrate attempted SPL, illustrate actual output and compare with desired output, explain why they look different to you if that is not painfully obvious.

Start from here.

tly22 · ‎08-14-2024

Thank you for the feedback, I have updated the post

tly22 · ‎08-13-2024

.

Help with self join

join

Monitoring Amazon Elastic Kubernetes Service (EKS)

Cloud Platform & Enterprise: Classic Dashboard Export Feature Deprecation

Explore the Latest Educational Offerings from Splunk (November Releases)