Superset of data fragments / Compress diagonal dat...

rikinet · ‎03-26-2024

I have three tables. Each has one or more ID fields (out of ID_A, ID_B, ID_C) and assigns values Xn, Yn, Zn to these IDs. In effect, the tables each contain a fragment of information from a set of objects 1...5.

Table X:

ID_A	ID_B	X1	X2
A1	B1	X1_1	X2_1
A2	B2	X1_2a	X2_2
A2	B2	X1_2b	X2_2
A3	B3	X1_3	X2_3

Table Y:

ID_A	ID_B	Y1	Y2
A2	B2	Y1_2
A2	B2		Y2_2
A3	B3		Y2_3a
A3	B3		Y2_3b
A4	B4	Y1_4	Y2_4

Table Z:

ID_B	ID_C	Z1
B1	C1	Z1_1
B3	C3	Z1_3
B5	C5	Z1_5

How can I create the superset of all three tables, i.e. reconstruct the "full picture" about obects 1..5 as good as possible?

I tried with union and join in various ways, but I keep tripping over the following obstacles:

The 1:n relation between ID and values (which should remain expanded as individual rows)
Empty fields in between (bad for stats list(...) or stats values(...) because of different-sized MV results)
There is no single table that has references to all objects (e.g. object 5 only present in table Z).

Desired result:

ID_A	ID_B	ID_C	X1	X2	Y1	Y2	Z1
A1	B1	C1	X1_1	X2_1			Z1_1
A2	B2		X1_2a	X2_2	Y1_2	Y2_2
A2	B2		X1_2b	X2_2	Y1_2	Y2_2
A3	B3		X1_3	X2_3		Y2_3a	Z1_3
A3	B3		X1_3	X2_3		Y2_3b	Z1_3
A4	B4				Y1_4	Y2_4
	B5	C5					Z1_5

Sample data:

| makeresults
| eval _raw="ID_A;ID_B;X1;X2
A1;B1;X1_1;X2_1
A2;B2;X1_2A;X2_2
A2;B2;X1_2B;X2_2
A3;B3;X1_3;X2_3
"
| multikv forceheader=1
| table ID_A, ID_B, X1, X2

| append [
| makeresults
| eval _raw="ID_A;ID_B;Y1;Y2
A2;B2;Y1_2;
A2;B2;;Y2_2
A3;B3;Y1_3;Y2_3A
A3;B3;Y1_3;Y2_3B
A4;B4;Y1_4;Y2_4
"
| multikv forceheader=1
| table ID_A, ID_B, Y1, Y2
]

| append [
| makeresults
| eval _raw="ID_B;ID_C;Z1
B1;C1;Z1_1
B3;C3;Z1_3
B5;C5;Z1_5
"
| multikv forceheader=1
| table ID_B, ID_C, Z1
]

| table ID_A, ID_B, ID_C, X1, X2, Y1, Y2, Z1

rikinet · ‎03-27-2024

Small correction: In the result table, rows 4 and 5, ID_C should be "C3" (from table Z).

ITWhisperer · ‎03-27-2024

First of all, thanks for creating the dummy data, it was very useful, although slightly wrong.

I corrected the data to remove Y1_3 which doesn't exist in your tables nor your desired results.

Try something like this

| makeresults
| eval _raw="ID_A;ID_B;X1;X2
A1;B1;X1_1;X2_1
A2;B2;X1_2A;X2_2
A2;B2;X1_2B;X2_2
A3;B3;X1_3;X2_3
"
| multikv forceheader=1
| table ID_A, ID_B, X1, X2

| append [
| makeresults
| eval _raw="ID_A;ID_B;Y1;Y2
A2;B2;Y1_2;
A2;B2;;Y2_2
A3;B3;;Y2_3A
A3;B3;;Y2_3B
A4;B4;Y1_4;Y2_4
"
| multikv forceheader=1
| table ID_A, ID_B, Y1, Y2
]

| append [
| makeresults
| eval _raw="ID_B;ID_C;Z1
B1;C1;Z1_1
B3;C3;Z1_3
B5;C5;Z1_5
"
| multikv forceheader=1
| table ID_B, ID_C, Z1
]

| table ID_A, ID_B, ID_C, X1, X2, Y1, Y2, Z1
| eventstats values(ID_C) as ID_C values(Z1) as Z1 by ID_B
| eventstats values(X1) as X1 values(X2) as X2 values(Y1) as Y1 values(Y2) as Y2 by ID_A ID_B
| eventstats values(X1) as X1 values(X2) as X2 values(Y1) as Y1 values(Y2) as Y2 values(ID_A) as ID_A by ID_C ID_B
| mvexpand Y2
| mvexpand X1
| fillnull value="N/A"
| streamstats count by ID_A ID_B ID_C X1 X2 Y1 Y2 Z1
| where count==1
| foreach *
    [| eval <<FIELD>>=if(<<FIELD>>=="N/A",null(),<<FIELD>>)]
| fields - count

Note that mvexpand can cause memory issues, so you need to check your job status to ensure the search works correctly with real data.

rikinet · ‎04-03-2024

@ITWhisperer thanks for your reply. You have definitely put me on the right path.

My original data has more ID and attribute fields, so I needed to find a way to generalize your solution further. I had a bit of a struggle to understand why you chose these particular three eventstats lines with their respective values(...) and by ... clauses.

I believe a more generic recipe would be like this:

If you have n ID fields and k attribute fields, do one eventstats ... by ID for each of the n ID fields:

| eventstats
  values(ID_2) as ID_2, .., values(ID_n) as ID_n,
  values(attr_1) as attr_1, .., values(attr_k) as attr_k
  by ID_1
| eventstats
  values(ID_1) as ID_1, values(ID_3) as ID_3, .., values(ID_n) as ID_n,
  values(attr_1) as attr_1, .., values(attr_k) as attr_k
  by ID_2
| eventstats .. by ID_3
| eventstats .. by ID_n-1
| eventstats
  values(ID_1) as ID_1, .., values(ID_n-1) as ID_n-1,
  values(attr_1) as attr_1, .., values(attr_k) as attr_k
  by ID_n

The subsequent mvexpand and streamtstats count ... where count==1 can be simplified as:

| fillnull value="N/A" ```or stats by fieldlist ignores rows with at least one null value in fieldlist```
| stats count by ID_1, ..., ID_n, attr_1, ..., attr_k | fields - count
| foreach * [ | eval <<FIELD>>=if(<<FIELD>>=="N/A",null(),<<FIELD>>) ] ```undo the N/A replacement```

This stanza even expands possible multivalues in all of the n attributes - otherwise each attribute field with potential multivalues would need its own explicit mvexpand <attr> . It also avoids mvexpand's potential memory issue.

For my own sample data, it would read:

| makeresults
| eval _raw="ID_A;ID_B;X1;X2
A1;B1;X1_1;X2_1
A2;B2;X1_2A;X2_2
A2;B2;X1_2B;X2_2
A3;B3;X1_3;X2_3
"
| multikv forceheader=1
| table ID_A, ID_B, X1, X2

| append [
| makeresults
| eval _raw="ID_A;ID_B;Y1;Y2
A2;B2;Y1_2;
A2;B2;;Y2_2
A3;B3;;Y2_3A
A3;B3;;Y2_3B
A4;B4;Y1_4;Y2_4
"
| multikv forceheader=1
| table ID_A, ID_B, Y1, Y2
]

| append [
| makeresults
| eval _raw="ID_B;ID_C;Z1
B1;C1;Z1_1
B3;C3;Z1_3
B5;C5;Z1_5
"
| multikv forceheader=1
| table ID_B, ID_C, Z1
]
| table ID_A, ID_B, ID_C, X1, X2, Y1, Y2, Z1

```--- relevant code starts here ---```

| eventstats
  values(ID_B) as ID_B, values(ID_C) as ID_C,
  values(X1) as X1, values(X2) as X2, values(Y1) as Y1, values(Y2) as Y2, values(Z1) as Z1
  by ID_A
| eventstats
  values(ID_A) as ID_A, values(ID_C) as ID_C,
  values(X1) as X1, values(X2) as X2, values(Y1) as Y1, values(Y2) as Y2, values(Z1) as Z1
  by ID_B
| eventstats
  values(ID_A) as ID_A, values(ID_B) as ID_B,
  values(X1) as X1, values(X2) as X2, values(Y1) as Y1, values(Y2) as Y2, values(Z1) as Z1
  by ID_C

| fillnull value="N/A" ```or stats by fieldlist ignores rows with at least one null value in fieldlist```
| stats count by ID_A, ID_B, ID_C, X1, X2, Y1, Y2, Z1 | fields - count
| foreach * [ | eval <<FIELD>>=if(<<FIELD>>=="N/A",null(),<<FIELD>>) ] ```undo the N/A replacement```

Please let me know if I have overlooked something.

Superset of data fragments / Compress diagonal data with extra obstacles

fields

join

stats

table

[Puzzles] Solve, Learn, Repeat: Dynamic formatting from XML events

Enter the Agentic Era with Splunk AI Assistant for SPL 1.4

Stronger Security with Federated Search for S3, GCP SQL & Australian Threat ...

Join the Conversation