I have three tables. Each has one or more ID fields (out of ID_A, ID_B, ID_C) and assigns values Xn, Yn, Zn to these IDs. In effect, the tables each contain a fragment of information from a set of objects 1...5.
Table X:
ID_A | ID_B | X1 | X2 |
A1 | B1 | X1_1 | X2_1 |
A2 | B2 | X1_2a | X2_2 |
A2 | B2 | X1_2b | X2_2 |
A3 | B3 | X1_3 | X2_3 |
Table Y:
ID_A | ID_B | Y1 | Y2 |
A2 | B2 | Y1_2 | |
A2 | B2 | Y2_2 | |
A3 | B3 | Y2_3a | |
A3 | B3 | Y2_3b | |
A4 | B4 | Y1_4 | Y2_4 |
Table Z:
ID_B | ID_C | Z1 |
B1 | C1 | Z1_1 |
B3 | C3 | Z1_3 |
B5 | C5 | Z1_5 |
How can I create the superset of all three tables, i.e. reconstruct the "full picture" about obects 1..5 as good as possible?
I tried with union and join in various ways, but I keep tripping over the following obstacles:
Desired result:
ID_A | ID_B | ID_C | X1 | X2 | Y1 | Y2 | Z1 |
A1 | B1 | C1 | X1_1 | X2_1 | Z1_1 | ||
A2 | B2 | X1_2a | X2_2 | Y1_2 | Y2_2 | ||
A2 | B2 | X1_2b | X2_2 | Y1_2 | Y2_2 | ||
A3 | B3 | X1_3 | X2_3 | Y2_3a | Z1_3 | ||
A3 | B3 | X1_3 | X2_3 | Y2_3b | Z1_3 | ||
A4 | B4 | Y1_4 | Y2_4 | ||||
B5 | C5 | Z1_5 |
Sample data:
| makeresults
| eval _raw="ID_A;ID_B;X1;X2
A1;B1;X1_1;X2_1
A2;B2;X1_2A;X2_2
A2;B2;X1_2B;X2_2
A3;B3;X1_3;X2_3
"
| multikv forceheader=1
| table ID_A, ID_B, X1, X2
| append [
| makeresults
| eval _raw="ID_A;ID_B;Y1;Y2
A2;B2;Y1_2;
A2;B2;;Y2_2
A3;B3;Y1_3;Y2_3A
A3;B3;Y1_3;Y2_3B
A4;B4;Y1_4;Y2_4
"
| multikv forceheader=1
| table ID_A, ID_B, Y1, Y2
]
| append [
| makeresults
| eval _raw="ID_B;ID_C;Z1
B1;C1;Z1_1
B3;C3;Z1_3
B5;C5;Z1_5
"
| multikv forceheader=1
| table ID_B, ID_C, Z1
]
| table ID_A, ID_B, ID_C, X1, X2, Y1, Y2, Z1
Small correction: In the result table, rows 4 and 5, ID_C should be "C3" (from table Z).
First of all, thanks for creating the dummy data, it was very useful, although slightly wrong.
I corrected the data to remove Y1_3 which doesn't exist in your tables nor your desired results.
Try something like this
| makeresults
| eval _raw="ID_A;ID_B;X1;X2
A1;B1;X1_1;X2_1
A2;B2;X1_2A;X2_2
A2;B2;X1_2B;X2_2
A3;B3;X1_3;X2_3
"
| multikv forceheader=1
| table ID_A, ID_B, X1, X2
| append [
| makeresults
| eval _raw="ID_A;ID_B;Y1;Y2
A2;B2;Y1_2;
A2;B2;;Y2_2
A3;B3;;Y2_3A
A3;B3;;Y2_3B
A4;B4;Y1_4;Y2_4
"
| multikv forceheader=1
| table ID_A, ID_B, Y1, Y2
]
| append [
| makeresults
| eval _raw="ID_B;ID_C;Z1
B1;C1;Z1_1
B3;C3;Z1_3
B5;C5;Z1_5
"
| multikv forceheader=1
| table ID_B, ID_C, Z1
]
| table ID_A, ID_B, ID_C, X1, X2, Y1, Y2, Z1
| eventstats values(ID_C) as ID_C values(Z1) as Z1 by ID_B
| eventstats values(X1) as X1 values(X2) as X2 values(Y1) as Y1 values(Y2) as Y2 by ID_A ID_B
| eventstats values(X1) as X1 values(X2) as X2 values(Y1) as Y1 values(Y2) as Y2 values(ID_A) as ID_A by ID_C ID_B
| mvexpand Y2
| mvexpand X1
| fillnull value="N/A"
| streamstats count by ID_A ID_B ID_C X1 X2 Y1 Y2 Z1
| where count==1
| foreach *
[| eval <<FIELD>>=if(<<FIELD>>=="N/A",null(),<<FIELD>>)]
| fields - count
Note that mvexpand can cause memory issues, so you need to check your job status to ensure the search works correctly with real data.
@ITWhisperer thanks for your reply. You have definitely put me on the right path.
My original data has more ID and attribute fields, so I needed to find a way to generalize your solution further. I had a bit of a struggle to understand why you chose these particular three eventstats lines with their respective values(...) and by ... clauses.
I believe a more generic recipe would be like this:
If you have n ID fields and k attribute fields, do one eventstats ... by ID for each of the n ID fields:
| eventstats
values(ID_2) as ID_2, .., values(ID_n) as ID_n,
values(attr_1) as attr_1, .., values(attr_k) as attr_k
by ID_1
| eventstats
values(ID_1) as ID_1, values(ID_3) as ID_3, .., values(ID_n) as ID_n,
values(attr_1) as attr_1, .., values(attr_k) as attr_k
by ID_2
| eventstats .. by ID_3
| eventstats .. by ID_n-1
| eventstats
values(ID_1) as ID_1, .., values(ID_n-1) as ID_n-1,
values(attr_1) as attr_1, .., values(attr_k) as attr_k
by ID_n
The subsequent mvexpand and streamtstats count ... where count==1 can be simplified as:
| fillnull value="N/A" ```or stats by fieldlist ignores rows with at least one null value in fieldlist```
| stats count by ID_1, ..., ID_n, attr_1, ..., attr_k | fields - count
| foreach * [ | eval <<FIELD>>=if(<<FIELD>>=="N/A",null(),<<FIELD>>) ] ```undo the N/A replacement```
This stanza even expands possible multivalues in all of the n attributes - otherwise each attribute field with potential multivalues would need its own explicit mvexpand <attr> . It also avoids mvexpand's potential memory issue.
For my own sample data, it would read:
| makeresults
| eval _raw="ID_A;ID_B;X1;X2
A1;B1;X1_1;X2_1
A2;B2;X1_2A;X2_2
A2;B2;X1_2B;X2_2
A3;B3;X1_3;X2_3
"
| multikv forceheader=1
| table ID_A, ID_B, X1, X2
| append [
| makeresults
| eval _raw="ID_A;ID_B;Y1;Y2
A2;B2;Y1_2;
A2;B2;;Y2_2
A3;B3;;Y2_3A
A3;B3;;Y2_3B
A4;B4;Y1_4;Y2_4
"
| multikv forceheader=1
| table ID_A, ID_B, Y1, Y2
]
| append [
| makeresults
| eval _raw="ID_B;ID_C;Z1
B1;C1;Z1_1
B3;C3;Z1_3
B5;C5;Z1_5
"
| multikv forceheader=1
| table ID_B, ID_C, Z1
]
| table ID_A, ID_B, ID_C, X1, X2, Y1, Y2, Z1
```--- relevant code starts here ---```
| eventstats
values(ID_B) as ID_B, values(ID_C) as ID_C,
values(X1) as X1, values(X2) as X2, values(Y1) as Y1, values(Y2) as Y2, values(Z1) as Z1
by ID_A
| eventstats
values(ID_A) as ID_A, values(ID_C) as ID_C,
values(X1) as X1, values(X2) as X2, values(Y1) as Y1, values(Y2) as Y2, values(Z1) as Z1
by ID_B
| eventstats
values(ID_A) as ID_A, values(ID_B) as ID_B,
values(X1) as X1, values(X2) as X2, values(Y1) as Y1, values(Y2) as Y2, values(Z1) as Z1
by ID_C
| fillnull value="N/A" ```or stats by fieldlist ignores rows with at least one null value in fieldlist```
| stats count by ID_A, ID_B, ID_C, X1, X2, Y1, Y2, Z1 | fields - count
| foreach * [ | eval <<FIELD>>=if(<<FIELD>>=="N/A",null(),<<FIELD>>) ] ```undo the N/A replacement```
Please let me know if I have overlooked something.