Splunk Search

Looking to rewrite a join across several indexes with somewhat unusual relationships


(I am reposting this question from email, with permission from the person who emailed)

I need to basically join 3 indexes where the ‘join’ info is in a 4th index. The 3 former indexes have around 50000 entries while the 4th index around 500000.

The fields in the indexes are:

Indexes containing the data: Index A, B, C, …:  name status id

Index containing relationships:  Index REL:        id parent child

The parent and child values in REL match the id value in A, B and C.
And also note that the id values in A, B and C never collide, in other words the "single id space" of the REL events has no overlaps across the entities in A, B, C

From A to B is normally a 1 to many relation, and from B to C is a many-to-many relation.

The specific use case here is that A represents Applications, B represents Deployments of those applications, and C represents Servers implementing those deployments.

A server can be used for several deployments hence the many-to-many relation here.


The obvious way would be to do something like:




| rename id as A_id,  name as Aname, status as Astatus
| join max=0 type=left A_id
       [ | search index=REL | rename parent as A_id, child as B_id ]
| join max=0 type=left B_id
       [ | search index=B
         | rename is as B_id, name as Bname, status as Bstatus ]
| join max=0 type=left B_id
       [ | search index=REL | rename parent as B_id, child as C_id ]
| join max=0 type=left C_id
       [ | search index=C
         | rename id as C_id, name as Cname, status as Cstatus ]
| table Aname Astatus Bname Bstatus Cname Cstatus




This, of course, fails miserably because the join only returns 10000 results while the REL index has 400000 events…

I can rewrite the first join as:




index IN(A REL)
| eval parent=if(index=REL, parent, id),
       child =if(index=REL, child,  id)
| stats values(name) as Aname values(statu) as Astatus  values(child) as childs
    by  parent
| table Aname Astatus childs




But I’m at a loss how to squeeze in the other indexes and relation…
And I also have some hope that there's a way to avoid join entirely.

UPDATE: Here is a run-anywhere search to fabricate some sample input rows

| makeresults
| fields - _time
| eval data="1,A,appl1,,;2,A,appl2,,;3,D,depl1,,;4,D,depl2,,;5,D,depl3,,;6,S,serv1,,;7,S,serv2,,;8,S,serv3,,;9,S,serv4,,;10,R,,1,3;11,R,,2,4;12,R,,2,5;13,R,,3,6;14,R,,4,7;15,R,,5,8;16,R,,5,9",
     data=split(data, ";")
| mvexpand data
| rex field=data "(?<sys_id>[^,]*),(?<type>[^,]*),(?<name>[^,]*),(?<parent>[^,]*),(?<child>[^,]*)"
| fields sys_id type name parent child
| eval parent=if(parent=="",null(),parent), child=if(child=="",null(),child)

And the desired output,  is rows that map app1 to depl1 and serv1   and that map app2 to depl2,depl3 and serv2,serv3,serv4

Labels (1)
0 Karma


(I'm the person that originally emailed the question to sideview.)

Thanks all for the good discussion.


I did do an implementation using lookup tables. It looks much simpler but you depend on creating the lookups in the first place: 


index=A ...
| outputlookup A.csv



before combining the data:



| inputlookup A.csv
| rename id as A_id, name as A_name

| lookup REL.csv parent as A_id output child as B_id
| where isnotnull(B_id)
| mvexpand B_id

| lookup B.csv id as B_id output name as B_name
| where isnotnull(B_name)

| lookup REL.csv parent as B_id output child as C_id
| where isnotnull(C_id)
| mvexpand C_id

| lookup C.csv id as C_id output name as C_name
| mvexpand C_name

| fields - A_id B_id C_id
| sort  0 A_name B_name C_name



Although it  duplicates possibly a lot of data, this seems easier to understand and to extend to  even more relationships.

0 Karma


Simple to understand and simple to support are good decisions.

Well done.

0 Karma


I see this as a nontrivial version of Splunk soup.  I'd proceed like this...

index=A OR index=REL or index=B OR index=C
| fields index parent child name status id

| rename COMMENT as "double the REL records, levaing the others single"
| eval myfan=mvrange(0,if(index="REL",2,1))
| mvexpand myfan

| rename COMMENT as "set up match keys and data fields"
| eval A_id=case(index="A",id, index="REL" AND myfan=0,parent)
| eval B_id=case(index="B",id, index="REL" AND myfan=0,child, index="REL" AND myfan=1,parent)
| eval C_id=case(index="C",id, index="REL" AND myfan=1,child)
| eval A_name=case(index="A",name)
| eval A_status=case(index="A",status)
| eval B_name_status=case(index="B",name."!!!!".status)
| eval C_name_status=case(index="C",name."!!!!".status)

At this point records look like this

index=A id name status A_id A_name A_status
index=B id name status B_id B_name_status
index=C id name status C_id C_name_status

index=REL myfan=0 parent child A_id B_id
index=REL myfan=1 parent child B_id C_id


Then I'd proceed like this...

| rename COMMENT as "reduce to required fields with one of these two"
| fields - id name status parent child
| fields index myfan A_id A_name A_status B_id B_name_status C_id C_name_status

| rename COMMENT as "roll data from REL myfan 0 to A, then myfan=1 to A, tehn drop REL"
| eventstats values(eval(case(myfan=0,B_id)) as B_id by A_id
| eventstats values(eval(case(myfan=1,C_id)) as C_id by B_id
| where index!="REL"

| rename COMMENT as "now we have only A, B, C records, and the A records have all relevant keys."
| rename COMMENT as "Roll B record to A then drop B"
| eventstats values(B_name_status) as B_name_status by B_id
| where index!="B"

| rename COMMENT as "Roll C record to A then drop C"
| eventstats values(C_name_status) as C_name_status by C_id
| where index!="C"
| rename COMMENT as "Above could be a stats"

| rename COMMENT as "Add placeholders to handle potential NULLS"
| eval B_name_status=coalesce(B_name_status,"N/A!!!!N/A")
| eval C_name_status=coalesce(C_name_status,"N/A!!!!N/A")

| rename COMMENT as "split up the records, then the fields"
| mvexpand C_name_status
| mvexpand B_name_status
| eval B_name=mvindex(split(B_name_status,"!!!!"),0)
| eval B_status=mvindex(split(B_name_status,"!!!!"),1)
| eval C_name=mvindex(split(C_name_status,"!!!!"),0)
| eval C_status=mvindex(split(C_name_status,"!!!!"),1)

| rename COMMENT as "drop unneeded fields"
| table A_name A_status B_name B_status C_name C_status

That's all air code, so you'd have to shake it down with a small subset of the records before running the whole data set.  



0 Karma


Heads up: mvexpand has limits you might run into in this scenario:



0 Karma

One solution would be if we could shift the deployment stuff over onto the application rows,  and you need to shift the server stuff over as well.  Or going the other way, if we could shift all the pieces we need ontot he actual "relationship" rows, and then just do a stats that only picks things up from them.

Either way - that sort of feeling, where you need to copy a couple phases of things around using different ids,  is when you should think of eventstats or streamstats.

In this particular case I cannot make it very simple.   It involves a lot of "shifts" and it's ugly.



| makeresults
| fields - _time
| eval data="1,A,appl1,,;2,A,appl2,,;3,D,depl1,,;4,D,depl2,,;5,D,depl3,,;6,S,serv1,,;7,S,serv2,,;8,S,serv3,,;9,S,serv4,,;10,R,,1,3;11,R,,2,4;12,R,,2,5;13,R,,3,6;14,R,,4,7;15,R,,5,8;16,R,,5,9",
     data=split(data, ";")
| mvexpand data
| rex field=data "(?<sys_id>[^,]*),(?<type>[^,]*),(?<name>[^,]*),(?<parent>[^,]*),(?<child>[^,]*)"
| fields sys_id type name parent child
| eval parent=if(parent=="",null(),parent), child=if(child=="",null(),child)

| eval application_id=case(type="A",sys_id, type="R",parent),
       server_name = case(type="S",name)
| eventstats values(server_name) as server_name by server_id
| eval deployment_id=case(type=="D",sys_id,type=="R" AND isnull(server_name), child, true(),parent)
| eventstats values(server_id) as server_id values(deployment_id) as deployment_id by application_id
| fillnull deployment_id value=""
| mvexpand deployment_id
| eventstats values(deployment_name) as deployment_name values(server_name) as server_name by deployment_id
| eventstats values(application_name) as application_name by application_id
| fields application_name deployment_name server_name
| stats values(*) as * by application_name



The mocked up rows -- I had to add an extra eval to try and make it a little more like data.   emptystring values and null() values are pretty different.   I manually am turning some thing sback into null() values where it matters.

The big eval statement -- it's a little careless in how it hands out application_id but it doesnt matter - likewise with server_id
the next two lines - eventstats and then the eval that makes deployment_id -   deployment id is trickier but we basically use the absence of the server_name field to conclude that the "R" row is a application-to-deployment row.
the fillnull - mvexpand will throw away values that are actual nulls, so we do an ugly workaround to replace with emptystring values.
It's a bit of a shell game.  It copies the data from row to row,  step by step.  tedious and painful.
And if there's a way to replace it all with a simple pair of stats commands,  it's certainly beyond my abilities at the moment.

SO - there's another path.  And that's to break it into smaller problems.
I would go that way in this case.   Make a static file-based lookup mapping deployment id to deployment name.
likewise from server id to server name.   and from application id to application name.
Then you can have 3 explicit lookup statements, painting these names onto all the rows.  And between eval and lookup you can actually load up the "relationship" rows with ALL of the ids and names,  and then one simple stats can do it in a single fell swoop then.  I'll come back and write up specific SPL to flesh out this way.
Tags (1)
0 Karma
Did you miss .conf21 Virtual?

Good news! The event's keynotes and many of its breakout sessions are now available online, and still totally FREE!