Solved: How to join data in one source with another join t...

s2jagrif · ‎06-08-2017

I am trying to join data in one source to another join that joins two searches. My goal is to capture VM information where VM resides on datastore within a datastore group. The flow would be:

vm - contains vm information I want and has producer_name which is the datastore where it resides

datastore - contains the database_store name (rename to producer_name) that I need in order to pull VM data that has entity_id that joins to a datastore group

entitygroupmembers - contains the group_name that I need for joining to the datastore information for those contained within.

The query returns information for the search datastore join with entitygroupmember but the moment I added the 3rd join adding vm to join with producer_name it returns zero rows. I am new to Splunk so any help would be greatly appreciated in helping to resolve this query.

Query Example:

index_vmturbo sourcetype="vmturbo:vm" | join producer_name max=0 [search index=vmturbo sourcetype="vmturbo:datastore" | fields datastore_name, entity_id | rename datastore_name as producer_name, entity_id as join_id] | join join_id [search index=vmturbo  sourcetype="vmturbo:entitygroupmembers" earliest=0 latest=now group_name=WDMix-SQL-TCL01-WM13-SQL-ALL-Database-Drives | fields entity_id | rename entity_id as join_id] | table vm_name

sideview · ‎06-08-2017

Well, you're far better off learning how to use the eval and stats commands to use join less here, quite possibly to avoid using join here at all.

The best thing about join, is how intuitive it seems when you start using it. That's kind of the only good thing though. There are a lot of reasons to avoid join. I won't go into them much but they include pitfalls where your results don't include all the data so the final numbers are wrong and in general join searches will be far slower and you'll be pulling lots of data back to the search head instead of doing work on the indexer nodes.

Let's take the first join separately.

index_vmturbo sourcetype="vmturbo:vm" 
| join producer_name max=0 [
  search index=vmturbo sourcetype="vmturbo:datastore" 
  | fields datastore_name, entity_id 
  | rename datastore_name as producer_name, entity_id as join_id
]

This can certainly be rewritten to be done with just eval and stats.

index=vmturbo (sourcetype="vmturbo:vm" OR  sourcetype="vmturbo:datastore") 
| eval producer_name=if(sourcetype="vmturbo:datastore",datastore_name,producer_name)
| eval join_id=if(sourcetype="vmturbo:datastore",entity_id,join_id)
| stats count by producer_name join_id

The second "join" can be rewritten to not use the join command by very similar methods, except....

that I see you're making that search run over all time. Presumably the main overall search is being run with some specific time range like "last 7 days".

From this point, there are absolutely still several good options open to still not use join here. Rather than guess which one might be appropriate, I'll wait for a comment back from you giving more details on the explicit timerange here, and then I'll update my answer.

UPDATE ----

Alright, if the "groupmember" data does indeed have to represent "all vm's ever", then I recommend just baking this off into something else to give you vastly better performance. The simplest and most versatile, but perhaps a little odd, is to bake the information into a lookup.

1) Create a lookup. Here I'm calling it "all_vms". If you've never done this, you'll have to peruse the Splunk docs around "file based lookups"
2) run this search over all time.

index=vmturbo  sourcetype="vmturbo:entitygroupmembers"  
| fields entity_id vm_name
| rename entity_id as join_id
| outputlookup all_vms

3) set up a scheduled search to run this slightly different search every so often to add new vm's to the list.

index=vmturbo  sourcetype="vmturbo:entitygroupmembers"  
| fields entity_id vm_name
| rename entity_id as join_id
| inputlookup append=t all_vms
| dedup entity_id vm_name
| outputlookup all_vms

At this point you'll have a lookup file with two columns entity_id and vm_name, it'll stay relatively up to date as long as nobody messes with that scheduled search.

(Other options are to make an oldschool summary index, and also to raise the ante and make an accelerated data model so you can use tstats )

Anyway, assuming you went with a nice simple (somewhat manual) lookup solution, you can then join the rows from this lookup onto our other search and then re-group everything with stats.

index=vmturbo (sourcetype="vmturbo:vm" OR  sourcetype="vmturbo:datastore") 
| eval producer_name=if(sourcetype="vmturbo:datastore",datastore_name,producer_name)
| eval join_id=if(sourcetype="vmturbo:datastore",entity_id,join_id)
| stats count by producer_name join_id
| inputlookup append=t all_vms
| stats sum(count) as count by producer_name join_id vm_name

Notes:

The real nitty gritty around the various fields and isn't really specified here, which is fine. But just note that in the last stats command you may actually need something a little different stats values(producer_name) values(vm_name) by join_id
Or there may well be other fields join was carrying along that are actually important, that we need to now explicitly pass through in our stats expressions.

For further reading, I gave a talk at conf2016 called "Let Stats Sort it Out", the slides for which you can find on the following page under "march 2015"
http://wiki.splunk.com/Virtual_.conf
Some of the examples are pretty advanced, and it assumes a fairly deep familiarity with eval and stats, but with all its examples you may find it useful.

View solution in original post

sideview · ‎06-08-2017

Well, you're far better off learning how to use the eval and stats commands to use join less here, quite possibly to avoid using join here at all.

The best thing about join, is how intuitive it seems when you start using it. That's kind of the only good thing though. There are a lot of reasons to avoid join. I won't go into them much but they include pitfalls where your results don't include all the data so the final numbers are wrong and in general join searches will be far slower and you'll be pulling lots of data back to the search head instead of doing work on the indexer nodes.

Let's take the first join separately.

index_vmturbo sourcetype="vmturbo:vm" 
| join producer_name max=0 [
  search index=vmturbo sourcetype="vmturbo:datastore" 
  | fields datastore_name, entity_id 
  | rename datastore_name as producer_name, entity_id as join_id
]

This can certainly be rewritten to be done with just eval and stats.

index=vmturbo (sourcetype="vmturbo:vm" OR  sourcetype="vmturbo:datastore") 
| eval producer_name=if(sourcetype="vmturbo:datastore",datastore_name,producer_name)
| eval join_id=if(sourcetype="vmturbo:datastore",entity_id,join_id)
| stats count by producer_name join_id

The second "join" can be rewritten to not use the join command by very similar methods, except....

that I see you're making that search run over all time. Presumably the main overall search is being run with some specific time range like "last 7 days".

From this point, there are absolutely still several good options open to still not use join here. Rather than guess which one might be appropriate, I'll wait for a comment back from you giving more details on the explicit timerange here, and then I'll update my answer.

UPDATE ----

Alright, if the "groupmember" data does indeed have to represent "all vm's ever", then I recommend just baking this off into something else to give you vastly better performance. The simplest and most versatile, but perhaps a little odd, is to bake the information into a lookup.

1) Create a lookup. Here I'm calling it "all_vms". If you've never done this, you'll have to peruse the Splunk docs around "file based lookups"
2) run this search over all time.

index=vmturbo  sourcetype="vmturbo:entitygroupmembers"  
| fields entity_id vm_name
| rename entity_id as join_id
| outputlookup all_vms

3) set up a scheduled search to run this slightly different search every so often to add new vm's to the list.

index=vmturbo  sourcetype="vmturbo:entitygroupmembers"  
| fields entity_id vm_name
| rename entity_id as join_id
| inputlookup append=t all_vms
| dedup entity_id vm_name
| outputlookup all_vms

At this point you'll have a lookup file with two columns entity_id and vm_name, it'll stay relatively up to date as long as nobody messes with that scheduled search.

(Other options are to make an oldschool summary index, and also to raise the ante and make an accelerated data model so you can use tstats )

Anyway, assuming you went with a nice simple (somewhat manual) lookup solution, you can then join the rows from this lookup onto our other search and then re-group everything with stats.

index=vmturbo (sourcetype="vmturbo:vm" OR  sourcetype="vmturbo:datastore") 
| eval producer_name=if(sourcetype="vmturbo:datastore",datastore_name,producer_name)
| eval join_id=if(sourcetype="vmturbo:datastore",entity_id,join_id)
| stats count by producer_name join_id
| inputlookup append=t all_vms
| stats sum(count) as count by producer_name join_id vm_name

Notes:

The real nitty gritty around the various fields and isn't really specified here, which is fine. But just note that in the last stats command you may actually need something a little different stats values(producer_name) values(vm_name) by join_id
Or there may well be other fields join was carrying along that are actually important, that we need to now explicitly pass through in our stats expressions.

For further reading, I gave a talk at conf2016 called "Let Stats Sort it Out", the slides for which you can find on the following page under "march 2015"
http://wiki.splunk.com/Virtual_.conf
Some of the examples are pretty advanced, and it assumes a fairly deep familiarity with eval and stats, but with all its examples you may find it useful.

s2jagrif · ‎06-09-2017

Sorry, one last thing. There are many group names, it will not always be WDMix-SQL-TCL01-WM13-SQL-ALL-Database-Drives. I was using one group name within the EntityGroupMembers table in order to test.

sideview · ‎06-09-2017

OK. I wasn't sure. I've updated my answer to remove the group_name searchterms.

s2jagrif · ‎06-09-2017

Thank you, sideview! The groupmember contains the mapping that allows me to group all the datastore by a particular group name. That is the only thing I have to look at over time because Splunk doesn't ingest any information for that entitygroupmembers source past the initial load and only adds a new record when another entity is added. VM and Datastore information is loaded multiple times daily because it is capturing their performance metrics. So, I don't have to look at VM or Datastore for lifetime I just need to do the entitygroupmembers for lifetime so I can map a VM to a Datastore based on the overall group name.

Example:

I select the group "WDMix-SQL-TCL01-WM13-SQL-ALL-Database-Drives" and want to see all the VMs that reside within that group. There is no relationship between VM and EntityGroupMembers. There is a relationship between VM and DataStores. There is also a relationship between EntityGroupMembers and DataStore. I have to Join DataStore and EntityGroupMembers to collect my "lookup" for all datastore within that group_name and output the datastore_name. Then, I have to join the VM to this "lookup" to match the producer_name to the datastore_name. This will provide me the outcome I need.

I hope this is a better explanation.

Thanks again for your help!
I hope this helps explain it a little better.

s2jagrif · ‎06-08-2017

Thank you for your response! I am noticing slow performance with joins and have been reading a bit how they are not the best choice but have been struggling with an alterative based solely on my inexperience with Splunk.

You are correct, I am looking at entity information over a lifetime because the data is static and only gets updated once a new entity is added. The VM and Datastore information are going to be based on a dashboard drop down but and could range a variety of ways. In my test, I was using a 24 hour period for the main data.

DalJeanis · ‎06-08-2017

If updates only occur when new entities are added, then it might be worth considering a summary index, so you aren't recalculating across all time every time you need to know what the current situation is.

s2jagrif · ‎06-09-2017

Interesting. I will look into that! Thanks!

How to join data in one source with another join that combines two searches?

October Community Champions: A Shoutout to Our Contributors!

Community Content Calendar, November Edition

Stay Connected: Your Guide to November Tech Talks, Office Hours, and Webinars!

Are you a member of the Splunk Community?

How to join data in one source with another join that combines two searches?

October Community Champions: A Shoutout to Our Contributors!

Community Content Calendar, November Edition

Stay Connected: Your Guide to November Tech Talks, Office Hours, and Webinars!