I have 2 large data sets
Data Set 1 (Assets) contains information about devices. For example the dataset will have the following fields
Data Set 2 (Software) contains software installed for those devices (I did a check on the number of events and there is somewhere around 4.5 million events returned over a 24h window). For example the dataset will have the following fields
What I am working on is to join these 2 data sets based on the machine_name / host name. I am then specifically looking for specific software and doing a match if found. A simple join here would be something like (the following example shows 1 eval, but this could conceivably be 2,3 or more depending on the requirement)
index=assets | dedup host_name | rename host_name AS machine_name
| join machine_name
[search index=software | fields matchine_name sw_installed]
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table machine_name jobinstalled
If I then change this search to be a little more specific
index=assets host_name=mysrv1 | dedup host_name | rename host_name AS machine_name
| join machine_name
[search index=software | fields matchine_name sw_installed]
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table machine_name jobinstalled
I get maybe 1 result or maybe a couple of nothing at all. I know the reason why and this is due to hitting the maximum of 50000 returned events in my subsearch (the finished job drop down shows subsearch produced 50000 results truncating to maxout 50000. If I configure with a max=0 I run into the similar problem and again my results are skewed
index=assets host_name=mysrv1 | dedup host_name | rename host_name AS machine_name
| join machine_name max=0
[search index=software | fields matchine_name sw_installed]
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table machine_name jobinstalled
If I write the following SPL
index=software machine_name=mysrv1
| dedup sw_installed
| table sw_installed machine_name
that search will produce for example 90 events for the software installed on the server. If I run the following search for example
index=assets host_name=mysrv1 | dedup host_name
I will get the 1 events as intended. How do I fix my search so that I can get around this constraint?
I ran this but found that the search takes a phenomenal time. To reduce the search time I ended up adjusting as follows
(index=assets host_name=mysrv1) OR (index=software machine_name=mysrv1)
| eval host_name=coalesce(host_name,machine_name)
| eventstats dc(index) as dc_index by host_name
| where dc_index=2
| stats values(sw_installed) as sw_installed by host_name
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table host_name,jobinstalled
This means that the search takes almost no-time to run (I get a result almost instantly). However without the second "machine_name=mysrv1", the search takes a good couple of minutes. I am intending to put this in a dashboard where someone uses a pick list to select a "parent" which will show related children and then have this generate relevant tables. The problem is the time between someone selecting it and it running executing. I might try and see if I can possibly token the pick list and add this to the code somehow.
why can't you use Splunk Time input?
I am using the SPLUNK time picker and have this set at 24 hours. I also substituted the time picker with an explicit "earliest=-24h latest=now" in the SPL and still takes a long time. The run time is 132 seconds (just over 2 minutes). The search without To return the results almost immediately I still have to add that secondary constraint in the second index (i.e. machine_name=mysrv1). The run time when adding this additional constraint is 2.2 seconds.
One potential workaround I am thinking may be useful is to seperate my search and basically (particularly from a dashboard view)
if you are depending on dashboard heavily and you can't wait long time. I would advice you to create subset using only required fields. This will improve your search performance as well.
The most efficient and effective method while joining two big datasets is to use coalesce and this will do trick:
(index=assets host_name=mysrv1) OR (index=software)
| eval host_name=coalesce(host_name,machine_name)
| eventstats dc(index) as dc_index by host_name
| where dc_index=2
| stats values(sw_installed) as sw_installed by host_name
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table host_name,jobinstalled