Splunk Search

Assistance with combining / joining on 2 large data sets

willadams
Contributor

I have 2 large data sets

 

Data Set 1 (Assets) contains information about devices.  For example the dataset will have the following fields

  • hostname
  • ipaddress
  • status
  • parent

 

Data Set 2 (Software) contains software installed for those devices (I did a check on the number of events and there is somewhere around 4.5 million events returned over a 24h window).  For example the dataset will have the following fields

  • machine_name
  • sw_installed
  • installed_date

What I am working on is to join these 2 data sets based on the machine_name / host name.  I am then specifically looking for specific software and doing a match if found.  A simple join here would be something like (the following example shows 1 eval, but this could conceivably be 2,3 or more depending on the requirement)

 

 

index=assets | dedup host_name | rename host_name AS machine_name
| join machine_name
[search index=software | fields matchine_name sw_installed]
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table machine_name jobinstalled

 

 

 

If I then change this search to be a little more specific

 

 

index=assets host_name=mysrv1 | dedup host_name | rename host_name AS machine_name
| join machine_name
[search index=software | fields matchine_name sw_installed]
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table machine_name jobinstalled

 

 

 

I get maybe 1 result or maybe a couple of nothing at all.  I know the reason why and this is due to hitting the maximum of 50000 returned events in my subsearch (the finished job drop down shows subsearch produced 50000 results truncating to maxout 50000.  If I configure with a max=0 I run into the similar problem and again my results are skewed

 

 

index=assets host_name=mysrv1 | dedup host_name | rename host_name AS machine_name
| join machine_name max=0
[search index=software | fields matchine_name sw_installed]
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table machine_name jobinstalled

 

 

 

If I write the following SPL

 

 

index=software machine_name=mysrv1
| dedup sw_installed
| table sw_installed machine_name

 

 

 

that search will produce for example 90 events for the software installed on the server.  If I run the following search for example

 

 

index=assets host_name=mysrv1 | dedup host_name

 

 

 

I will get the 1 events as intended.  How do I fix my search so that I can get around this constraint?  

Labels (2)
0 Karma

willadams
Contributor

I ran this but found that the search takes a phenomenal time.  To reduce the search time I ended up adjusting as follows

(index=assets host_name=mysrv1) OR (index=software machine_name=mysrv1)
| eval host_name=coalesce(host_name,machine_name)
| eventstats dc(index) as dc_index by host_name
| where dc_index=2
| stats values(sw_installed) as sw_installed by host_name
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table host_name,jobinstalled

 

This means that the search takes almost no-time to run (I get a result almost instantly).  However without the second "machine_name=mysrv1", the search takes a good couple of minutes.  I am intending to put this in a dashboard where someone uses a pick list to select a "parent" which will show related children and then have this generate relevant tables.  The problem is the time between someone selecting it and it running executing.  I might try and see if I can possibly token the pick list and add this to the code somehow.

0 Karma

thambisetty
SplunkTrust
SplunkTrust

why can't you use Splunk Time input?

————————————
If this helps, give a like below.
0 Karma

willadams
Contributor

I am using the SPLUNK time picker and have this set at 24 hours.  I also substituted the time picker with an explicit "earliest=-24h latest=now" in the SPL and still takes a long time.  The run time is 132 seconds (just over 2 minutes).  The search without To return the results almost immediately I still have to add that secondary constraint in the second index (i.e. machine_name=mysrv1).  The run time when adding this additional constraint is 2.2 seconds.

 

One potential workaround I am thinking may be useful is to seperate my search and basically (particularly from a dashboard view)

  • Do my first search and output to a CSV using an outputlookup
  • Then run my second search against the CSV

 

0 Karma

thambisetty
SplunkTrust
SplunkTrust

if you are depending on dashboard heavily and you can't wait long time. I would advice you to create subset using only required fields. This will improve your search performance as well.

————————————
If this helps, give a like below.
0 Karma

thambisetty
SplunkTrust
SplunkTrust

The most efficient and effective method while joining two big datasets is to use coalesce and this will do trick:

(index=assets host_name=mysrv1) OR (index=software)
| eval host_name=coalesce(host_name,machine_name)
| eventstats dc(index) as dc_index by host_name
| where dc_index=2
| stats values(sw_installed) as sw_installed by host_name
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table host_name,jobinstalled

 

————————————
If this helps, give a like below.
Get Updates on the Splunk Community!

Announcing Scheduled Export GA for Dashboard Studio

We're excited to announce the general availability of Scheduled Export for Dashboard Studio. Starting in ...

Extending Observability Content to Splunk Cloud

Watch Now!   In this Extending Observability Content to Splunk Cloud Tech Talk, you'll see how to leverage ...

More Control Over Your Monitoring Costs with Archived Metrics GA in US-AWS!

What if there was a way you could keep all the metrics data you need while saving on storage costs?This is now ...