Assistance with combining / joining on 2 large dat...

willadams · ‎09-08-2020

I have 2 large data sets

Data Set 1 (Assets) contains information about devices. For example the dataset will have the following fields

hostname
ipaddress
status
parent

Data Set 2 (Software) contains software installed for those devices (I did a check on the number of events and there is somewhere around 4.5 million events returned over a 24h window). For example the dataset will have the following fields

machine_name
sw_installed
installed_date

What I am working on is to join these 2 data sets based on the machine_name / host name. I am then specifically looking for specific software and doing a match if found. A simple join here would be something like (the following example shows 1 eval, but this could conceivably be 2,3 or more depending on the requirement)

index=assets | dedup host_name | rename host_name AS machine_name
| join machine_name
[search index=software | fields matchine_name sw_installed]
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table machine_name jobinstalled

If I then change this search to be a little more specific

index=assets host_name=mysrv1 | dedup host_name | rename host_name AS machine_name
| join machine_name
[search index=software | fields matchine_name sw_installed]
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table machine_name jobinstalled

I get maybe 1 result or maybe a couple of nothing at all. I know the reason why and this is due to hitting the maximum of 50000 returned events in my subsearch (the finished job drop down shows subsearch produced 50000 results truncating to maxout 50000. If I configure with a max=0 I run into the similar problem and again my results are skewed

index=assets host_name=mysrv1 | dedup host_name | rename host_name AS machine_name
| join machine_name max=0
[search index=software | fields matchine_name sw_installed]
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table machine_name jobinstalled

If I write the following SPL

index=software machine_name=mysrv1
| dedup sw_installed
| table sw_installed machine_name

that search will produce for example 90 events for the software installed on the server. If I run the following search for example

index=assets host_name=mysrv1 | dedup host_name

I will get the 1 events as intended. How do I fix my search so that I can get around this constraint?

willadams · ‎09-08-2020

I ran this but found that the search takes a phenomenal time. To reduce the search time I ended up adjusting as follows

(index=assets host_name=mysrv1) OR (index=software machine_name=mysrv1)
| eval host_name=coalesce(host_name,machine_name)
| eventstats dc(index) as dc_index by host_name
| where dc_index=2
| stats values(sw_installed) as sw_installed by host_name
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table host_name,jobinstalled

This means that the search takes almost no-time to run (I get a result almost instantly). However without the second "machine_name=mysrv1", the search takes a good couple of minutes. I am intending to put this in a dashboard where someone uses a pick list to select a "parent" which will show related children and then have this generate relevant tables. The problem is the time between someone selecting it and it running executing. I might try and see if I can possibly token the pick list and add this to the code somehow.

thambisetty · ‎09-09-2020

why can't you use Splunk Time input?

————————————
If this helps, give a like below.

willadams · ‎09-09-2020

I am using the SPLUNK time picker and have this set at 24 hours. I also substituted the time picker with an explicit "earliest=-24h latest=now" in the SPL and still takes a long time. The run time is 132 seconds (just over 2 minutes). The search without To return the results almost immediately I still have to add that secondary constraint in the second index (i.e. machine_name=mysrv1). The run time when adding this additional constraint is 2.2 seconds.

One potential workaround I am thinking may be useful is to seperate my search and basically (particularly from a dashboard view)

Do my first search and output to a CSV using an outputlookup
Then run my second search against the CSV

thambisetty · ‎09-09-2020

if you are depending on dashboard heavily and you can't wait long time. I would advice you to create subset using only required fields. This will improve your search performance as well.

————————————
If this helps, give a like below.

thambisetty · ‎09-08-2020

The most efficient and effective method while joining two big datasets is to use coalesce and this will do trick:

(index=assets host_name=mysrv1) OR (index=software)
| eval host_name=coalesce(host_name,machine_name)
| eventstats dc(index) as dc_index by host_name
| where dc_index=2
| stats values(sw_installed) as sw_installed by host_name
| eval jobinstalled=if(match(sw_installed,"jobsw"),"Yes","No")
| table host_name,jobinstalled

————————————
If this helps, give a like below.

Assistance with combining / joining on 2 large data sets

join

subsearch

Stay Connected: Your Guide to November Tech Talks, Office Hours, and Webinars!

Transform your security operations with Splunk Enterprise Security

Splunk Admins and App Developers | Earn a $35 gift card!