A way to improve subsearch performance without changing configuration?

burzynskih
Engager

I am trying to search for data that is in a .csv lookup file and NOT in Splunk. My issue is that my subsearch stops (and returns to the main search) after it processes 10k records, so I can only search within the last 4 hours. I would like to be able to search within the last 24 hours, but I don't want to modify the configuration for the subsearch to handle more than 10k records because I was told that is bad practice.

I have this search query:
| inputlookup macaddress.csv
| eval macaddress=upper(macaddress)
| search NOT [ search sourcetype=DhcpSrvLog index=dhcp source="C:\\Windows\\System32\\DHCP\\DhcpSrvLog*.log" (host="AE-VENOM" OR host="AE-CARNAGE") macaddress=* | fields macaddress ]

I get this message from the job inspector:
[subsearch]: Subsearch produced 10000 results, truncating to maxout 10000.
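
For reference, the cap in that message comes from the [subsearch] stanza in limits.conf on the search head, shown here with its default value. Raising it affects every subsearch on the instance, which is why doing so is generally discouraged:

[subsearch]
# maximum number of results a subsearch may return (the "maxout" in the message above)
maxout = 10000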

I was advised to try a data model to improve performance, so I tried the Network Session > DHCP data model, but I noticed that I still get over 10k records. This makes sense to me: the data model speeds up the search by reading only certain fields rather than whole events, so the number of records stays the same because it is determined by what the filter matches, not by how the data is read.
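
As I understand it, an aggregated query against the model would at least collapse the output to one row per MAC, something like this sketch (assuming the events are CIM-mapped to Network_Sessions; the client MAC may land in dest_mac or src_mac depending on the add-on):

| tstats count from datamodel=Network_Sessions where nodename=All_Sessions.DHCP by All_Sessions.dest_mac
| rename All_Sessions.dest_mac as macaddress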

So my main question is: is there a way for me to force the subsearch to process all of the records without making any configuration changes that might be detrimental to other searches? I know I can change the configuration to allow more records to be processed, but is that a good idea?

1 Solution

martin_mueller
SplunkTrust

Here's how you get rid of the subsearch entirely:

  sourcetype=DhcpSrvLog index=dhcp source="C:\\Windows\\System32\\DHCP\\DhcpSrvLog*.log" (host="AE-VENOM" OR host="AE-CARNAGE") macaddress=*
| stats count by macaddress | eval from_search = 1
| inputlookup append=t macaddress.csv | eval from_lookup = case(isnull(from_search), 1)
| eval macaddress=upper(macaddress)
| stats values(from*) as from* by macaddress
| where from_lookup=1 AND isnull(from_search)

burzynskih
Engager

This is exactly what I needed. Thank you so much!


martin_mueller
SplunkTrust

Have you tried the search with stats? As posted, your search doesn't deduplicate MACs. Depending on your data, reducing the subsearch output from "number of events" to "number of unique MACs" might be enough to stay under the limit.


martin_mueller
SplunkTrust

What are you trying to achieve?

Find all MACs that are present in the lookup but not in the indexed data?
The reverse?
Find MACs present in both?

To improve the search as-is, try appending a stats:

| inputlookup macaddress.csv
| eval macaddress=upper(macaddress)
| search NOT [ search sourcetype=DhcpSrvLog index=dhcp source="C:\\Windows\\System32\\DHCP\\DhcpSrvLog*.log" (host="AE-VENOM" OR host="AE-CARNAGE") macaddress=*
    | stats count by macaddress
    | fields macaddress ]
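
To gauge whether that stays under the 10k cap, a quick distinct count over the intended 24-hour range shows how many unique MACs the subsearch would return:

sourcetype=DhcpSrvLog index=dhcp source="C:\\Windows\\System32\\DHCP\\DhcpSrvLog*.log" (host="AE-VENOM" OR host="AE-CARNAGE") macaddress=*
| stats dc(macaddress) as unique_macs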

burzynskih
Engager

"I am trying to search for data that is in a .csv lookup file and NOT in Splunk." so yes I am trying to find all MACs that are present in the lookup but not the indexed data. Thanks for the optimization advice, but my main issue is that my query returns incorrect information because the subsearch maxes out after 10k records (and I have 50k to process for a 24 hour timespan). I can easily change the configuration file to raise the maximum number of records the subsearch processes, but I was told that doing so could be detrimental. I was wondering if there is another way for me to force the subsearch to process all files or should I just change the configuration to raise the max number of records processed to 50k.
