<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Lookup filtering performance on very large lookups in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/Lookup-filtering-performance-on-very-large-lookups/m-p/630364#M218996</link>
<description>Splunk Search topic: Lookup filtering performance on very large lookups - a tip on filtering large KV store lookups with the inputlookup where clause, shared from a Splunk OnDemand support call.</description>
    <pubDate>Fri, 10 Feb 2023 20:28:31 GMT</pubDate>
    <dc:creator>lindonmorris</dc:creator>
    <dc:date>2023-02-10T20:28:31Z</dc:date>
    <item>
      <title>Lookup filtering performance on very large lookups</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Lookup-filtering-performance-on-very-large-lookups/m-p/630364#M218996</link>
      <description>&lt;P&gt;This is not a question, rather I am sharing something that I discovered with a Splunk OnDemand support call.&lt;/P&gt;
&lt;P&gt;I thought I was a bit of a Splunk pro, but it just goes to show there's always something to learn - and this one was so simple it's a little embarrassing&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":flushed_face:"&gt;😳&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Imagine you have a very large lookup with 5 million+ rows (in my case a KV store, containing an extract from Maxmind DB with some other internal references added).&lt;/P&gt;
&lt;P&gt;If you want to do a CIDR match on this, you set up a lookup definition with Match Type "CIDR(network)".&lt;BR /&gt;&lt;BR /&gt;Now run your query (IP address obfuscated in the example):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;| makeresults 
| eval ip4 = "127.0.0.1"
| lookup maxmind network AS ip4&lt;/LI-CODE&gt;
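&lt;P&gt;(For context, a KV store lookup definition with a CIDR match type is configured in transforms.conf roughly as in the sketch below - the stanza, collection, and field names here are illustrative assumptions, not my actual config.)&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;# transforms.conf (sketch - stanza/collection/field names are assumptions)
[maxmind]
external_type = kvstore
collection = maxmind_geo
fields_list = network, country_name, city_name, postal_code, network_cidr
# CIDR matching on "network" is what lets "| lookup maxmind network AS ip4" do subnet matching
match_type = CIDR(network)&lt;/LI-CODE&gt;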
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For me on Splunk Cloud, this takes around 50 seconds, and I contacted Splunk Support for some assistance in making this viable (50 seconds is just way too high).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I already understood that the issue is that the cidrmatch has to run on every single row. I added a pre-filter to my lookup definition and proved that with the pre-filter it ran much, much faster, but obviously that limited its use to just the filtered rows.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I tried messing around with inputlookup, but couldn't get it any better:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="csharp"&gt;| inputlookup maxmind 
| where country_name="Japan" AND city_name="Osaka"&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This still took the same ~50 seconds to run.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Of course, if I had read the documentation properly, I would have seen that "where" is actually an argument of the inputlookup command itself. Changing this to:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="csharp"&gt;| inputlookup maxmind where country_name="Japan" city_name="Osaka"&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;made all the difference - this now runs in 5-7 seconds, because the where clause is applied as the lookup is read, the same as if you had added the pre-filter to the lookup definition.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To use this in a search to enrich other data, you can use:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="csharp"&gt;| makeresults 
| eval ip4 = "127.0.0.1"
| appendcols 
    [| inputlookup maxmind where country_name="Japan" city_name="Osaka"]
| where cidrmatch(network, ip4)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Obviously, the tighter you can get your where clause, the faster this runs. If you are using a KV store, you can also use accelerated fields in your lookup to speed things up further; how much this helps will depend entirely on your data and how far you can filter it down to the smallest possible data set before continuing.&lt;/P&gt;
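&lt;P&gt;(As a sketch of the accelerated-fields idea, this is roughly what it looks like in collections.conf - the collection, field, and index names here are illustrative assumptions, not my actual config.)&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;# collections.conf (sketch - collection/field/index names are assumptions)
[maxmind_geo]
field.network = string
field.country_name = string
field.city_name = string
# index the fields you filter on in the inputlookup where clause
accelerated_fields.geo_idx = {"country_name": 1, "city_name": 1}&lt;/LI-CODE&gt;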
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For me, using the same 5 million row KV store and Maxmind data as in my example,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="csharp"&gt;| makeresults 
| eval ip4 = "127.0.0.1"
| appendcols 
    [| inputlookup maxmind where country_name="Japan" city_name="Osaka" postal_code="541-0051" network_cidr="127.0.*"]
| where cidrmatch(network, ip4)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;runs in ~0.179 seconds [using the actual IP address, not the fake one above]. Your mileage may vary, but I hope this helps someone else trying to figure this out.&lt;/P&gt;
&lt;P&gt;I haven't tried the same with a CSV lookup, but I imagine it would be very similar.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2023 20:28:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Lookup-filtering-performance-on-very-large-lookups/m-p/630364#M218996</guid>
      <dc:creator>lindonmorris</dc:creator>
      <dc:date>2023-02-10T20:28:31Z</dc:date>
    </item>
    <item>
      <title>Re: Lookup filtering performance on very large lookups</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Lookup-filtering-performance-on-very-large-lookups/m-p/630377#M218999</link>
      <description>&lt;P&gt;Nice observations.&lt;/P&gt;&lt;P&gt;It would be interesting to know, if you are running a clustered indexer environment, whether a CSV lookup (which can be distributed to the indexers) would improve the run time, since the lookup could then run on the distributed indexers rather than all the data being processed on the search head where the KV store lives.&lt;/P&gt;&lt;P&gt;You are right that the 'where' clause as part of inputlookup is often overlooked, but I have also found it does make a difference, because it filters the data going into the pipeline.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2023 00:54:17 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Lookup-filtering-performance-on-very-large-lookups/m-p/630377#M218999</guid>
      <dc:creator>bowesmana</dc:creator>
      <dc:date>2023-02-10T00:54:17Z</dc:date>
    </item>
  </channel>
</rss>

