<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to optimize my lookup search with large amounts of data? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371601#M109353</link>
    <description>&lt;P&gt;Hey!&lt;BR /&gt;
It worked great for the IPs, thanks so much. Unfortunately, the hostname search is still extremely slow, but I think that's just because the data set is so much bigger; we have far more proxy logs than firewall logs.&lt;/P&gt;</description>
    <pubDate>Tue, 04 Apr 2017 13:38:37 GMT</pubDate>
    <dc:creator>JetteBra</dc:creator>
    <dc:date>2017-04-04T13:38:37Z</dc:date>
    <item>
      <title>How to optimize my lookup search with large amounts of data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371599#M109351</link>
      <description>&lt;P&gt;I'm currently collecting IoCs (IPs and domain names) and want to run searches against my historical log data to find infected computers.&lt;BR /&gt;
Currently I'm putting the data into two lookup files, structured like this:&lt;BR /&gt;
Dangerous,Dangerous2&lt;BR /&gt;
IP,IP&lt;BR /&gt;
IP2,IP2&lt;BR /&gt;
And the same for the domains.&lt;/P&gt;

&lt;P&gt;This is the search I'm running for the IPs:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=*Logdata for analysis* sourcetype=*Firewall* | lookup Dangerous_Ips.csv Dangerous AS dest_ip OUTPUT Dangerous2 AS dest_host2
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;This works fine: the new field dest_host2 includes all the matches, and I have verified this by adding known IPs from the log data. However, the searches seem to take a long time, and I'm not sure if that's due to my non-optimized search or just the sheer amount of log data.&lt;BR /&gt;
My goal is to search through the last 7 days, once each day. However, that will be about 168,000,000 log rows on average, and when I tried it just now, searching the first day alone took about 2 hours. Time is not critical to me, but 10-12 hours for one search feels a bit too long.&lt;/P&gt;

&lt;P&gt;Is there anything I can do with my lookup search or the data layout in my lookup file, or do I just need to accept that it will take a lot of time?&lt;/P&gt;</description>
      <pubDate>Thu, 23 Mar 2017 15:59:39 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371599#M109351</guid>
      <dc:creator>JetteBra</dc:creator>
      <dc:date>2017-03-23T15:59:39Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize my lookup search with large amounts of data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371600#M109352</link>
      <description>&lt;P&gt;What are you planning to do with the data? If you are just producing summary reporting, then processing the same data 7 times is not efficient. Determine what you are going to do with the data, and &lt;CODE&gt;collect&lt;/CODE&gt; a summary index at only the lowest level of granularity that will get you that reporting.&lt;/P&gt;
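
&lt;P&gt;For example, a nightly scheduled search along these lines could roll each day's firewall traffic up once, so the 7-day report only has to read the much smaller summary. (This is a sketch: the summary index name &lt;CODE&gt;firewall_summary&lt;/CODE&gt; and the daily time window are illustrative, and the summary index must already exist.)&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=*Logdata for analysis* sourcetype=*Firewall* earliest=-1d@d latest=@d
| stats count by dest_ip
| collect index=firewall_summary
&lt;/CODE&gt;&lt;/PRE&gt;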

&lt;P&gt;If you only want the records out of your search that match the csv, then a join might be more effective than a lookup.  Test this...&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt; index=*Logdata for analysis* sourcetype=*Firewall* | join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip,  Dangerous2 as dest_host2]
&lt;/CODE&gt;&lt;/PRE&gt;
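
&lt;P&gt;As a sketch of an alternative (using the same lookup file as above, no new fields assumed): because &lt;CODE&gt;join&lt;/CODE&gt; runs a subsearch that is subject to result limits, you can often get the same filtering from the &lt;CODE&gt;lookup&lt;/CODE&gt; you already have by keeping only the events where the lookup produced a match:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=*Logdata for analysis* sourcetype=*Firewall*
| lookup Dangerous_Ips.csv Dangerous AS dest_ip OUTPUT Dangerous2 AS dest_host2
| where isnotnull(dest_host2)
&lt;/CODE&gt;&lt;/PRE&gt;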

&lt;P&gt;If the dangerous IPs are only a small percentage of your data, and the &lt;CODE&gt;dest_ip&lt;/CODE&gt; field is an indexed field (or at least a field extracted at search time), then you can let Splunk do the work of extracting/eliminating records before you do the &lt;CODE&gt;lookup&lt;/CODE&gt; or &lt;CODE&gt;join&lt;/CODE&gt;.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;  index=*Logdata for analysis* sourcetype=*Firewall* 
[|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip | table dest_ip | format] 
| join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip,  Dangerous2 as dest_host2]
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The &lt;CODE&gt;format&lt;/CODE&gt; command takes the list of &lt;CODE&gt;dest_ip&lt;/CODE&gt; values and produces from it search code that looks like &lt;CODE&gt;(dest_ip="XXXX" OR dest_ip="XXX" OR dest_ip="XXXX" ....)&lt;/CODE&gt;. That output can be reformatted to do other types of searches, but the default behavior is probably what you want to cut your search time. If &lt;CODE&gt;dest_ip&lt;/CODE&gt; were going to be extracted with a &lt;CODE&gt;rex&lt;/CODE&gt; or otherwise calculated, then you would use a more complicated &lt;CODE&gt;format&lt;/CODE&gt; command, possibly followed by a &lt;CODE&gt;regex&lt;/CODE&gt;, to produce an efficient search.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Mar 2017 17:52:24 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371600#M109352</guid>
      <dc:creator>DalJeanis</dc:creator>
      <dc:date>2017-03-23T17:52:24Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize my lookup search with large amounts of data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371601#M109353</link>
      <description>&lt;P&gt;Hey!&lt;BR /&gt;
It worked great for the IPs, thanks so much. Unfortunately, the hostname search is still extremely slow, but I think that's just because the data set is so much bigger; we have far more proxy logs than firewall logs.&lt;/P&gt;</description>
      <pubDate>Tue, 04 Apr 2017 13:38:37 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371601#M109353</guid>
      <dc:creator>JetteBra</dc:creator>
      <dc:date>2017-04-04T13:38:37Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize my lookup search with large amounts of data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371602#M109354</link>
      <description>&lt;P&gt;Suggestion: if you can't create a new summary index, then see if you can do a metadata-only or tstats-style search to meet your needs.&lt;/P&gt;

&lt;P&gt;Your goal is to "find infected computers", so you don't necessarily care at what exact time in the last 7 days they were infected, etc.&lt;/P&gt;

&lt;P&gt;Assuming that you have indexed fields src_ip and dest_ip, try this - &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| tstats values(src_ip) as src_ip 
    where index=*Logdata for analysis* AND sourcetype=*Firewall* 
    AND earliest=-7d@d AND latest=-0d@d 
    by dest_ip
| join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip,  Dangerous2 as dest_host2]
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;That should be relatively quick in giving you one record per dest_ip that is in Dangerous, with a deduped list of the src_ips that have been associated with that dest_ip in the timeframe under consideration.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Sep 2020 13:27:28 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371602#M109354</guid>
      <dc:creator>DalJeanis</dc:creator>
      <dc:date>2020-09-29T13:27:28Z</dc:date>
    </item>
  </channel>
</rss>

