<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to optimize my lookup search with large amounts of data? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371601#M109353</link>
    <description>&lt;P&gt;Hey!&lt;BR /&gt;
It worked great for the IPs, thanks so much. Unfortunately, the hostname search is still extremely slow, but I think that's just because the data set is so much bigger; we have far more proxy logs than firewall logs.&lt;/P&gt;</description>
    <pubDate>Tue, 04 Apr 2017 13:38:37 GMT</pubDate>
    <dc:creator>JetteBra</dc:creator>
    <dc:date>2017-04-04T13:38:37Z</dc:date>
    <item>
      <title>How to optimize my lookup search with large amounts of data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371599#M109351</link>
      <description>&lt;P&gt;I'm currently collecting IoCs (IPs and domain names) and want to run searches against my historical log data to find infected computers.&lt;BR /&gt;
Currently I'm putting the data into two lookup files, structured like this:&lt;BR /&gt;
Dangerous,Dangerous2&lt;BR /&gt;
IP,IP&lt;BR /&gt;
IP2,IP2&lt;BR /&gt;
And the same for the domains.&lt;/P&gt;

&lt;P&gt;This is the search I'm running for the IPs:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=*Logdata for analysis* sourcetype=*Firewall* | lookup Dangerous_Ips.csv Dangerous AS dest_ip OUTPUT Dangerous2 AS dest_host2
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;This works fine: the new field dest_host2 includes all the matches, and I have verified this by adding known IPs from the log data. However, the searches seem to take a long time, and I'm not sure if that's due to my non-optimized search or just the sheer amount of log data.&lt;BR /&gt;
My goal is to search through the last 7 days, once each day. However, that will be about 168,000,000 log rows on average, and when I tried it just now, searching the first day alone took about 2 hours. Time is not critical to me, but 10-12 hours for one search feels a bit too long.&lt;/P&gt;

&lt;P&gt;Is there anything I can do with my lookup search or the data layout in my lookup file, or do I just need to accept that it will take a lot of time?&lt;/P&gt;</description>
      <pubDate>Thu, 23 Mar 2017 15:59:39 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371599#M109351</guid>
      <dc:creator>JetteBra</dc:creator>
      <dc:date>2017-03-23T15:59:39Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize my lookup search with large amounts of data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371600#M109352</link>
      <description>&lt;P&gt;What are you planning to do with the data? If you are just producing summary reporting, then processing the same data 7 times is not efficient. Determine what you are going to do with the data, and &lt;CODE&gt;collect&lt;/CODE&gt; a summary index at only the lowest level of granularity that will get you that reporting.&lt;/P&gt;
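
&lt;P&gt;For example, a nightly scheduled search along these lines could roll each day's firewall traffic up once, so the 7-day report only has to read the much smaller summary. (This is a sketch: the summary index name &lt;CODE&gt;firewall_summary&lt;/CODE&gt; and the daily time window are illustrative, and the summary index must already exist.)&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=*Logdata for analysis* sourcetype=*Firewall* earliest=-1d@d latest=@d
| stats count by dest_ip
| collect index=firewall_summary
&lt;/CODE&gt;&lt;/PRE&gt;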

&lt;P&gt;If you only want the records out of your search that match the csv, then a join might be more effective than a lookup.  Test this...&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt; index=*Logdata for analysis* sourcetype=*Firewall* | join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip,  Dangerous2 as dest_host2]
&lt;/CODE&gt;&lt;/PRE&gt;
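
&lt;P&gt;As a sketch of an alternative (using the same lookup file as above, no new fields assumed): because &lt;CODE&gt;join&lt;/CODE&gt; runs a subsearch that is subject to result limits, you can often get the same filtering from the &lt;CODE&gt;lookup&lt;/CODE&gt; you already have by keeping only the events where the lookup produced a match:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=*Logdata for analysis* sourcetype=*Firewall*
| lookup Dangerous_Ips.csv Dangerous AS dest_ip OUTPUT Dangerous2 AS dest_host2
| where isnotnull(dest_host2)
&lt;/CODE&gt;&lt;/PRE&gt;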

&lt;P&gt;If the dangerous IPs are only a small percentage of your data, and the &lt;CODE&gt;dest_ip&lt;/CODE&gt; field is an indexed field (or at least a field extracted at search time), then you can let Splunk do the work of extracting/eliminating records before you do the &lt;CODE&gt;lookup&lt;/CODE&gt; or &lt;CODE&gt;join&lt;/CODE&gt;.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;  index=*Logdata for analysis* sourcetype=*Firewall* 
[|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip | table dest_ip | format] 
| join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip,  Dangerous2 as dest_host2]
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The &lt;CODE&gt;format&lt;/CODE&gt; command takes the list of &lt;CODE&gt;dest_ip&lt;/CODE&gt; values and produces from it search code that looks like &lt;CODE&gt;(dest_ip="XXXX" OR dest_ip="XXX" OR dest_ip="XXXX" ....)&lt;/CODE&gt;. That output can be reformatted to do other types of searches, but the default behavior is probably what you want to cut your search time. If &lt;CODE&gt;dest_ip&lt;/CODE&gt; were going to be extracted with a &lt;CODE&gt;rex&lt;/CODE&gt; or otherwise calculated, then you would use a more complicated &lt;CODE&gt;format&lt;/CODE&gt; command, possibly followed by a &lt;CODE&gt;regex&lt;/CODE&gt;, to produce an efficient search.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Mar 2017 17:52:24 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371600#M109352</guid>
      <dc:creator>DalJeanis</dc:creator>
      <dc:date>2017-03-23T17:52:24Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize my lookup search with large amounts of data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371601#M109353</link>
      <description>&lt;P&gt;Hey!&lt;BR /&gt;
It worked great for the IPs, thanks so much. Unfortunately, the hostname search is still extremely slow, but I think that's just because the data set is so much bigger; we have far more proxy logs than firewall logs.&lt;/P&gt;</description>
      <pubDate>Tue, 04 Apr 2017 13:38:37 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371601#M109353</guid>
      <dc:creator>JetteBra</dc:creator>
      <dc:date>2017-04-04T13:38:37Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize my lookup search with large amounts of data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371602#M109354</link>
      <description>&lt;P&gt;Suggestion: if you can't create a new summary index, then see if you can do a metadata-only or tstats-style search to meet your needs.&lt;/P&gt;

&lt;P&gt;Your goal is to "find infected computers", so you don't necessarily care at what exact time in the last 7 days they were infected, etc.&lt;/P&gt;

&lt;P&gt;Assuming that you have indexed fields src_ip and dest_ip, try this - &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| tstats values(src_ip) as src_ip 
    where index=*Logdata for analysis* AND sourcetype=*Firewall* 
    AND earliest=-7d@d AND latest=-0d@d 
    by dest_ip
| join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip,  Dangerous2 as dest_host2]
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;That should be relatively quick in giving you one record per dest_ip that is in Dangerous, with a deduped list of the src_ips that have been associated with that dest_ip in the timeframe under consideration.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Sep 2020 13:27:28 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-optimize-my-lookup-search-with-large-amounts-of-data/m-p/371602#M109354</guid>
      <dc:creator>DalJeanis</dc:creator>
      <dc:date>2020-09-29T13:27:28Z</dc:date>
    </item>
  </channel>
</rss>

