I'm currently collecting IoCs (IPs and domain names) and want to run searches against my historical log data to find infected computers.
I'm putting the data into two lookup files, structured like this:
Dangerous,Dangerous2
IP,IP
IP2,IP2
And the same for Domain.
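For illustration, the IP lookup laid out that way might look like this (the addresses below are made-up placeholders, not real IoCs):
Dangerous,Dangerous2
203.0.113.10,203.0.113.10
198.51.100.24,198.51.100.24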
This is the search I'm running for the IPs:
index=*Logdata for analysis* sourcetype=*Firewall* | lookup Dangerous_Ips.csv Dangerous AS dest_ip OUTPUT Dangerous2 AS dest_host2
This works fine: the new field dest_host2 contains all the matches, and I have verified this by adding known IPs from the log data. However, the searches take a long time, and I'm not sure whether that's due to my non-optimized search or simply the volume of log data.
My goal was to search through the last 7 days each day. That is, on average, about 168,000,000 log rows, and when I tried it, searching just the first day took about 2 hours. Time is not critical to me, but 10-12 hours for one search feels a bit too long.
Is there anything I can do with my lookup search or with the data layout in my lookup file, or do I just need to accept that it will take a lot of time?
What are you planning to do with the data? If you are just producing summary reporting, then processing the same data 7 times is not effective. Determine what you are going to do with the data, and collect a summary index of only the lowest level of granularity that will get you that reporting.
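As a rough sketch of that approach (the summary index name ioc_summary and the src_ip field are assumptions -- adjust to your environment), a scheduled daily search could keep only the matched rows and write them out with collect:
index=*Logdata for analysis* sourcetype=*Firewall*
| lookup Dangerous_Ips.csv Dangerous AS dest_ip OUTPUT Dangerous2 AS dest_host2
| where isnotnull(dest_host2)
| stats count values(src_ip) as src_ip by dest_ip dest_host2
| collect index=ioc_summary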
If you only want the records out of your search that match the csv, then a join might be more effective than a lookup. Test this...
index=*Logdata for analysis* sourcetype=*Firewall* | join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip, Dangerous2 as dest_host2]
If the Dangerous IPs are only a small percentage of your data, and dest_ip is an indexed field -- or at least a field extracted at search time -- then you can let Splunk do the work of extracting/eliminating records before you do the lookup or join.
index=*Logdata for analysis* sourcetype=*Firewall*
[|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip | table dest_ip | format]
| join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip, Dangerous2 as dest_host2]
The format command is going to take the list of dest_ips and produce from it search code that looks like (dest_ip="XXXX" OR dest_ip="XXX" OR dest_ip="XXXX" ...). That output can be reformatted to do other types of searches, but the default behavior is probably what you want to cut your search time. If dest_ip was going to be extracted with a rex or otherwise calculated, then you would use a more complicated format command, possibly followed by a regex, to produce an efficient search.
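For example, here is one way that could look (a sketch only -- the dst= pattern in the rex and the assumption that dest_ip needs a manual extraction are placeholders for whatever your data actually requires). Renaming the lookup column to search makes format emit the bare IP strings rather than dest_ip=... pairs, so the raw events are filtered before the field is extracted:
index=*Logdata for analysis* sourcetype=*Firewall*
    [|inputlookup Dangerous_Ips.csv | rename Dangerous as search | table search | format]
| rex "dst=(?<dest_ip>\d{1,3}(?:\.\d{1,3}){3})"
| join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip, Dangerous2 as dest_host2]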
Hey!
It worked great for the IPs, thanks so much. Unfortunately the hostname search is still extremely slow, but I think that's just because it's so much bigger -- we have far more proxy logs than firewall logs.
Suggestion - if you can't create a new summary index, then see if you can do a metadata-only or tstats-style search to meet your host needs.
Your goal is "find infected computers" - so you don't necessarily care what exact time in the last 7 days they were infected etc.
Assuming that you have indexed fields src_ip and dest_ip, try this -
| tstats values(src_ip) as src_ip
where index=*Logdata for analysis* AND sourcetype=*Firewall*
AND earliest=-7d@d AND latest=-0d@d
by dest_ip
| join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip, Dangerous2 as dest_host2]
That should be relatively quick in giving you one record per dest_ip that is in Dangerous, with a deduped list of the src_ips that have been associated with that dest_ip in the timeframe under consideration.
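A similar sketch for the proxy/domain side, assuming you also have an indexed dest_host field and a Dangerous_Domains.csv built the same way as the IP lookup (the *Proxy* sourcetype and those names are assumptions about your environment):
| tstats values(src_ip) as src_ip
where index=*Logdata for analysis* AND sourcetype=*Proxy*
AND earliest=-7d@d AND latest=-0d@d
by dest_host
| join [|inputlookup Dangerous_Domains.csv | rename Dangerous as dest_host, Dangerous2 as dest_host2]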