I'm currently collecting IoCs (IPs and domain names) and want to run searches against my historical log data to find infected computers.
I'm putting the data into two lookup files, structured like this:
Dangerous,Dangerous2
IP,IP
IP2,IP2
And the same for Domain.
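For illustration, the IP lookup laid out that way might look like this (the addresses below are made-up placeholders, not real IoCs):
Dangerous,Dangerous2
203.0.113.10,203.0.113.10
198.51.100.24,198.51.100.24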
This is the search I'm running for the IPs:
index=*Logdata for analysis* sourcetype=*Firewall* | lookup Dangerous_Ips.csv Dangerous AS dest_ip OUTPUT Dangerous2 AS dest_host2
This works fine: the new field dest_host2 contains all the matches, and I have verified this by adding known IPs from the log data. However, the searches take a long time, and I'm not sure whether that's due to my non-optimized search or simply the volume of log data.
My goal was to search through the last 7 days each day. That is, on average, about 168,000,000 log rows, and when I tried it, searching just the first day took about 2 hours. Time is not critical to me, but 10-12 hours for one search feels a bit too long.
Is there anything I can do with my lookup search or with the data layout in my lookup file, or do I just need to accept that it will take a lot of time?
What are you planning to do with the data? If you are just producing summary reporting, then processing the same data 7 times is not effective. Determine what you are going to do with the data, and collect a summary index of only the lowest level of granularity that will get you that reporting.
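As a rough sketch of that approach (the summary index name ioc_summary and the src_ip field are assumptions -- adjust to your environment), a scheduled daily search could keep only the matched rows and write them out with collect:
index=*Logdata for analysis* sourcetype=*Firewall*
| lookup Dangerous_Ips.csv Dangerous AS dest_ip OUTPUT Dangerous2 AS dest_host2
| where isnotnull(dest_host2)
| stats count values(src_ip) as src_ip by dest_ip dest_host2
| collect index=ioc_summary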
If you only want the records out of your search that match the csv, then a join might be more effective than a lookup. Test this...
index=*Logdata for analysis* sourcetype=*Firewall* | join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip, Dangerous2 as dest_host2]
If the Dangerous IPs are only a small percentage of your data, and dest_ip is an indexed field -- or at least a field extracted at search time -- then you can let Splunk do the work of extracting/eliminating records before you do the lookup or join.
index=*Logdata for analysis* sourcetype=*Firewall*
[|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip | table dest_ip | format]
| join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip, Dangerous2 as dest_host2]
The format command is going to take the list of dest_ips and produce from it search code that looks like (dest_ip="XXXX" OR dest_ip="XXX" OR dest_ip="XXXX" ...). That output can be reformatted to do other types of searches, but the default behavior is probably what you want to cut your search time. If dest_ip was going to be extracted with a rex or otherwise calculated, then you would use a more complicated format command, possibly followed by a regex, to produce an efficient search.
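For example, here is one way that could look (a sketch only -- the dst= pattern in the rex and the assumption that dest_ip needs a manual extraction are placeholders for whatever your data actually requires). Renaming the lookup column to search makes format emit the bare IP strings rather than dest_ip=... pairs, so the raw events are filtered before the field is extracted:
index=*Logdata for analysis* sourcetype=*Firewall*
    [|inputlookup Dangerous_Ips.csv | rename Dangerous as search | table search | format]
| rex "dst=(?<dest_ip>\d{1,3}(?:\.\d{1,3}){3})"
| join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip, Dangerous2 as dest_host2]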
Hey!
It worked great for the IPs, thanks so much. Unfortunately the hostname search is still extremely slow, but I think that's just because it's so much bigger -- we have far more proxy logs than firewall logs.
Suggestion - if you can't create a new summary index, then see if you can do a metadata-only or tstats-style search to meet your host needs.
Your goal is "find infected computers" - so you don't necessarily care what exact time in the last 7 days they were infected etc.
Assuming that you have indexed fields src_ip and dest_ip, try this -
| tstats values(src_ip) as src_ip
where index=*Logdata for analysis* AND sourcetype=*Firewall*
AND earliest=-7d@d AND latest=-0d@d
by dest_ip
| join [|inputlookup Dangerous_Ips.csv | rename Dangerous as dest_ip, Dangerous2 as dest_host2]
That should be relatively quick in giving you one record per dest_ip that is in Dangerous, with a deduped list of the src_ips that have been associated with that dest_ip in the timeframe under consideration.
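A similar sketch for the proxy/domain side, assuming you also have an indexed dest_host field and a Dangerous_Domains.csv built the same way as the IP lookup (the *Proxy* sourcetype and those names are assumptions about your environment):
| tstats values(src_ip) as src_ip
where index=*Logdata for analysis* AND sourcetype=*Proxy*
AND earliest=-7d@d AND latest=-0d@d
by dest_host
| join [|inputlookup Dangerous_Domains.csv | rename Dangerous as dest_host, Dangerous2 as dest_host2]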