I am trying to summarize network traffic logged by our firewall to determine the factors that have made our index usage exceed estimates by about 50%.
The queries that I am running take several days, so I am hoping that there are some optimizations that might help speed them up.
There are a lot of data (hundreds of millions of log records per day), so I am asking a lot of Splunk to summarize
Indexing logs do not go far enough back to let me do analysis on them. I also have a datamodel that does not go far enough back.
So I cannot take advantage of the faster access that these models can provide
I have already run queries to determine 3 points in time (a week each) establishing an increase in log record volume
Here is what I am doing
index=pan_logs
| eval loc=case(match(host,"fw_loc1*"), "Loc1", match(host,"fw_loc2.*"), "Loc1", match(host,"fw_loc3.*"), "Loc2", match(host,"fw_loc4.*"), "Loc3", match(host,"fw_loc5.*"), "Loc4", match(host,"fw_loc6.*"), "Loc5", match(host,"fw_loc7.*"), "Loc5", match(host,"fw_loc8.*"), "Loc5", match(host,"fw_loc9.*"),"test", match(host,"fw_locA.*"),"external", match(host,"fw_locB.*"),"external", 1=1,host)
| eval istraffic=if(match(eventtype,"pan_traffic"), "Y", null), isurl=if(match(eventtype,"pan_url"), "Y", null), isvuln=if(match(log_subtype,"vulnerability"),"Y",null), trafficDir=src_zone."-".dest_zone
| stats count dc(host) values(host) count(istraffic) count(isurl) count(isvuln) dc(trafficDir) by loc
The idea is to try and determine if a group of sources is adding to the volume and then understand why. The trafficDir is a quick measure of whether the increase might be related to increased firewall complexity.
Does anything here look like I could make it more efficient?
There are several options, once DMA is out of the picture.
First, since you have hundreds of millions of events per day and aren't looking for a 100.000% precise answer, you could enable event sampling. A 1:1000 sampling would still look at hundreds of thousands of events per day, giving you a reasonably precise answer - especially when looking for big contributors to volume, they're bound to pop up strongly in sampled events. Ask your friendly neighbourhood statistician about how precise 😛
Second, at least part of your search is based around host - this can be made blazingly fast using tstats
:
| tstats prestats=t count dc(host) values(host) where index=pan_logs by host
| eval loc=case(match(host,"fw_loc1"), "Loc1", match(host,"fw_loc2"), "Loc1", match(host,"fw_loc3"), "Loc2", match(host,"fw_loc4.*"), "Loc3", match(host,"fw_loc5"), "Loc4", match(host,"fw_loc6"), "Loc5", match(host,"fw_loc7"), "Loc5", match(host,"fw_loc8"), "Loc5", match(host,"fw_loc9"),"test", match(host,"fw_locA"),"external", match(host,"fw_locB"),"external", 1=1,host)
| stats count dc(host) values(host) by loc
Note, I've removed the .*
from the regexes because they contribute no value to match()
.
The tstats
should be fast enough to run without sampling, increasing your confidence in the results you got from the sampled raw data search.
If these two approaches combined are enough, great.
If not, there is a bit more magic you can attempt with tstats
- it can match on indexed terms. Depending on the definitions of those eventtypes, deciding whether they match or not might just require looking for a certain indexed term, letting you count by eventtype (sort of) without actually needing the raw data.