I am trying to perform a search of our network logs and it seems to be really bogging down our Splunk server. I am trying to get a list of unique IP addresses that are connecting to our VPN appliance over a period of the last 30 days. My search is currently the following:
index=pan_logs earliest=-30d host="10.10.10.10" Destination_IP=22.214.171.124 Rule_Name=VPN sourcetype=pan_traffic | fields Source_IP | table Source_IP | dedup Source_IP sortby Source_IP
If I change the period to the last 3 days, I get results pretty quickly. If I change the period to the last 15 days, the search takes an hour or more, but gets results. If I run it for 30 days, after a while I get:
"Unknown sid" and
"The search job '1443629260.2057' was canceled remotely or expired."
I am fine with the search taking many hours to run if necessary, but I need the results in the end rather than having the search expire. Any suggestions for making the search faster, or for keeping it from expiring, are appreciated.
It might be more efficient to use stats; I think the table/dedup/sort chain can get expensive. Maybe just stats values(Source_IP) would work?
index=pan_logs earliest=-30d host="10.10.10.10" Destination_IP=126.96.36.199 Rule_Name=VPN sourcetype=pan_traffic | stats values(Source_IP) as Source_IPs
We had to resort to running searches on the CLI recently because of how long some searches were taking. Even some of those timed out, so we ended up scripting it to run separate searches over different chunks of time, which might be another option for you: run two 15-day searches or three 10-day searches, then export and combine the results?
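To sketch that chunking idea (the query, chunk boundaries, and CLI flags below are examples I've assumed, not the exact ones from this thread), a small shell loop can print one splunk CLI search per 10-day window; pipe the output to sh once you're happy with the commands:

```shell
# Hypothetical sketch: split the 30-day window into three 10-day chunks and
# generate a separate "splunk search" CLI command per chunk, exported as CSV.
# The time bounds go in the base search, before the first pipe.
plan_chunks() {
  for range in "-30d@d -20d@d" "-20d@d -10d@d" "-10d@d now"; do
    set -- $range
    earliest=$1
    latest=$2
    echo "splunk search 'index=pan_logs earliest=$earliest latest=$latest host=\"10.10.10.10\" Rule_Name=VPN sourcetype=pan_traffic | stats count by Source_IP' -maxout 0 -output csv"
  done
}
plan_chunks   # prints one CLI command per chunk; "plan_chunks | sh" would run them
```

Merging the three CSVs and deduping afterwards is a one-liner with sort -u on the Source_IP column.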
You could take a look at Summary Indexing, or consider saving these IPs in the KV store.
This is just a hunch, because I don't know if I have applied everything correctly here, but I picture a procedure like this: you run a search each day that produces the (deduped) IPs of that day, then put those results in a summary index if they aren't already in there from the last 30 days. This process runs once a day, so it only has to dedup the comparably small number of daily IP addresses, check whether those IPs already exist in the summary index, and add them if not. A search against this summary index then only has to fetch the stored entries without doing any calculation, so the whole process should be faster than your initial search. Summary indexing doesn't count against your license, either.
You could probably implement the same logic with the KV store as well; it might even be more elegant. These are just ideas, though, and I must admit I haven't fully thought them through. But I would strongly suggest changing your approach from a search that scans 30 days of events and recalculates everything each time to one that only processes the change each day brings.
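As a rough SPL sketch of that daily job (the summary index name and the schedule are my assumptions, not something from this thread), the scheduled search collects each day's deduped IPs into a summary index, and the 30-day report then reads only the summary:

```
# Scheduled once a day over the previous day:
index=pan_logs earliest=-1d@d latest=@d host="10.10.10.10" Rule_Name=VPN sourcetype=pan_traffic
| dedup Source_IP
| table Source_IP
| collect index=vpn_ip_summary

# 30-day report, reading only the (much smaller) summary index:
index=vpn_ip_summary earliest=-30d
| dedup Source_IP
| table Source_IP
```

The dedup in the report is still needed because the same IP can be collected on several different days, but it now runs over at most 30 small daily result sets instead of 30 days of raw traffic events.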
Summary indexing would probably work for improving the performance of the query. I don't currently have plans to run it that often, but if that changes, this would be a good solution. Thanks.
The other answers make good points, but they aren't answering your question directly. The way to keep your job from expiring is to open the Job control after you start your search and select Send Job to Background. When you do this, it will ask if you would like to be notified by email when your job completes. Then you just wait for the email and click on the link to see the results!
I think this is the best solution. So far my initial testing shows this allows the job to run without expiring. It also got me looking into the default job timeout value of 10 minutes. My server is getting a little old, so I think I will tinker with the job lifetime value in /etc/system/local/limits.conf for other queries that run a little long. For this post, though, a background job should do it. Thanks.
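For reference, a minimal limits.conf sketch of that tinkering (the 86400 value is just an example; 600 seconds is the documented default TTL for completed search artifacts):

```ini
# etc/system/local/limits.conf under your Splunk install
[search]
# How long completed search artifacts are kept before expiring (default: 600 s)
ttl = 86400
```

A restart (or debug/refresh) is needed for the change to take effect, and note this keeps more search artifacts on disk, so watch your dispatch directory.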
values() is bad mojo if all you're looking for is that list of values. Instead, do this:
index=pan_logs earliest=-30d host="10.10.10.10" Destination_IP=188.8.131.52 Rule_Name=VPN sourcetype=pan_traffic | stats count by Source_IP
Now you get one row per source IP, already sorted. No need to fiddle around with the multi-value result of values(), and it'll be much faster than dedup | fields | sort.
The real performance difference from dedup comes from Splunk's smart search mode: it switches to verbose mode for dedup, extracting all fields, and to fast mode for stats, extracting only the fields the search actually needs.
That won't solve having to look at thirty days' worth of data, see @jeffland's suggestion if you intend to run this search often.
I originally had my query as | stats count by Source_IP, but it still took a long time, and I started to think the counting was unneeded and potentially expensive, so I started looking into the other options.
Counting is a billion times faster than loading the event off disk, so you won't notice any overhead from counting.
Saving on field extraction, however, will be noticeable. I suspect the key performance hog is the sheer number of events loaded, which is what summary indexing or acceleration solves; changing the expiration is just a band-aid.