Splunk Search

How do I optimize the performance of a dedup search or prevent the search job from expiring?

Justin
Path Finder

I am trying to perform a search of our network logs and it seems to be really bogging down our Splunk server. I am trying to get a list of unique IP addresses that are connecting to our VPN appliance over a period of the last 30 days. My search is currently the following:

index=pan_logs earliest=-30d host="10.10.10.10" Destination_IP=1.2.3.4 Rule_Name=VPN sourcetype=pan_traffic | fields Source_IP | table Source_IP | dedup Source_IP sortby Source_IP

If I change the period to the last 3 days, I get results pretty quickly. If I change the period to the last 15 days, the search takes an hour or more but gets results. If I run it for 30 days, after a while I get "Unknown sid" and "The search job '1443629260.2057' was canceled remotely or expired."

I am fine with the search taking many hours to run if necessary, but I need the results in the end rather than an expired search. Any suggestions on how to make the search faster or keep it from expiring are appreciated.


martin_mueller
SplunkTrust
SplunkTrust

values() is bad mojo if all you're looking for is that list of values. Instead, do this:

index=pan_logs earliest=-30d host="10.10.10.10" Destination_IP=1.2.3.4 Rule_Name=VPN sourcetype=pan_traffic | stats count by Source_IP

Now you get one row per source IP, sorted already. No need to fiddle around with the multi-value values()... and it'll be much faster than dedup | fields | sort.
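If you don't want the count column in the output, you can drop it after the aggregation; a minimal variant (the stats pass still does the deduplication):

index=pan_logs earliest=-30d host="10.10.10.10" Destination_IP=1.2.3.4 Rule_Name=VPN sourcetype=pan_traffic | stats count by Source_IP | fields - count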

The real performance difference between stats and dedup comes from Splunk's smart search mode: it runs the dedup search in verbose mode, extracting all fields, but runs the stats search in fast mode, extracting only Source_IP.
That still won't solve having to load thirty days' worth of data; see @jeffland's suggestion if you intend to run this search often.


Justin
Path Finder

I originally had my query as | stats count by Source_IP, but it still took a long time, and I started to think the counting was unneeded and potentially expensive. So I started looking into other options.


martin_mueller
SplunkTrust
SplunkTrust

Counting is literally a billion times faster than loading the event off disk, so you won't notice any overhead from counting.

Saving on field extraction, however, will be noticeable. I'd guess the key performance hog is the sheer number of events loaded. That can be solved with summary indexing or acceleration; changing the expiration is just a band-aid.
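As a rough sketch of what the summary-indexing route could look like (the daily schedule and the summary index itself are assumptions, not something set up here), a scheduled search covering one day could use sistats so that summary indexing stores the partial statistics:

index=pan_logs earliest=-1d@d latest=@d host="10.10.10.10" Destination_IP=1.2.3.4 Rule_Name=VPN sourcetype=pan_traffic | sistats count by Source_IP

The 30-day report then runs stats count by Source_IP against the summary index instead of the raw pan_logs events.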


woodcock
Esteemed Legend

The other answers make good points, but they don't answer your question directly. The way to keep your job from expiring is to open the Job menu after you start your search and select Send Job to Background. When you do this, it will ask whether you would like to be notified by email when your job completes. Then you just wait for the email and click the link to see the results!

Justin
Path Finder

I think this is the best solution. So far my initial testing shows it allows the job to run without expiring. It also got me looking into the default job lifetime of 10 minutes. My server is getting a little old, so I think I will tinker with the job lifetime in /etc/system/local/limits.conf for other queries that run long. For this post, though, a background job should do it. Thanks.
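For reference, the job lifetime is the ttl setting under the [search] stanza; something like this should do it (untested on my end, assuming the stock 600-second default):

[search]
# Keep completed search job artifacts for 4 hours instead of the
# shipped default of 600 seconds (10 minutes). Value is in seconds.
ttl = 14400

A longer ttl means job artifacts sit in the dispatch directory longer, so watch disk usage.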


jeffland
SplunkTrust
SplunkTrust

You could take a look at Summary Indexing, or consider saving these IPs in the KV store.

This is just a hunch, because I don't know if I've applied everything correctly here, but I'm thinking of a procedure like this: you run a search each day that gives the (deduped) IPs of that day, then put those results in a summary index if they aren't already in there from the last 30 days. This process runs once a day, so it only has to dedup the comparably small number of daily IP addresses, check whether those IPs exist in the summary index, and add them if not. A search against this summary index only has to fetch those entries without doing any calculation, so the whole thing should be faster than your initial search. Summary indexing doesn't affect your license, either.

You could probably achieve the same logic with the KV store as well; it might even be more elegant. These are just ideas, though, and I must admit I haven't fully thought them through. But I would strongly suggest changing your approach from a search that runs over 30 days of events and does calculations on that amount of data each time to one that captures the change each day brings.
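A rough sketch of that daily search using collect (the summary index name vpn_ips is made up and would have to be created as a summary index first):

index=pan_logs earliest=-1d@d latest=@d host="10.10.10.10" Destination_IP=1.2.3.4 Rule_Name=VPN sourcetype=pan_traffic | stats count by Source_IP | fields Source_IP | collect index=vpn_ips

The 30-day report then reads only the small daily lists:

index=vpn_ips earliest=-30d | stats count by Source_IP

This version skips the "only add if not already there" check; duplicates across days are simply collapsed by the final stats.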

Justin
Path Finder

Summary indexing would probably work for improving the performance of the query. I don't currently have plans to run it that often, but if that changes, this would be a good solution. Thanks.


maciep
Champion

It might be more efficient to use stats. I think table/dedup/sort can get expensive. Maybe just stats values(Source_IP) would work?

index=pan_logs earliest=-30d host="10.10.10.10" Destination_IP=1.2.3.4 Rule_Name=VPN sourcetype=pan_traffic | stats values(Source_IP) as Source_IPs

We had to resort to running searches on the CLI recently because of how long some searches were running. Even some of those timed out, so we ended up scripting it to run separate searches over different chunks of time. That might be another option for you: run two 15-day searches or three 10-day searches and export the results.
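For the chunking, you can pin the windows with earliest/latest so they don't overlap; a sketch of two 15-day halves whose exported results you'd merge afterwards:

index=pan_logs earliest=-30d@d latest=-15d@d host="10.10.10.10" Destination_IP=1.2.3.4 Rule_Name=VPN sourcetype=pan_traffic | stats count by Source_IP
index=pan_logs earliest=-15d@d latest=now host="10.10.10.10" Destination_IP=1.2.3.4 Rule_Name=VPN sourcetype=pan_traffic | stats count by Source_IP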
