I'm running a search against about 1.2 million log records. Each record contains some geo tags and numeric values representing performance metrics. There are a total of about 45 key/value pairs per record, including the following:
The search query I'm running filters to events at or below the 90th percentile of the median performance value, then computes statistics grouped by service ID within a specific geographical region, service type, and test ID. Here is an example query:
type="CDN" (testId="tl" OR testId="l") region="us" | eventstats perc90(median) as median90 | where median <= median90 | stats mean(median) as mean median(median) as median stdev(median) as stdev avg(stdDev) as avg_stdev count(median) as num_tests dc(ip) as num_ips by id | eval rel_stdev=100*(stdev/median) | table id, mean, median, avg_stdev, stdev, rel_stdev, num_tests, num_ips | sort median
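For reference, the pipeline above performs roughly the following computation. This is an illustrative plain-Python sketch, not Splunk code: the record layout and values are invented, and the `perc90` here uses a simple nearest-rank rule that only approximates Splunk's `perc90()`.

```python
# Rough Python equivalent of the SPL pipeline, for illustration only.
# Record layout and values are invented; perc90 uses a nearest-rank
# rule that only approximates Splunk's perc90().
import math
from statistics import mean, median, stdev

records = (
    [{"id": "svc1", "median": 10.0 + i, "stdDev": 1.0, "ip": f"1.1.1.{i}"} for i in range(5)]
    + [{"id": "svc2", "median": 30.0 + i, "stdDev": 2.0, "ip": f"2.2.2.{i}"} for i in range(4)]
    + [{"id": "svc2", "median": 1000.0, "stdDev": 50.0, "ip": "2.2.2.9"}]  # outlier
)

def perc90(values):
    s = sorted(values)
    return s[max(1, math.ceil(0.9 * len(s))) - 1]  # nearest-rank percentile

# eventstats perc90(median) as median90 | where median <= median90
median90 = perc90([r["median"] for r in records])
kept = [r for r in records if r["median"] <= median90]

# stats mean/median/stdev/avg/count/dc ... by id
by_id = {}
for r in kept:
    by_id.setdefault(r["id"], []).append(r)

rows = []
for sid, rs in sorted(by_id.items()):
    meds = [r["median"] for r in rs]
    st = stdev(meds) if len(meds) > 1 else 0.0
    rows.append({
        "id": sid,
        "mean": mean(meds),
        "median": median(meds),
        "stdev": st,
        "avg_stdev": mean(r["stdDev"] for r in rs),
        "rel_stdev": 100 * st / median(meds),
        "num_tests": len(meds),
        "num_ips": len({r["ip"] for r in rs}),
    })

rows.sort(key=lambda r: r["median"])  # | sort median
```

The key structural point is that the percentile filter needs a value computed over the whole result set before any row can be kept or dropped, which is why it appears as an `eventstats` rather than a streaming `where`.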
To my disappointment, this query takes about 5 minutes to run to completion on a fairly high-end dedicated server (quad-core X5570 2.93 GHz, 128 GB memory, RAID 0 15K SAS + SSD cache) and much longer on the new hosted Splunk Storm service. My question is whether this level of performance should be expected for this amount of data and this type of search query. Are there any optimizations that could be made at index or search time to improve performance? Is there a significant performance hit when applying | stats or | eventstats to a search? I've been using Splunk for 5 days now... any help would be greatly appreciated.
A search like that across that amount of data on that hardware should take something closer to 30 seconds on a single-instance Splunk system, even assuming your testId and region filters match basically all 1.2 million records.
If you're way off from that, I would try a couple of things first: turn off field discovery, and run with preview off (or from the command line).
Just for reference, when I do a slightly smaller (just over 1,000,000 events) and somewhat simpler search than yours on a three-plus year old laptop, I go from taking about 130 seconds to about 70 (doubling the speed) when I turn off field discovery, and then down to 30 seconds (another doubling) in the "Advanced Charting" view (with preview still on), and down to 25 seconds on the command line without preview.
I don't actually see any obvious improvements that can be made to your query while keeping the same results. However, I would be curious as to how it runs if you try each of the following:

- without the eventstats and where clauses near the beginning
- without the sort at the end
- without the eventstats, where, and sort clauses
- with only the eventstats and where clauses

It would also be helpful to know the final number of results returned as well as the scan count. The information in the "Inspect Search Job" page (under the "Actions" menu on the timeline search view) would be useful too, though maybe a bit obscure.
One thing to note is that a single-instance Splunk is not able to take advantage of your hardware when running a single search. I would say that if you're trying to get this to run faster, you could probably run three, four, or even more Splunk instances in a distributed config on that same machine to better utilize it, but that setup takes a bit of work and knowledge to get right.
Finally, seeing a few lines of your data might indicate something, though if it's CDN access logs, that's unlikely.
UPDATE:
Try this and see how it compares. Be sure to turn preview off:
type="CDN" (testId="tl" OR testId="l") region="us" | eval median = if( median <= [ search type="CDN" (testId="tl" OR testId="l") region="us" | stats perc90(median) as search ], median,null()) | stats mean(median) as mean median(median) as median stdev(median) as stdev avg(stdDev) as avg_stdev count(median) as num_tests dc(ip) as num_ips by id | eval rel_stdev=100*(stdev/median) | table id, mean, median, avg_stdev, stdev, rel_stdev, num_tests, num_ips | sort median
Also, what I suspect is happening is that the eventstats is taking a long time to finalize; i.e., the actual computation is getting done pretty quickly, but marking up the set of intermediate results is taking a long time. If you are okay with an approximate 90th percentile, rather than an exact one, try:
type="CDN" (testId="tl" OR testId="l") region="us" | eval median = if( median <= [ search type="CDN" (testId="tl" OR testId="l") region="us" | head 9999 | stats perc90(median) as search ], median,null()) | stats mean(median) as mean median(median) as median stdev(median) as stdev avg(stdDev) as avg_stdev count(median) as num_tests dc(ip) as num_ips by id | eval rel_stdev=100*(stdev/median) | table id, mean, median, avg_stdev, stdev, rel_stdev, num_tests, num_ips | sort median
which should only look at the most recent 9999 events to compute the 90th percentile, rather than scanning all 1.2 million. This should be a lot faster than the previous version, though the results will differ slightly.
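As a sanity check on the sampling idea, here is a plain-Python sketch (synthetic, seeded Gaussian data, not Splunk code) showing that a 90th percentile computed from roughly 10K events tracks the value computed from the full set:

```python
# Sketch: a 90th percentile from ~10K events tracks the value from
# the full set. Data is synthetic (seeded Gaussian), for illustration.
import math
import random

def perc90(values):
    s = sorted(values)
    return s[max(1, math.ceil(0.9 * len(s))) - 1]  # nearest-rank percentile

rng = random.Random(42)
population = [rng.gauss(100, 15) for _ in range(120_000)]  # stand-in for the full index
sample = population[:9999]                                 # like "| head 9999"

exact = perc90(population)
approx = perc90(sample)
```

How well this holds for real logs depends on how stable the metric's distribution is over the window that head selects from; if recent events are unrepresentative, the approximation drifts.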
Just updated again. Made a mistake: basically, I forgot to remove eventstats from the subsearch and replace it with stats. I believe the suggested changes should run that query in about 2 minutes in the GUI, and about 40 seconds on the CLI (basically, double the time of the version using stats without eventstats).
I will update my answer with a suggestion on something to try to improve performance, but I do not know if it will help. (I believe it will help on a distributed/multi-indexer Splunk system, but I don't know about a single node.) As for pre-indexing specific fields: retrieval is not really the problem here, and there isn't anything currently that will help. If you need to do this over time, on new and growing data sets, however, you can and should use summary indexing to pre-compute results over subsets of the data, so that you can get the full results faster.
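To make the summary-indexing idea concrete, here is a hedged sketch (plain Python with invented slice names, not Splunk syntax) of the underlying trick: store cheap partial aggregates per time slice, then combine them later to answer questions over any span without rescanning the raw events.

```python
# Sketch of the summary-indexing idea: pre-compute cheap partial
# aggregates (count, sum, sum of squares) per time slice, then combine
# them to recover mean and stdev without rescanning raw events.
# Slice names and values are invented for illustration.
import math

def summarize(values):
    # what a scheduled "summary" search would store per slice
    return {"n": len(values), "s": sum(values), "ss": sum(v * v for v in values)}

def combine(summaries):
    n = sum(p["n"] for p in summaries)
    s = sum(p["s"] for p in summaries)
    ss = sum(p["ss"] for p in summaries)
    m = s / n
    var = (ss - n * m * m) / (n - 1)  # sample variance from combined sums
    return m, math.sqrt(max(var, 0.0))

day1 = summarize([10.0, 12.0, 14.0])
day2 = summarize([11.0, 13.0])
combined_mean, combined_stdev = combine([day1, day2])
```

One honest caveat: counts, sums, means, and stdevs combine this way, but exact percentiles do not — you need the raw values (or an approximation) to merge them, which is precisely why the 90th-percentile filter is the expensive part of this search.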
Thanks. Here are the stats for the searches you recommended:

- GUI with preview/field discovery on: ~5 minutes
- GUI with preview/field discovery off, without the eventstats and where clauses: 54 seconds
- Without the eventstats, where, or sort clauses: 47 seconds
- CLI as-is: 1:54
- CLI without eventstats or where: 19 seconds
- CLI without eventstats, where, or sort: 19 seconds
- CLI with only eventstats and where: 1:42

So, the big hit is for eventstats. Is there a better way to do a 90th percentile filter? Is there some way to pre-index some of these fields we will commonly search on?
Oh, I forgot something important: the GUI and various settings in it can make a huge difference. Editing this above.
Cloudharmony,
There are a few things to consider that might perk up your result times:
Beyond the above, yes, you do take performance hits with various Splunk analytic commands, and there is guidance in the Splunk docs to help improve this.
Sean