Solved: Performance Expectations

cloudharmony · ‎11-04-2011

I'm running a search against about 1.2 million log records. Each record contains some geo tags and numeric values representing performance metrics. There are about 45 key/value pairs per record, including the following:

id: the service id
type: the service type
testId: the type of test (e.g. latency, throughput)
region: the user's geographical region
median: the median performance metric value
ip: the user's IP address

The search query I'm running calculates a 90th percentile median performance value grouped by service id within a specific geographical region, service type and test ID. Here is an example query:

type="CDN" (testId="tl" OR testId="l") region="us" | eventstats perc90(median) as median90 | where median <= median90 | stats mean(median) as mean median(median) as median stdev(median) as stdev avg(stdDev) as avg_stdev count(median) as num_tests dc(ip) as num_ips by id | eval rel_stdev=100*(stdev/median) | table id, mean, median, avg_stdev, stdev, rel_stdev, num_tests, num_ips | sort median

To my disappointment, this query takes about 5 minutes to run to completion on a fairly high-end dedicated server (quad-core X5570 2.93 GHz, 128 GB memory, RAID 0 15K SAS + SSD cache) and much longer on the new hosted Splunk Storm service. My question is whether this level of performance should be expected for this amount of data and this type of search query. Are there any optimizations that could be made at index or search time to improve performance? Is there a significant performance hit when applying | stats or | eventstats to a search? I've been using Splunk for 5 days now, so any help would be greatly appreciated.
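
As a reading aid, here is a minimal Python sketch of what the pipeline in the question appears to compute: one global 90th-percentile cutoff, a filter on that cutoff, then per-id statistics sorted by median. It models the logic only; it is not SPL and says nothing about how Splunk executes the search. The nearest-rank percentile, the function names, and the dict-shaped records are assumptions for illustration. Splunk's perc90 may compute the cutoff differently, and the avg(stdDev) column is left out for brevity.

```python
import math
import statistics
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile (an assumption; perc90 in SPL may differ)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records):
    """records: list of dicts with 'id', 'median' and 'ip' keys (hypothetical shape)."""
    # Models "| eventstats perc90(median) as median90": one cutoff for all events.
    cutoff = nearest_rank_percentile([r["median"] for r in records], 90)
    # Models "| where median <= median90".
    kept = [r for r in records if r["median"] <= cutoff]

    # Models "| stats ... by id".
    groups = defaultdict(list)
    for r in kept:
        groups[r["id"]].append(r)

    rows = []
    for service_id, group in groups.items():
        values = [r["median"] for r in group]
        med = statistics.median(values)
        # Sample standard deviation; needs at least two values.
        stdev = statistics.stdev(values) if len(values) > 1 else 0.0
        rows.append({
            "id": service_id,
            "mean": statistics.mean(values),
            "median": med,
            "stdev": stdev,
            "rel_stdev": 100 * stdev / med if med else None,
            "num_tests": len(values),
            "num_ips": len({r["ip"] for r in group}),
        })
    # Models "| sort median".
    return sorted(rows, key=lambda row: row["median"])
```

Note that the cutoff is computed once over all matching events, not per id, so a service whose values are mostly above the global cutoff loses most of its tests.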

gkanapathy · ‎11-04-2011

A search like that across that amount of data on that hardware should take something closer to 30 seconds on a single-instance Splunk system, even assuming your testId and region filters match basically all 1.2 million records.

If you're way off from that, I would try a couple of things first:

If you're running in the web UI/timeline, turn off "Field discovery"
Better yet, try running in the "Advanced Charting" view (under the "Views" menu)
Run in the "Advanced Charting" view, hiding the chart and turning off the "Preview" checkbox.
Even better, try running on the command line with the "-preview false" option.

Just for reference, when I do a slightly smaller (just over 1,000,000 events) and somewhat simpler search than yours on a three-plus year old laptop, I go from taking about 130 seconds to about 70 (doubling the speed) when I turn off field discovery, and then down to 30 seconds (another doubling) in the "Advanced Charting" view (with preview still on), and down to 25 seconds on the command line without preview.

I don't actually see any obvious improvements that can be made to your query while keeping the same results. However, I would be curious as to how it runs if you try each of the following:

Omit the eventstats and where clauses near the beginning
Omit the sort at the end
Omit the eventstats, where, and sort clauses
And just for kicks, run it with only the base search plus the eventstats and where clauses.

It would also be helpful to know the final number of results returned as well as the scan count. The information in the "Inspect Search Job" page (under the "Actions" menu on the timeline search view) would be useful too, though maybe a bit obscure.

One thing to note is that a single-instance Splunk is not able to take advantage of your hardware when running a single search. I would say that if you're trying to get this to run faster, you could probably run three, four, or even more Splunk instances in a distributed config on that same machine to better utilize it, but that setup takes a bit of work and knowledge to get right.

Finally, seeing a few lines of your data might indicate something, though if it's CDN access logs, that's unlikely.

UPDATE:
Try this and see how it compares. Be sure to turn preview off:

type="CDN" (testId="tl" OR testId="l") region="us" | eval median = if( median <= [ search type="CDN" (testId="tl" OR testId="l") region="us" | stats perc90(median) as search ], median,null()) | stats mean(median) as mean median(median) as median stdev(median) as stdev avg(stdDev) as avg_stdev count(median) as num_tests dc(ip) as num_ips by id | eval rel_stdev=100*(stdev/median) | table id, mean, median, avg_stdev, stdev, rel_stdev, num_tests, num_ips | sort median

Also, what I suspect is happening is that the eventstats is taking a long time to finalize, i.e., the actual computation is getting done pretty quickly, but marking up the set of intermediate results is taking a long time. If you are okay with having only an approximate 90th percentile, rather than an exact one, try:

type="CDN" (testId="tl" OR testId="l") region="us" | eval median = if( median <= [ search type="CDN" (testId="tl" OR testId="l") region="us" | head 9999 | stats perc90(median) as search ], median,null()) | stats mean(median) as mean median(median) as median stdev(median) as stdev avg(stdDev) as avg_stdev count(median) as num_tests dc(ip) as num_ips by id | eval rel_stdev=100*(stdev/median) | table id, mean, median, avg_stdev, stdev, rel_stdev, num_tests, num_ips | sort median

which should only look at the most recent 9999 events to compute the 90th percentile, rather than scanning all 1.2 million events. This should be a lot faster than the previous version, though the results will differ because the cutoff is only approximate.
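
To make the exact-versus-approximate trade-off concrete, here is a small Python sketch. It is a model only: the nearest-rank percentile and the function names are assumptions, and slicing the first N values stands in for "| head 9999" on a list ordered newest first.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile (an assumption; perc90 in SPL may differ)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def percentile_cutoff(values_newest_first, pct=90, sample_size=None):
    """Exact cutoff when sample_size is None, else a cutoff from the first N values."""
    if sample_size is None:
        sample = values_newest_first
    else:
        sample = values_newest_first[:sample_size]  # stands in for "| head N"
    return nearest_rank_percentile(sample, pct)


# Hypothetical data where the newest events happen to carry the largest values.
values = list(range(100, 0, -1))
exact = percentile_cutoff(values)
approx = percentile_cutoff(values, sample_size=10)
```

With this deliberately skewed data the sampled cutoff lands well above the exact one, which shows the risk: a sample of only the most recent events is a good stand-in only when recent values resemble the rest of the data.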

gkanapathy · ‎11-07-2011

Just updated again. Made a mistake. Basically, I forgot to remove eventstats from the subsearch and replace it with stats. I believe the suggested changes should run that query in about 2 minutes in the GUI, and about 40 seconds on CLI. (Basically, double the time of the version using stats without eventstats.)

gkanapathy · ‎11-07-2011

I will update my answer with a suggestion on something to try to improve performance, but I do not know if it will help. (I believe it will help if you have a distributed/multi-indexer Splunk system, but I don't know about a single node.) As for pre-indexing specific fields, retrieval is not really the problem here, and there isn't currently anything that will help with that. If you need to do this over time on new and larger data sets, however, you can and should use summary indexing to pre-compute results over subsets of the data, so that you can get the full results faster.
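
The idea of pre-computing results over subsets and combining them later can be sketched in Python. This is a conceptual model under stated assumptions, not how Splunk's summary indexing is implemented: the Partial class and summarize_interval function are hypothetical names, and each subset is reduced to a count, a sum, and a sum of squares.

```python
import math
from dataclasses import dataclass


@dataclass
class Partial:
    """Mergeable partial aggregate for one subset of the data (hypothetical)."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, value):
        self.count += 1
        self.total += value
        self.total_sq += value * value

    def merge(self, other):
        # Combining two subsets only needs their stored sums, not the raw events.
        return Partial(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    def mean(self):
        return self.total / self.count

    def stdev(self):
        # Sample standard deviation recovered from the stored sums.
        if self.count < 2:
            return 0.0
        variance = (self.total_sq - self.total ** 2 / self.count) / (self.count - 1)
        return math.sqrt(max(variance, 0.0))


def summarize_interval(values):
    """Reduce one interval's values to a Partial."""
    partial = Partial()
    for value in values:
        partial.add(value)
    return partial
```

Means, counts, and standard deviations merge exactly this way. Medians and percentiles do not: they cannot be rebuilt from per-subset sums, so a summary that must support them has to keep more information per subset.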

cloudharmony · ‎11-05-2011

Thanks. Here are the stats for the searches you recommended:

GUI with preview/field discovery on: ~5 minutes
GUI with preview/field discovery off without eventstats and where clause: 54 seconds
Without eventstats, where or sort clause: 47 seconds

CLI as-is: 1:54
CLI without eventstats or where: 19 seconds
CLI without eventstats, where or sort: 19 seconds
CLI with only eventstats and where: 1:42

So, the big hit is for eventstats. Is there a better way to do a 90th percentile filter? Is there some way to pre-index some of the fields we will commonly search on?
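
The CLI timings above give a rough estimate of how much of the run is spent in the eventstats and where stages. The short Python calculation below uses only the "CLI as-is" and "CLI without eventstats or where" figures from this post; the helper name is hypothetical, and single-run timings like these are only indicative.

```python
def to_seconds(mmss):
    """Convert an 'M:SS' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


cli_full = to_seconds("1:54")          # CLI as-is
cli_without_eventstats = 19            # CLI without eventstats or where
eventstats_cost = cli_full - cli_without_eventstats
share = eventstats_cost / cli_full

print(f"eventstats + where: ~{eventstats_cost} s ({share:.0%} of the run)")
# prints: eventstats + where: ~95 s (83% of the run)
```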

gkanapathy · ‎11-04-2011

Oh, I forgot something important. The GUI and various settings in it can make a huge difference. I'm editing my answer above to cover this.

sdwilkerson · ‎11-04-2011

Cloudharmony,

There are a few things to consider that might perk up your result times:

Start your search using indexed fields (e.g. sourcetype, source, host, and/or index) to prevent Splunk from having to waste time looking at irrelevant data
If this is a query you will perform often, create a summary search to run at some set interval (e.g. 10 min) then report across the summary data.

Beyond the above, yes, you do take performance hits with various Splunk analytic commands, and there is some guidance in the Splunk docs on how to reduce them.

Sean
