Reporting

Why won't reports with high data cardinality get much out of report acceleration?

Path Finder

Splunk "Manage report acceleration" manual specifies the following:

In addition, you can have reports that technically qualify for report acceleration, but which may not be helped much by it. This is often the case with reports with high data cardinality--something you'll find when there are two or more transforming commands in the search string and the first transforming command generates many (50k+) output rows. For example:

index=* | stats count by id | stats avg(count) as avg, count as distinct_ids

My question is: why won't acceleration help much in this case? And if the first transforming command generates more than 50k output rows, should I not be using report acceleration at all?

Thank you.

1 Solution

Splunk Employee

Hi kiril123--

To answer this question, it's probably helpful to revisit what report acceleration does. Say you have an unaccelerated search that runs over the last month, and when it does, it searches through an average of 500k events to return maybe 300 events, which means it has low cardinality--only a few events from the original set are returned. This search takes a while to complete because it has to look through all 500k events to find those 300 events.

So you accelerate that search. This starts a search running in the background. It runs on a schedule using the same criteria as the original search, and builds a summary of just the results the original search was looking for. This summary will be far smaller than 500k events--it may contain just a few thousand events at any given time. The next time you run your accelerated search, it runs against that summary. Because the summary is far smaller than the original search set, the search should complete much faster than it did before.
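For reference, report acceleration is usually enabled through the report's Edit Acceleration dialog in Splunk Web, but it ultimately corresponds to settings in savedsearches.conf. A rough sketch (the stanza name and search string here are made up for illustration):

[example_status_report]
search = index=web sourcetype=access_combined | stats count by status
auto_summarize = 1
auto_summarize.dispatch.earliest_time = -1mon@mon

The auto_summarize.dispatch.earliest_time setting controls how far back the summary reaches, which, as described above, directly affects how large the summary grows.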

Ok, now imagine you have a second unaccelerated search that, like the first search, runs over the past month, and when you run it, it also searches through an average of 500k events. But this search has far higher cardinality than the first search: each time you run it, it matches at least 50k events, maybe more. This second search is slow to complete as well, because it also has to look through 500k events each time it runs.

However, when you accelerate the second search, it ends up with a much larger summary than the first search. This is because the second search has high cardinality. Each time its background search runs, it adds at least 50k events to the summary. If the summary has a range of 3 months, that's an average of 150k events at any given time. When you run the accelerated search, it runs against that summary. It will complete faster than it did before, but not that much faster, because it's still running over a lot of events. Its report acceleration summary just isn't much of a summary.

So the lesson here is--when you must search across large volumes of data, try to design low cardinality searches--searches that return significantly fewer events than the total amount of events searched. Then, when you accelerate them, you'll get searches that are actually accelerated.
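To make the contrast concrete, here is a sketch of a low-cardinality search next to a high-cardinality one (index and field names are hypothetical):

index=web | stats count by status

index=web | stats count by session_id

The first groups millions of events into a handful of distinct status codes, so its summary stays tiny and acceleration pays off. The second produces one row per session, so its summary is nearly as large as the data it summarizes, and acceleration buys little.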

I'll see if I can rewrite the documentation to clarify this point.


SplunkTrust

"Should not be using report acceleration" is not a correct interpretation of the documentation. The high cardinality (10%) search is just not going to get as much of a boost out of it as a search with lower cardinality might have gotten. Whether you can (and should) spare the CPU cycles to do the acceleration is up to you.

Efficient search design is highly data dependent. The exact same search in one installation might be fine, while in another installation it dims the lights due to the data involved. Report acceleration is one strategy, summary indexing is another. Tuning the search itself is another. Using tstats instead of stats is another. All those options ought to be on the table, the last one first, when you are looking to speed up your searches.
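As an illustration of the tstats option: when the fields you group by are indexed fields (or live in an accelerated data model), a stats search can often be rewritten with tstats, which reads index-time data instead of scanning raw events. A sketch, assuming a hypothetical index and grouping on the indexed field host:

index=web | stats count by host

can often be replaced with the much faster

| tstats count where index=web by host

Note that tstats only operates on indexed fields or data model accelerations, so it is not a drop-in replacement for every stats search.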


Path Finder

Thank you for the great answer.

You have covered two cases:

  1. low cardinality--only a few events from the original set are returned >> very good candidate for report acceleration.

  2. high cardinality--a lot of events from the original set are returned >> technically possible, but not a very good candidate for report acceleration.

What about a third case, where you have high cardinality followed by low cardinality? Something like the search below:

index=test | stats x | stats y

Where "stats x" returns high-cardinality data and "stats y" returns low-cardinality data.

Does report acceleration make sense in this case?

Splunk Employee

At the end of the day, the only way to know for sure is to accelerate the report and see if you get a performance improvement that makes the cost of doing the acceleration* worth it. This guideline in the docs is mainly there to explain why some report accelerations give you larger performance gains than others.

*Cost = the cost of having a scheduled search run in the background to build the summary, and the cost in terms of disk storage space for the summary itself.

Super Champion

My first thought is that acceleration may not help much with that specific search string, because in the end it returns just two values: avg and distinct_ids. If you took off the second stats command, I think it could be useful and the acceleration would still work.

I've linked the doc you're referring to so that anyone else reading this can reference.
http://docs.splunk.com/Documentation/Splunk/7.0.0/Knowledge/Manageacceleratedsearchsummaries
