Getting Data In

What is the disadvantage of having a lot of small buckets and rotating them frequently?

Contributor

So I understand that the minimum timespan on a hot bucket is 1 hour, but bucket sizing defaults to a file size instead of a timespan. The documentation also warns that setting bucket sizes too small will yield "too many buckets". The implicit guidance seems to be toward larger buckets and fewer of them. However, this seems counterintuitive, since having lots of small buckets would seem to imply that less searching is required.

Am I missing something? What is the disadvantage of having lots of small buckets and rotating them frequently besides file count? Do you lose compression, do the tsidx files go crazy and eat all the disk, what actually happens? Has anyone gone against the grain and implemented small bucketing with fast rotation?

1 Solution

Legend

I think that you may have a misconception. Smaller buckets do not imply less searching, but just the opposite. Here is an example.

For the first index, assume that you are indexing 1GB buckets. The index has 25 buckets that contain data from the past 30 days, so each bucket contains data for a bit over one day on average. You run a search that covers the past 24 hours. Splunk quickly identifies the bucket or two - depending on when the buckets last rolled - that contains the data from the past 24 hours. Now Splunk looks at 1 or 2 sets of tsidx files to locate the data.

For the second index, assume that you are indexing 100MB buckets. The index has approximately 250 buckets from the past 30 days, or approximately 8 buckets per day. Running the same search over the past 24 hours means that Splunk will first identify the approximately 8 to 10 buckets that could contain the data. Now Splunk must examine 8-10 sets of tsidx files to locate the data.

So the second index requires approximately 5x as many tsidx file searches for the same time range. In both cases, once the tsidx files have been searched, the actual data retrieval should take about the same time.
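As a rough sanity check, the arithmetic in the example can be sketched in Python. The function name and the assumption that events are spread evenly across buckets are mine, purely illustrative; Splunk's actual bucket selection depends on each bucket's real time range.

```python
import math

def buckets_scanned(total_buckets, retention_days, window_days):
    """Estimate how many buckets a time-bounded search must examine,
    assuming events are spread evenly across buckets."""
    buckets_per_day = total_buckets / retention_days
    # A search window can straddle one extra bucket boundary,
    # depending on when the buckets last rolled.
    return math.ceil(buckets_per_day * window_days) + 1

# First index: 25 x 1GB buckets over 30 days, 24-hour search.
large = buckets_scanned(total_buckets=25, retention_days=30, window_days=1)
# Second index: ~250 x 100MB buckets over 30 days, same search.
small = buckets_scanned(total_buckets=250, retention_days=30, window_days=1)
print(large, small)  # roughly 2 vs 10 sets of tsidx files
```

Under these assumptions the small-bucket index examines about five times as many sets of tsidx files for the same time range, which matches the example above.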

For smaller bucket sizes, the ratio of rawdata to index files will be worse as well, although I don't think it will be as significant as the search impact.

Personally, my rule of thumb is to size the buckets so that they contain approximately 1 day's worth of data for low-volume indexes. For high-volume indexes, I generally use the "auto_high_volume" setting of 10GB. (A "high-volume" index is one that receives 10GB or more per day.) The documentation in indexes.conf.spec says that a reasonable number ranges anywhere from 100MB to 50GB; that's a pretty wide range.
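A minimal indexes.conf sketch of that rule of thumb might look like the following. The index names and sizes are hypothetical examples, not recommendations for any particular environment.

```ini
# Low-volume index (~1 GB/day): size buckets to roughly one day of data.
# maxDataSize is expressed in MB when given as a number.
[low_volume_idx]
maxDataSize = 1000

# High-volume index (>= 10 GB/day): let Splunk cap buckets at 10 GB.
[high_volume_idx]
maxDataSize = auto_high_volume
```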

Finally, you should probably consider your search patterns. In my experience, the vast majority of user searches are for a time range of 24 hours or less. But reporting searches often cover the last 30 days. Your search patterns could be different.


Explorer

We're in the process of setting maxDataSize because some hot buckets are growing too large. We only have hot and warm storage; we're still working our way toward having some sort of cold storage.

As of today we control the size of the index with maxTotalDataSizeMB (i.e., the max size per indexer, taking into consideration the number of replicas) and frozenTimePeriodInSecs, but the end result is that when buckets are frozen, huge chunks of data go away (30 days' worth in some cases).
 
My doubt here is: should we mess with the maxHotBuckets and maxWarmDBCount settings, since we are going to have a lot of one-day buckets instead of fat buckets that span multiple days? Or should we follow the mantra DON'T EDIT UNLESS YOU'RE TOLD TO?
 
Another question: to set maxDataSize, is the method to take the ingestion-per-day number and divide it by the number of indexers in the cluster? Since the forwarders load-balance across all of the indexers, this seems like the most reasonable approach to take.
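The division described above can be sketched as simple arithmetic. This is a minimal sketch with hypothetical ingest and cluster numbers; replication traffic and uneven load-balancing would shift the real figure.

```python
# Hypothetical numbers: 120 GB/day of total ingest, 6-indexer cluster.
daily_ingest_gb = 120
num_indexers = 6

# Forwarders load-balance across all indexers, so each peer sees
# roughly an equal share of the daily ingest.
per_indexer_gb_per_day = daily_ingest_gb / num_indexers

# maxDataSize in indexes.conf is expressed in MB when set numerically,
# so a one-day bucket per indexer would be roughly:
max_data_size_mb = int(per_indexer_gb_per_day * 1000)
print(max_data_size_mb)  # 20000
```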

Explorer

@soutamo could you kindly share your thoughts on this matter? 


SplunkTrust
Hi

As you can see from, e.g., @iguinn2's answer, this is not a simple question, so don't expect a simple answer unless you are happy with "it depends" 😉

Without SmartStore, I also use auto and auto_high_volume and try to keep bucket sizes at approximately one day of data. But if/when you take SmartStore into use, your only option for those indexes is auto. Otherwise, I like the thoughts that @iguinn2 and @ltrand have shared.
r. Ismo

Explorer

Hi,

I understand that it depends on the ingestion rate and the search patterns, so, for the most part, I'm happy with "it depends" @soutamo 😉

The grey area for me is whether I should compensate for the increase in the number of buckets that may result from adjusting to one-day buckets. Is it general guidance to change the default values for maxHotBuckets and maxWarmDBCount?


Legend

I would be curious to know what others think about this as well.


Contributor

Thanks for the response. I notice the majority of my searches cover the past 1 hour, with the past 24 hours being the next most common. After that it's mostly reporting on long-term data, but I'm not sure what the actual measurable impact is. For example, is it 1 second for each tsidx file, plus an additional 100 I/O operations for each file open/close?

I was thinking that a 1-hour bucket would be best because that's where most searching and alerting happens. I'd like to have 24 hours of data on SSD, but if buckets span 24 hours, they won't roll to cold (HDD) until the oldest event is 48 hours old, and then it's a large operation. My thinking is that with 1-hour buckets strictly enforced, rolling would occur every hour and it would only take 25 hours of storage.
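If the goal is to roll hot buckets on a time span rather than only on size, indexes.conf does have a maxHotSpanSecs setting. A hedged sketch follows (the index name is hypothetical, and indexes.conf.spec advises against values below 3600):

```ini
# Hypothetical sketch: roll hot buckets after roughly one hour of event
# time, keeping the size cap as a backstop. 3600 is the smallest span
# the spec file recommends.
[raw_events_idx]
maxHotSpanSecs = 3600
maxDataSize = auto
```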

From a search performance perspective, what search time do people consider acceptable, especially over 30 days? Currently we're set to 10GB buckets, and searches on the largest datasets are measured in minutes for large reports, sometimes hours. We could improve that with summary indexing, but those summaries are built cumulatively on the most recent data (less than 24 hours), not the older data, so I would think that having less data in the most recent buckets would be critical.

So what I see in my mind is large buckets for summary indexes (holding a day of data for large sources, or more for smaller sources), whereas raw events would go into one-hour buckets for smoother rolling, keeping buckets sized to the searches. Is that really what the guidance is about: keeping the fewest buckets and events within the most common search timespan and balancing out the 70/20/10% demands?

0 Karma

Legend

IMO you are thinking about the right things. You may also want to consider the overall size of your index, as well as the bucket size. How many events in a typical 30-day search? A 24-hour search?

Also, it is not a huge operation to change bucket sizes and experiment. However, if you change the bucket size in indexes.conf, the change will only affect future buckets. Also, you will need to restart Splunk for the changes to take effect.

If you are having search performance issues, I would look at a lot of other things besides just the bucket sizes:
- Which searches are the slowest? Try improving these searches. Typically, there are a few searches that are the worst offenders, while most searches are okay.
- For routine searches, try scheduled searches. Since these run in the background, it doesn't matter as much if they run long. These are particularly good for backing dashboards.
- Avoid summary indexing; use report acceleration if possible. Report acceleration is self-healing and deals with out-of-order event arrival - summary indexing requires manual intervention. Just make your "summary range" for acceleration as small as you reasonably can.

There are also other techniques for making certain kinds of searches run faster. You might consider posting a typical "slow search" to this forum and see what people suggest.

Here is a link to some good information about acceleration in general:
http://docs.splunk.com/Documentation/Splunk/6.1/Knowledge/Aboutsummaryindexing


Contributor

Yeah, I push for and use scheduled searches and report acceleration where possible. I still have cases where summary indexes make sense because the reports change enough in their requirements and output (i.e., yesterday I wanted the top 10, today I want all unique values, tomorrow I may want a daily average; one person wants view x, another wants x plus y) that it makes sense to chop down the raw data and keep a long-running summary.

The bucket management is about making sure the underlying platform doesn't get in the way and jam things up, and about putting the best resources against what my users are doing. I'd really like to RAM-cache hot buckets and roll warm to SSD, then move buckets to 10k-RPM disks for 90 days and attached JBOD for 1 year, but Splunk doesn't allow for it at this time. What I have today is SSD on each indexer, but it's limited in size, so I want to make sure I don't overrun it. Looks like I'll just have to experiment. Thanks!
