Deployment Architecture

maxdatasize tuning with multiple indexers

supersleepwalke
Communicator

I'm trying to properly tune my indexes. I have one index that receives about 20 GB per day. I was planning to use

maxDataSize=auto_high_volume

but then I started thinking about the fact that this data is split across 5 indexers (all 64-bit machines), so each indexer only receives about 4 GB per day. At that rate, if I used auto_high_volume, which defaults to 10 GB buckets, it would take over 2 days to fill a bucket and roll it. I know the docs say

* You should use "auto_high_volume" for high volume indexes (such as the main
  index); otherwise, use "auto".  A "high volume index" would typically be
  considered one that gets over 10GB of data per day.

However, I'm wondering if that really means 10 GB of data per indexer per day.

1 Solution

gkanapathy
Splunk Employee

Yes, that is what it means.

However, I would recommend you just use auto_high_volume anyway, even with only 2 to 4 GB/day going into a bucket. The only reason you might want a bucket to roll earlier would be so that you can archive it away, or otherwise create a backup of the data by copying the warm buckets. However, as of 4.2 and up, you can in fact back up hot buckets by copying only the "rawdata" folder inside the hot bucket (ignoring the *.tsidx files). Unlike a backup of a warm bucket, this does require a rebuild to be usable, but it contains all the data required to rebuild the bucket as of the time of the backup.
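The hot-bucket backup described above amounts to copying just the rawdata folder and skipping the index files. A minimal sketch, using a mock bucket layout under /tmp for illustration (a real hot bucket lives under $SPLUNK_DB/&lt;index&gt;/db/):

```shell
set -eu

# Mock hot bucket: a rawdata folder plus a *.tsidx index file we want to skip
# (the bucket and file names here are illustrative, not real Splunk data)
mkdir -p /tmp/demo_index/db/hot_v1_0/rawdata
touch /tmp/demo_index/db/hot_v1_0/rawdata/journal.gz
touch /tmp/demo_index/db/hot_v1_0/1234-5678-1.tsidx

# Back up only the rawdata folder; the *.tsidx files are omitted and can be
# regenerated later with "splunk rebuild <bucketPath>" before the copy is usable
mkdir -p /tmp/bucket_backup/hot_v1_0
cp -r /tmp/demo_index/db/hot_v1_0/rawdata /tmp/bucket_backup/hot_v1_0/

ls /tmp/bucket_backup/hot_v1_0
```

To restore, you would place the copied bucket under the index's db path and run `splunk rebuild` on it to regenerate the *.tsidx files.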


gkanapathy
Splunk Employee

No, that's not how search works.


supersleepwalke
Communicator

Re: preferring smaller buckets: what about time windows? If the normal bucket size ended up giving me roughly one bucket per day, wouldn't larger buckets mean I'd have to search deeper into a single bucket for a query covering just one day of data? On the flip side, opening 10 times as many files doesn't seem like much extra overhead when the bottleneck is really reading them.


gkanapathy
Splunk Employee

No, there is never a reason to prefer smaller buckets. There may be reasons to need a higher maxHotBuckets, and in general you should always set that to at least 4, and preferably 10. However, data arriving 5 minutes apart isn't much of a problem, and certainly won't be with 4 hot buckets.
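Putting the recommendations in this thread together, the resulting indexes.conf stanza might look like the sketch below. The index name and paths are hypothetical; only the maxDataSize and maxHotBuckets values come from the advice above:

```ini
# indexes.conf -- sketch based on the advice in this thread;
# "my_20gb_index" and its paths are placeholder names
[my_20gb_index]
homePath   = $SPLUNK_DB/my_20gb_index/db
coldPath   = $SPLUNK_DB/my_20gb_index/colddb
thawedPath = $SPLUNK_DB/my_20gb_index/thaweddb

# keep the high-volume auto sizing even at ~4 GB/day per indexer
maxDataSize   = auto_high_volume

# at least 4, preferably 10, to absorb events arriving out of time order
maxHotBuckets = 10
```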


supersleepwalke
Communicator

I'm not sure if this affects the answer, but this index has sources from 6 different servers, and they're dumped every 5 minutes instead of streamed live, and they don't all arrive at the same time. So the timestamps will not always be exactly sequential, but they will always be within 5 minutes of each other. I've read that this is a situation where you might want either smaller buckets and/or a larger maxHotBuckets.

With that extra information, what's your recommendation on both maxHotBuckets and maxDataSize?

Also, is there any concern with search efficiency with how we configure those parameters?
