I'm trying to properly tune my indexes. I have one index that receives about 20 GB per day. I was planning to use
maxDataSize=auto_high_volume
but then I started thinking about the fact that I have this data split across 5 indexers (they are 64-bit machines). So, each indexer only receives about 4 GB per day. At that rate, if I used auto_high_volume, which defaults to 10 GB buckets on 64-bit machines, it would take over 2 days to fill a bucket and have it roll. I know the docs say:
* You should use "auto_high_volume" for high volume indexes (such as the main
index); otherwise, use "auto". A "high volume index" would typically be
considered one that gets over 10GB of data per day.
However, I'm wondering if that really means 10 GB of data per indexer per day.
Yes, that is what it means.
However, I would recommend you just use auto_high_volume
anyway, even with 2 or 4 GB/day going into a bucket. The only reason you might want a bucket to roll any earlier would be so that you can archive it away, or otherwise create a backup of the data by copying the warm buckets. However, as of 4.2 and up, you can in fact back up hot buckets by copying only the "rawdata" folder inside the hot bucket (ignore the *.tsidx files). Unlike a backup of a warm bucket, this does require a rebuild to be usable, but it does contain all the data required to rebuild it as of the time of backup.
No, that's not how search works.
Re: preferring smaller buckets. What about time windows? If the normal bucket size ended up giving me roughly 1 bucket per day, wouldn't larger buckets mean I'd have to search deeper into a single bucket for a query where I was looking at a single day of data? On the flip side, opening 10 times as many files doesn't seem like too much more overhead to do when the bottleneck is really reading them.
No, there is never a reason to prefer smaller buckets. There may be reasons to need higher maxHotBuckets, and in general you should always set that to at least 4 and preferably 10. However, data being 5 minutes apart isn't much of a problem, and certainly won't be with 4 hot buckets.
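For reference, a minimal indexes.conf sketch of the settings discussed in this thread might look like the following. The stanza name my_index is only a placeholder, and the sketch assumes the index is already defined elsewhere with its homePath, coldPath, and thawedPath. The value 10 simply follows the "preferably 10" suggestion above; it is not a tested figure for this workload.

```ini
# indexes.conf (sketch; "my_index" is a hypothetical index name)
[my_index]
# Larger buckets: 10 GB on 64-bit machines, per the question above
maxDataSize = auto_high_volume
# Extra hot buckets so data arriving slightly out of time order
# (e.g. 5-minute batch dumps from several servers) has somewhere to go
maxHotBuckets = 10
```

A change like this would normally need to be deployed to every indexer holding the index, since each one applies its own copy of indexes.conf.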
I'm not sure if this affects the answer, but this index has sources from 6 different servers, and they're dumped every 5 minutes instead of live, and they don't all arrive at the same time. So, the times will not always be exactly sequential, but they will always be within 5 minutes of each other. I've read that this is a situation where you might want either smaller buckets and/or a larger maxHotBuckets.
With that extra information, what's your recommendation on both maxHotBuckets and maxDataSize?
Also, is there any concern with search efficiency with how we configure those parameters?