I'd like to better understand what behaviors SmartStore will exhibit in my environment and how to manage them. What can I do to prepare my environment for SmartStore?
S2 behaviors in no particular order. I will update this post as new information is learned.
Evictions don’t always seem to show up in MC on the S2 pages. The following search will find them:
index=_internal sourcetype=splunkd source=*splunkd.log action=evictDeletes
Starting in 7.2.4, additional metrics were added to count downloaded bytes. Prior to this version, Splunk was metrics-blind to the (potentially significant) impact a rolling restart has on the network and storage.
During a rolling restart, as each indexer is marked to go down:
The CM begins to reassign primacy for buckets on the indexer that is going down to other indexers.
All buckets on the indexer being restarted are marked for eviction, effectively flushing the cache on that indexer.
As indexers in the cluster are restarted, the others start downloading buckets from S3 to satisfy search requests. Because every indexer that is not restarting will likely request downloads at once, this can take a heavy toll on the local network and storage if they are not prepared for that much data transfer in a short period of time.
SmartStore only allows one indexer at a time to be primary searchable for a bucket and no other indexers are allowed to have copies of that bucket cached. The CM will issue eviction notices to any indexers with copies of that bucket locally. This ensures that only 1 indexer will search that bucket and return results. As a result of this, there is a huge amount of data shuffling and downloading that happens during a full cluster rolling restart.
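To get a feel for why a full rolling restart moves so much data, here is a minimal Python sketch of the behavior described above. It is a toy model, not Splunk's actual logic: the Indexer class, the rolling_restart function, the round-robin reassignment of primacy, and the 750 MB bucket size are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Indexer:
    name: str
    cached: set = field(default_factory=set)  # bucket ids held in the local cache
    downloaded_mb: int = 0                    # MB pulled from the remote store


def rolling_restart(indexers, primary, bucket_mb, searched):
    """Toy model of one full rolling restart.

    primary  -- dict of bucket id -> name of the primary indexer (mutated)
    searched -- bucket ids that searches touch while each peer is down
    Returns the total MB downloaded from the remote store.
    """
    for down in indexers:
        # The restarting indexer's cache is flushed.
        down.cached.clear()
        others = [ix for ix in indexers if ix is not down]
        moved = sorted(b for b, owner in primary.items() if owner == down.name)
        for i, bucket in enumerate(moved):
            # Assumption: primacy is handed out round-robin to the other peers.
            new_owner = others[i % len(others)]
            primary[bucket] = new_owner.name
            # The new primary must download the bucket before it can search it.
            if bucket in searched and bucket not in new_owner.cached:
                new_owner.cached.add(bucket)
                new_owner.downloaded_mb += bucket_mb
    return sum(ix.downloaded_mb for ix in indexers)


# Three indexers, each primary for (and caching) two buckets.
cluster = [Indexer("a", {0, 1}), Indexer("b", {2, 3}), Indexer("c", {4, 5})]
primary = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c"}
total = rolling_restart(cluster, primary, bucket_mb=750, searched=set(range(6)))
print(total)  # 6750: nine downloads for only six cached buckets
```

Even in this tiny model, six cached buckets turn into nine downloads, because buckets that moved to a peer early in the restart get flushed and moved again when that peer restarts in turn.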
Bucket rebalance works more quickly with S2 than without it because the only buckets to rebalance are hot buckets
Added Nov 2019
Weird. Per developer, it's not supposed to work that way. I'll follow up and report back.
Ah, this isn't really the case, but I can see how it might appear this way. There is now only "hot" and "not hot" in terms of a bucket lifecycle in S2. The concept of warm and cold being separate is no longer really a thing.
Hot (read/write) is still replicated based on CM RF/SF settings until it rolls to read-only. At that point one copy of the bucket is uploaded to S3, and the other local copies are marked for deletion by the indexers' cachemanager process.
The cachemanager retrieves read-only buckets from S3 when it needs them to complete a search, and those buckets share the same file system as hot... so make sure your hot/cachemanager filesystem is nice and fast.
I'm not sure I follow. You don't have a choice of WARM or COLD with S2. There is HOT; briefly there is WARM while waiting to upload to remote; and finally there is remote with cached local copies. The entire bucket lifecycle changes.
At least this is my understanding.
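The lifecycle described in the last few replies can be sketched as a small state machine. This is only an illustration of the thread's description, not Splunk code: the state names, event names, and the step function are hypothetical.

```python
from enum import Enum


class BucketState(Enum):
    HOT = "hot"                    # read/write, replicated per RF/SF
    WARM_PENDING_UPLOAD = "warm"   # read-only, briefly waiting to upload
    REMOTE_CACHED = "cached"       # in the remote store, with a local cached copy
    REMOTE = "remote"              # in the remote store only


# (current state, event) -> next state; event names are made up for this sketch.
TRANSITIONS = {
    (BucketState.HOT, "roll"): BucketState.WARM_PENDING_UPLOAD,
    (BucketState.WARM_PENDING_UPLOAD, "upload"): BucketState.REMOTE_CACHED,
    (BucketState.REMOTE_CACHED, "evict"): BucketState.REMOTE,
    (BucketState.REMOTE, "search"): BucketState.REMOTE_CACHED,         # downloaded on demand
    (BucketState.REMOTE_CACHED, "search"): BucketState.REMOTE_CACHED,  # cache hit
}


def step(state, event):
    """Return the next state, or raise ValueError for an impossible transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None


state = BucketState.HOT
for event in ["roll", "upload", "evict", "search"]:
    state = step(state, event)
print(state.name)  # REMOTE_CACHED
```

Note that there is no cold state anywhere in the table, which is the point being made above: once a bucket leaves hot, it is either in the remote store with a cached copy or in the remote store without one.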
Hi David, this is a great session.
Today, one Splunk instance ran into some issues with SmartStore on top of on-prem object storage. It had worked normally since SmartStore was enabled several months ago. Most of the time, the indexing rate per indexer is about 8-10MB/s. But during a spike (not sure how large yet), the indexer processor got stuck and consumed 100% CPU on the indexer. All pipelines were blocked and couldn't recover. The indexing rate dropped to 2MB/s. They restarted the indexer, and it went back to normal with an indexing rate of 16MB/s.
Around 20 minutes before the congestion, the indexer started reporting errors like "DatabaseDirectoryManager - failed to open bucket/waif for bucket to be local through CacheManager".
Their hot buckets are on SSD without RAID.
Any thoughts on this case?
MC showed the major cause was ChillOrFreeze on the indexer. But the total data stored in SmartStore was way below maxGlobalDataSizeMB.
There was a very serious bug in the SmartStore code that caused buckets to be accidentally deleted. See the (absurdly vague) headline regarding "might impact data durability in certain rare ..." in Fixed Issues here:
https://docs.splunk.com/Documentation/Splunk/7.2.4/ReleaseNotes/Fixedissues
In my logs I see "deletes" files being downloaded. What is the deletes file in the bucket used for? Thanks
That file stores the information used to keep events that have had "|delete" run against them in the past from showing up in search.
This is a really good rundown for anyone planning to use S2. Thanks for the summary @davidpaper!
@SloshBurch - We need a best practice wizard in here.
Thanks @woodcock. I hope to tackle SmartStore soon and will revisit this at that time.