I just recently set up SmartStore in a test environment using a single index, and I'm trying to figure out some details of the bucket lifecycle.
So far I know that hot buckets are stored locally, and when they roll to warm, some that are in use or reserved stay local while the warm buckets are uploaded to the S3 remote storage (AWS in my case). When a search needs data that lives in a warm bucket in remote storage, the cache manager copies the bucket from S3 into the indexer's local cache and evicts it when it is no longer needed.
From what I've seen in the documentation, SmartStore warm buckets in S3 will eventually be frozen, and if they are not being archived locally with the coldToFrozenDir setting, they will be deleted. As I got further into the documentation, it says contradictory things: that the lifecycle is only hot to warm, or that warm buckets roll to frozen and are deleted both locally and from remote storage, or that warm buckets on remote storage roll from warm to frozen and are archived on the indexer if archiving is set up.
I don't want to send data to remote storage and then bring it back onto the indexer for archiving locally. To prevent freezing/deletion, I haven't configured any frozen time or max size settings.
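For reference, the settings I'm leaving unset are, as I understand it, frozenTimePeriodInSecs (time-based freezing) and, for SmartStore, maxGlobalDataSizeMB (size-based freezing). A sketch of what explicitly setting them on the index in the example indexes.conf below might look like (the values here are only illustrative, not what I'm running):
[smartstore_test]
#Time-based roll to frozen, in seconds (illustrative value only)
frozenTimePeriodInSecs = 188697600
#SmartStore index-wide size cap in MB; my understanding is that 0 means no limit
maxGlobalDataSizeMB = 0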
On the following page about SmartStore indexing, it talks about hot and warm only and then abruptly says, "Buckets roll to frozen directly from warm," without mentioning anything else on the page about freezing buckets, which made me think freezing is automatic for SmartStore: https://docs.splunk.com/Documentation/Splunk/7.3.1/Indexer/SmartStoreindexing
With the following example indexes.conf, when the buckets roll from hot to warm and are uploaded to AWS S3, will Splunk ever automatically roll them to frozen?
[default]
maxHotSpanSecs = 86400

[volume:s3]
storageType = remote
path = s3://

[smartstore_test]
repFactor = auto
homePath = $SPLUNK_DB/$_index_name/db
coldPath = $SPLUNK_DB/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb
remotePath = volume:s3/$_index_name
If rolling from warm to frozen is inevitable, can SmartStore manage archiving to the S3 bucket?
If you have SmartStore configured and have any suggestions, or examples of things you wish you had known when you first set it up, I would greatly appreciate the input.
Thank you ahead of time and sorry for the long post.
> I don't want to send data to remote storage and then bring it back onto the indexer for archiving locally.
And:
> When a bucket rolls from warm to frozen, the cache manager will download the warm bucket from the indexes prefix within the S3 bucket to one of the indexers; Splunk will then take the path to the bucket and pass it to the coldToFrozen script for archiving, which places the archive in the S3 bucket under the archives prefix.
Can you elaborate a bit on this? At first you mention that you don't want to upload to S3 and then download for archiving locally, but that appears to be how you solved the problem.
I see that the archives go back to S3, so it isn't archiving locally in terms of where the archives get stored, but it is archiving locally in terms of where the work happens (meaning you still pay S3 egress fees, which I thought was the main reason for coming up with a workaround/solution). Wouldn't it be better to just leave the buckets there?
> When archiving is successful, the cache manager will delete the local and remote copies of the warm bucket.
Your data still ends up on S3 eventually, and it would be evicted from the cache for good (presumably no one needs to search it, which is why it's being archived), so what's the benefit?
> SmartStore will roll the buckets to frozen by default unless you set frozen time to 0, which will leave all warm buckets in S3. I didn't want that as a long-term solution
I wonder why. I like the creative approach, but I'm curious about the non-technical value (cost, a special use case, business rules, something else) you get in return for the possibly additional/unnecessary egress fees.
Late to update this, but I resolved my issue. SmartStore will roll the buckets to frozen by default unless you set frozen time to 0, which will leave all warm buckets in S3.
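In indexes.conf terms, that's the frozenTimePeriodInSecs setting; roughly like this on the index stanza (just a sketch of the relevant line):
[smartstore_test]
#Per the behavior described above, 0 leaves warm buckets in S3 indefinitely instead of freezing them
frozenTimePeriodInSecs = 0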
I didn't want that as a long-term solution, but I was able to get SmartStore to handle the archiving using a bash script. (I tried using a Python script and kept running into errors with the bucket copies.)
Here is what I did:
1. Created a SmartStore index
2. Created an S3 bucket called something like splunk-smartstore
3. In the S3 bucket I created prefixes, one for indexes and one for frozen archives
4. Created a coldToFrozen bash script and deployed it as an app on the indexer cluster (wired in via the indexes.conf line sketched after this list)
5. Splunk servers use a role to authenticate to the S3 bucket
6. Once it was all in place, I just sent data to the SmartStore index and it handled the rest
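The wiring for step 4 is just the coldToFrozenScript setting on the SmartStore index, roughly like this (the app name and script path are placeholders for wherever you deploy it):
[smartstore_test]
#Placeholder path - point this at wherever your deployed app puts the script
coldToFrozenScript = /opt/splunk/etc/slave-apps/coldToFrozen_app/bin/coldToFrozen.sh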
I found that with SmartStore indexes on a cluster, the cache manager handles the logic about which server does the archiving of a particular bucket. When a bucket rolls from warm to frozen, the cache manager will download the warm bucket from the indexes prefix within the S3 bucket to one of the indexers; Splunk will then take the path to the bucket and pass it to the coldToFrozen script for archiving, which places the archive in the S3 bucket under the archives prefix. When archiving is successful, the cache manager will delete the local and remote copies of the warm bucket.
Below is the script I used. It was based on someone else's example in another thread, but I modified it for what I needed: I wanted to be able to easily go back to S3 and pull an archive based on specific date ranges, and to add some organization to where things get stored in S3. The script uses the bucket path to create directories under the S3 archives prefix and converts the epoch times in the bucket name to EST. So far this has been working for about two months with no issues or data loss, and disk space has stayed low since only hot buckets and cached warm buckets are on disk.
#!/bin/bash
set -e
set -u
export HTTP_PROXY=http://<Proxy_IP>:<Proxy_Port>/
export HTTPS_PROXY=https://<Proxy_IP>:<Proxy_Port>/
export NO_PROXY=169.254.169.254
bucket="$1"
instance=$(hostname -s)
region=<AWS REGION>
s3bucket=<Smartstore_S3_Bucket>
NOW=$(date +"%Y-%m-%d")
LOG=/opt/splunk/var/log/splunk/coldToFrozen-${NOW}.log
#Gets the index name and warm bucket name from the path passed by Splunk
#(field numbers assume the default $SPLUNK_DB location, /opt/splunk/var/lib/splunk)
index=$(echo "$bucket" | cut -f7 -d"/")
warm=$(echo "$bucket" | cut -f9 -d"/")
#Converts the epoch times from the warm bucket name to EST
#(warm bucket names look like db_<newest event time>_<oldest event time>_<id>)
startEpoch=$(echo "$warm" | cut -f3 -d"_")
endEpoch=$(echo "$warm" | cut -f2 -d"_")
startDate=$(date -d "@$startEpoch" '+%m_%d_%Y')
endDate=$(date -d "@$endEpoch" '+%m_%d_%Y')
#Sets AWS Signature Version - needed for S3 SSE-KMS
aws configure set s3.signature_version s3v4
#Creates log file
touch ${LOG}
echo "bucket to move: " $bucket >> $LOG
#Copies bucket to S3 and logs the output along with timestamps
/usr/bin/aws s3 cp ${bucket} s3://${s3bucket}/frozen/${index}/${startDate}_to_${endDate}/${warm} --recursive --region ${region} 2>&1 | tr "\r" "\n" > >(awk '{print strftime("%Y-%m-%d:%H:%M:%S ") $0}' >> $LOG)
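And for completeness, pulling an archive back out later is just a matter of copying it down from the frozen prefix into the thawed path and rebuilding it, roughly like this (index, dates, and bucket name are placeholders, same style as the script above):
#Restore one archived bucket from the frozen prefix into the thawed path
/usr/bin/aws s3 cp s3://<Smartstore_S3_Bucket>/frozen/<index>/<startDate>_to_<endDate>/<warm bucket name> /opt/splunk/var/lib/splunk/<index>/thaweddb/<warm bucket name> --recursive --region <AWS REGION>
#Rebuild the thawed bucket so Splunk can search it again
/opt/splunk/bin/splunk rebuild /opt/splunk/var/lib/splunk/<index>/thaweddb/<warm bucket name>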