Is it possible to have warm data stored partially on the local indexers' storage and partially on remote storage?
My total retention period is 90 days, but I would like to keep data for the first 45 days on the local indexers as warm buckets and then move the warm buckets to an AWS S3 bucket for another 45 days.
I am using the settings below, but the data is sent to S3 after 90 days and stored as frozen rather than warm or cold. How can I keep warm or cold data in S3?
[default]
remotePath = volume:remote_store/$_index_name
repFactor = auto
frozenTimePeriodInSecs = 7776000
If I change frozenTimePeriodInSecs to 3888000 (45 days), the data will be sent to S3 after 45 days, but it will be sent as frozen buckets. I need to send the data as either warm or cold buckets.
If I set up
[volume:local_store]
maxVolumeDataSizeMB = 1000000
then data will be moved to S3 only after 1 TB of local storage has been filled.
How can I maintain time-based storage, with 45 days on local storage and 45 days on remote storage?
"smartstore" and "AWS S3" are not the same thing. SmartStore (S2) is a Splunk feature that separates storage from compute. It relies on storage providers that follow the S3 standard, but does not have to use AWS. AWS S3 is just one provider of on-line storage, but it is not suitable for storing live Splunk indexes.
The reason S2 gets away with using AWS S3 for storing indexes is because it keeps a cache of indexed data local to the indexers. It's this cache that serves search requests; any data not in the cache has to be fetched from AWS before it can be searched.
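For reference, here is a minimal sketch of the SmartStore volume that the remotePath setting in the question refers to; the bucket name, region, and endpoint below are placeholders, not values from the original post:
# indexes.conf (SmartStore sketch)
[volume:remote_store]
storageType = remote
path = s3://my-smartstore-bucket/indexes
remote.s3.endpoint = https://s3.us-east-1.amazonaws.com

[default]
remotePath = volume:remote_store/$_index_name
repFactor = auto
frozenTimePeriodInSecs = 7776000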
Note that S2 stores ALL warm data. There is no cold data with S2.
All warm data remains where it is until it rolls to cold or frozen. There is no way for warm buckets to reside in one place for part of their lifetime and somewhere else for the rest.
Freezing data either deletes it or moves it to an archive. Splunk cannot search frozen data.
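If the goal is to archive frozen data rather than delete it, that is configured per index; a minimal sketch, with a placeholder index name and path:
# indexes.conf (frozen archive sketch)
[my_index]
# archive frozen buckets here instead of deleting them; frozen data is not searchable
coldToFrozenDir = /mnt/archive/my_index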
To keep data local for 45 days and remote for 45 days would mean having a hot/warm period of 45 days and a cold period of 45 days. Note that each period is measured based on the age of the newest event in the bucket rather than when the data moved to each respective storage tier.
Data moves from warm to cold based on size rather than time, so you would have to configure buckets so each holds about a day of data and then size the warm volume so it holds 45 days of buckets. Configure the cold volume to be remote. Splunk will move buckets to the cold volume as the warm volume fills up. Data will remain in cold (remote) storage until it expires (frozenTimePeriodInSecs = 7776000, i.e. 90 days).
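A rough sketch of that layout in indexes.conf; the index name, paths, and sizes are illustrative, and this assumes the cold volume path is remote storage mounted on the indexer (for example via NFS or an S3 file gateway), since coldPath must be a filesystem path rather than a native S3 URL:
# indexes.conf (time-based warm/cold split sketch)
[volume:local_store]
path = /opt/splunk/var/lib/splunk
# size this to roughly 45 days of daily buckets, plus headroom
maxVolumeDataSizeMB = 500000

[volume:remote_cold]
# a mount point backed by remote storage
path = /mnt/remote_cold

[my_index]
homePath = volume:local_store/my_index/db
coldPath = volume:remote_cold/my_index/colddb
thawedPath = $SPLUNK_DB/my_index/thaweddb
# roll hot buckets roughly daily so each bucket spans about one day of data
maxHotSpanSecs = 86400
# total retention: 90 days (45 local + 45 remote)
frozenTimePeriodInSecs = 7776000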
So here is my understanding and the way that I've got our on-prem instance configured.
Hot buckets are stored on a local flash array. When a bucket closes, Splunk keeps the closed bucket on the flash array and writes a copy to S3 storage. The S3 copy is considered the 'master copy'. I try not to use the term 'warm bucket', but instead 'cached bucket'. All searches are performed on hot or cached buckets on the local flash array. Cached buckets are eligible for eviction from local storage by the cache manager, so if your search needs a bucket that is not in local storage, it will evict eligible cached buckets, retrieve the needed buckets from S3, and then perform the search.
The frozenTimePeriod defines our overall retention time. We use hotlist_recency_secs to define when a cached bucket becomes eligible for eviction; that is, buckets younger than hotlist_recency_secs are not eligible for eviction. Our statistics show that roughly 90% of our queries span 7 days or less (search gosplunk.com for a query to measure this). Thus, by setting hotlist_recency_secs to 14 days, we ensure that most searched buckets are on local, searchable storage without having to reach out to the S3 storage (which is slower).
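A minimal sketch of that tuning in indexes.conf; the index name and retention value are placeholders, and hotlist_recency_secs can also be set globally or under [cachemanager] in server.conf:
# indexes.conf (cache eviction tuning sketch)
[my_index]
remotePath = volume:remote_store/$_index_name
# overall retention (example: 1 year)
frozenTimePeriodInSecs = 31536000
# keep recently written buckets in the local cache for ~14 days
# before the cache manager considers them for eviction
hotlist_recency_secs = 1209600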
One last thing: we need 1 year of searchable retention, but we also need to keep 30 months of total retention. To accomplish this, I use ingest actions to write a copy of the events to S3 storage. Ingest actions write the events in compressed JSON format, partitioned by year, month, day, and sourcetype.
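For illustration only, an ingest actions S3 destination is declared in outputs.conf roughly like the sketch below; the stanza name, bucket, and endpoint are placeholders, the routing rules themselves are usually built in the Ingest Actions UI, and the exact setting names should be checked against the outputs.conf spec for your Splunk version:
# outputs.conf (ingest actions destination sketch)
[rfs:long_term_archive]
path = s3://my-archive-bucket/splunk-events
remote.s3.endpoint = https://s3.us-east-1.amazonaws.com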
Hope this helps.