[SmartStore] Understand Retention with SmartStore?

Splunk Employee

We are planning to migrate to SmartStore and want to understand the retention changes that come with it.

Splunk Employee

Starting with version 7.3, Splunk ensures that bucket copies are not evicted on target indexers after the hot bucket rolls to warm and is uploaded by the source.

However, buckets that are already warm and are being converted to S2 (SmartStore), whether they are uploaded by the source or by the targets, are not evicted on upload, irrespective of the version. Those buckets are evicted only later, when there is cache pressure.

Splunk Employee

As with classic indexer clustering, both clustered and non-clustered S2-enabled environments need retention policies to freeze buckets. In non-S2 environments, a bucket is simply frozen and deleted once it is eligible for freezing. With S2, the bucket has to be deleted both locally and from remote storage, so that all bucket states stay in sync and the bucket is not accidentally reintroduced into the cluster. This is handled by CMMasterRemoteStorageThread, which runs every remote_storage_retention_period (defaults to 15 minutes) to check whether there are buckets in remote storage that need to be frozen in the cluster.

CMMasterRemoteStorageThread runs the retention procedure as follows:

i). It runs a search on all peers to retrieve the list of remote indexes with frozenTimePeriodInSecs, maxGlobalDataSizeMB and maxGlobalRawDataSizeMB information
ii). It then runs a search on all the peers to retrieve the list of the warm buckets that need to be frozen based on the frozenTimePeriodInSecs, maxGlobalDataSizeMB and maxGlobalRawDataSizeMB thresholds
iii). For each bucket, it does the following:

    - Retrieves from the CM the list of peers that have this bucket
    - Assigns the bucket to a peer picked at random from that list

iv). For each peer that has bucket(s) assigned, it posts the list of buckets to the clusterslavecontrol endpoint, i.e. "/services/cluster/slave/control/control/freeze-buckets/".
v). The peer freezes the bucket and sends a notification to the CM via "/services/cluster/master/control/control/freeze-buckets/".
vi). The CM relays the notification to the other peers, which then freeze the bucket on their end. It does this by running a batch_bucket_frozen job, which posts an HTTPS request to the peers via the "/services/cluster/slave/control/control/freeze_buckets_by_list" endpoint.
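
To make steps iii) and iv) concrete, here is a minimal Python sketch of the assignment logic: each bucket goes to one randomly picked peer that holds a copy, and the buckets are then grouped per peer so that each peer gets a single list. This is only an illustration of the description above, not Splunk's actual code; the function name assign_buckets, the peer names, and the second bucket id are made up for the example.

```python
import random
from collections import defaultdict


def assign_buckets(bucket_peers, rng=None):
    """Map each bucket to one randomly chosen peer that holds it.

    bucket_peers: dict of bucket id -> list of peer names holding a copy
                  (hypothetical stand-in for what the CM knows).
    Returns a dict of peer name -> list of bucket ids to freeze on that peer.
    """
    rng = rng or random.Random()
    per_peer = defaultdict(list)
    for bid, peers in bucket_peers.items():
        if not peers:
            continue  # no known copy, so there is nothing to assign
        per_peer[rng.choice(peers)].append(bid)
    return dict(per_peer)


# Hypothetical cluster state: two buckets, each held by two peers.
bucket_peers = {
    "_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC": ["peer1:8089", "peer2:8089"],
    "_internal~698~9EA978AA-B109-473E-A526-F09AEE391FCC": ["peer2:8089", "peer3:8089"],
}

assignments = assign_buckets(bucket_peers, random.Random(0))
for peer, bids in assignments.items():
    # In the real flow, this per-peer list is what gets posted in step iv).
    print(peer, bids)
```

Grouping per peer means one request per peer rather than one per bucket, which matches the "list of the buckets" wording in step iv).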

Retention in S2 is governed by two policies:

a). Size-based retention: This is controlled by maxGlobalDataSizeMB in indexes.conf. If the size of an index across all indexers in a cluster crosses this threshold, the oldest data is frozen. This always takes precedence over time-based retention, i.e. frozenTimePeriodInSecs.

maxGlobalDataSizeMB = <nonnegative integer>
* The maximum amount of local disk space (in MB) that a remote storage
  enabled index can occupy, shared across all peers in the cluster.
* This attribute controls the disk space that the index occupies on the peers
  only. It does not control the space that the index occupies on remote storage.
* If the size that an index occupies across all peers exceeds the maximum size,
  the oldest data is frozen.
* For example, assume that the attribute is set to 500 for a four-peer cluster,
  and each peer holds a 100 MB bucket for the index. If a new bucket of size
  200 MB is then added to one of the peers, the cluster freezes the oldest bucket
  in the cluster, no matter which peer the bucket resides on.
* This value applies to hot, warm and cold buckets. It does not apply to
  thawed buckets.
* The maximum allowable value is 4294967295
* Defaults to 0, which means that it does not limit the space that the index
  can occupy on the peers.
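
The four-peer example from the spec can be worked through with a small Python model. This is a simplified sketch that assumes "oldest" means the smallest latest event time and that freezing continues until the total is back within the limit; the spec excerpt does not spell out either detail, and the bucket names and timestamps are invented.

```python
def buckets_to_freeze(buckets, max_global_mb):
    """Return ids of the oldest buckets to freeze until the index fits the limit.

    buckets: list of (bucket_id, peer, size_mb, latest_time) tuples.
    max_global_mb: 0 means "no limit", mirroring the documented default.
    """
    if max_global_mb == 0:
        return []
    total = sum(size for _, _, size, _ in buckets)
    frozen = []
    # Walk buckets from oldest to newest, regardless of which peer holds them.
    for bid, _, size, _ in sorted(buckets, key=lambda b: b[3]):
        if total <= max_global_mb:
            break
        frozen.append(bid)
        total -= size
    return frozen


# Four peers with one 100 MB bucket each, then a new 200 MB bucket on peer1.
buckets = [
    ("b1", "peer1", 100, 1000),
    ("b2", "peer2", 100, 2000),
    ("b3", "peer3", 100, 3000),
    ("b4", "peer4", 100, 4000),
    ("b5", "peer1", 200, 5000),
]

print(buckets_to_freeze(buckets, 500))  # ['b1']
```

The total goes from 400 MB to 600 MB when b5 arrives, which exceeds 500, so the oldest bucket (b1) is frozen even though the new bucket landed on the same peer only by coincidence.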

b). Time-based retention: This is controlled by frozenTimePeriodInSecs in indexes.conf. Once a bucket is older than frozenTimePeriodInSecs, CMMasterRemoteStorageThread freezes the bucket across the cluster, both locally and on remote storage.

frozenTimePeriodInSecs = <nonnegative integer>
* Number of seconds after which indexed data rolls to frozen.
* If you do not specify a coldToFrozenScript, data is deleted when rolled to
  frozen.
* IMPORTANT: Every event in the DB must be older than frozenTimePeriodInSecs
  before it will roll. Then, the DB will be frozen the next time splunkd
  checks (based on rotatePeriodInSecs attribute).
* Highest legal value is 4294967295
* Defaults to 188697600 (6 years).

Here is the flow, using bucket "_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC" as an example.

i). CMMasterRemoteStorageThread retrieves the list of remote storage indexes, as described in step i) of the procedure above.

05-21-2019 02:24:00.292 +0000 INFO  CMMasterRemoteStorageThread - retrieving remote indexes info with search=| rest services/data/indexes datatype=all f=title f=frozenTimePeriodInSecs f=maxGlobalDataSizeMB f=remotePath f=disabled| search remotePath!="" AND disabled!=1| dedup title| fields title frozenTimePeriodInSecs maxGlobalDataSizeMB

ii). CMMasterRemoteStorageThread then retrieves the list of the warm buckets that need to be frozen, as described in step ii) of the procedure above.

05-21-2019 02:24:00.846 +0000 INFO  CMMasterRemoteStorageThread - Will initiate retrieving the list of buckets to be frozen for remote storage retention for index=_internal with frozenTimePeriodInSecs=2592000 and maxGlobalDataSizeMB=0
05-21-2019 02:24:00.846 +0000 INFO  CMMasterRemoteStorageThread - retrieving the list of buckets to be frozen for remote storage retention for index=_internal with search=| dbinspect index=_internal cached=true timeformat=%s| search state=warm OR state=cold| search modTime != 1| stats max(endEpoch) AS endEpoch BY bucketId| sort -endEpoch| search endEpoch<1555813440| fields bucketId, endEpoch
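
The endEpoch cutoff in the logged search appears to be the time of the retention run minus frozenTimePeriodInSecs. The short Python check below reproduces it from the values in the log lines above; the helper name freeze_cutoff is made up, and the bucket's latest event time (1555290488) is taken from the bucket's log entries in this walkthrough.

```python
from datetime import datetime, timezone


def freeze_cutoff(now, frozen_time_period_in_secs):
    """Epoch time before which a bucket's newest event makes it eligible to freeze."""
    return int(now.timestamp()) - frozen_time_period_in_secs


# Time of the retention run (from the log timestamp, UTC) and the index setting.
now = datetime(2019, 5, 21, 2, 24, 0, tzinfo=timezone.utc)
cutoff = freeze_cutoff(now, 2592000)  # 2592000 s = 30 days
print(cutoff)  # 1555813440, the value in "search endEpoch<1555813440"

# Newest event in bucket _internal~697~... according to its log entries.
bucket_end_epoch = 1555290488
print(bucket_end_epoch < cutoff)  # True, so the bucket is selected for freezing
```

Because the comparison uses the bucket's newest event, a bucket is only picked once every event in it is older than frozenTimePeriodInSecs, which matches the IMPORTANT note in the spec.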

iii). It then retrieves the list of peers that have this bucket and picks an indexer at random to start freezing the bucket, both from remote storage and locally. In our case it is TEST_INDEXER:8089.

05-21-2019 02:24:00.945 +0000 INFO  CMMasterRemoteStorageThread - Freezing bid=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC based on frozenTimePeriodInSecs
05-21-2019 02:24:01.040 +0000 INFO  CMMasterRemoteStorageThread -  freezing buckets=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC as part of remote storage retention on hp=TEST_INDEXER:8089

iv). The CM posts an HTTP request to the indexer to freeze the bucket, as described in step iv) of the procedure above. The entry below is logged on the indexer; XX.X.XX.XXX is the IP address of the CM.

XX.X.XX.XXX - splunk-system-user [21/May/2019:02:24:01.046 +0000] "POST /services/cluster/slave/control/control/freeze-buckets HTTP/1.1" 200 1804 - - - 1ms

v). The peer now freezes the bucket and removes it both from remote storage and locally, as described in step v) of the procedure above.

05-21-2019 02:24:01.047 +0000 INFO  DatabaseDirectoryManager - cid="bid|_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC|" found to be on remote storage
05-21-2019 02:24:01.047 +0000 INFO  BucketMover - Will freeze bkt=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC path='/opt/splunk/var/lib/splunk/_internaldb/db/db_1555290488_1554859220_697_9EA978AA-B109-473E-A526-F09AEE391FCC'
05-21-2019 02:24:01.047 +0000 INFO  BucketMover - RemoteStorageAsyncFreezer trying to freeze bid=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC, freezeInitiatedByAnotherPeer=false
05-21-2019 02:24:01.047 +0000 INFO  CacheManager - Evicted cachedId=bid|_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC| freed_space=0 reason= manual earliest_time=1554859220 latest_time=1555290488
05-21-2019 02:24:01.047 +0000 INFO  CacheManager - will remove cacheId="bid|_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC|" removeRemote=1
05-21-2019 02:24:01.303 +0000 INFO  CMSlave - deleteBucket bid=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC, frozen=true
05-21-2019 02:24:01.303 +0000 INFO  BucketMover - RemoteStorageAsyncFreezer freeze completed succesfully for bid=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC
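
When correlating these entries, it helps to split the bucket id and the bucket directory name into their parts. The Python sketch below assumes the layouts seen in the log lines above: the bid is index~localId~guid, and the directory is db_latestTime_earliestTime_localId_guid (the two epoch values match latest_time and earliest_time in the CacheManager entry). The function and class names are made up for illustration.

```python
from typing import NamedTuple


class BucketDir(NamedTuple):
    latest_time: int
    earliest_time: int
    local_id: int
    guid: str


def parse_bucket_id(bid):
    """Split 'index~localId~guid' into its three parts."""
    index, local_id, guid = bid.rsplit("~", 2)
    return index, int(local_id), guid


def parse_bucket_dir(name):
    """Split 'db_<latest>_<earliest>_<localId>_<guid>' into its parts."""
    prefix, latest, earliest, local_id, guid = name.split("_", 4)
    if prefix != "db":
        raise ValueError(f"unexpected bucket directory prefix: {prefix!r}")
    return BucketDir(int(latest), int(earliest), int(local_id), guid)


bid = "_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC"
dirname = "db_1555290488_1554859220_697_9EA978AA-B109-473E-A526-F09AEE391FCC"

index, local_id, guid = parse_bucket_id(bid)
info = parse_bucket_dir(dirname)
print(index, local_id, guid)
print(info.earliest_time, info.latest_time)
```

With both parsed, the local id and GUID from the bid can be matched against the directory name to confirm that a log entry and a path refer to the same bucket.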

NOTE: Two fields in these log lines help identify which peer initiated the freeze of the bucket on remote storage:

a) freezeInitiatedByAnotherPeer=false -> The freeze was initiated by this peer, so it is this peer's responsibility to also remove the bucket from remote storage.

b) removeRemote=1 -> This peer is removing the bucket from remote storage.

The corresponding entries in audit.log on the peer, for removing the bucket from remote storage and for evicting the bucket locally:

05-21-2019 02:24:01.301 +0000 INFO  AuditLogger - Audit:[timestamp=05-21-2019 02:24:01.301, user=n/a, action=remote_bucket_remove, info=completed, cache_id="bid|_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC|", receipt_id=trek/_internal/db/b1/ee/697~9EA978AA-B109-473E-A526-F09AEE391FCC/receipt.json, prefix=trek/_internal/db/b1/ee/697~9EA978AA-B109-473E-A526-F09AEE391FCC][n/a]
05-21-2019 02:24:02.230 +0000 INFO  AuditLogger - Audit:[timestamp=05-21-2019 02:24:02.230, user=n/a, action=local_bucket_evict, info=completed, cache_id="bid|_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC|", kb=15, elapsed_ms=1, files="strings_data,sourcetypes_data,sources_data,hosts_data,lex,tsidx,bloomfilter,journal_gz,deletes,other"][n/a]

vi). The peer now sends a freeze notification back to the CM. The line below is logged on the CM; YY.Y.YY.YYY is the IP address of the peer that sent the request.

YY.Y.YY.YYY - splunk-system-user [21/May/2019:02:24:01.363 +0000] "POST /services/cluster/master/control/control//freeze-buckets/?output_mode=json HTTP/1.1" 200 305 - - - 0ms

Now, CM will remove the bid from its memory for this peer.

05-21-2019 02:24:01.363 +0000 INFO CMMaster - event=removeBucket remove bucket bid=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC from peer=65579980-63C7-4ABF-89F1-1E12D852E3F4 peer_name=TEST.com:8089 frozen=true
05-21-2019 02:24:01.363 +0000 INFO CMBucket - Freezing bid=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC on peer=TEST.com:8089 guid=65579980-63C7-4ABF-89F1-1E12D852E3F4 peer_name=TEST.com:8089
05-21-2019 02:24:01.363 +0000 INFO CMPeer - removing bid=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC from peer=65579980-63C7-4ABF-89F1-1E12D852E3F4 peer_name=TEST.com:8089

vii). Finally, the CM sends a notification to the other peers to remove the bucket on their end. It does this by running a batch_bucket_frozen job, which posts the HTTP request to the peers as described in step vi) of the procedure above.

05-21-2019 02:24:02.204 +0000 INFO CMRepJob - running job=batch_bucket_frozen guid=08D702A6-9EA2-4634-B07B-7D30D32DD87C hp=TEST2.com:8089 _internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC

POST request on peer from CM in splunkd_access.log:

XX.X.XX.XXX - splunk-system-user [21/May/2019:02:24:02.211 +0000] "POST /services/cluster/slave/control/control/freeze_buckets_by_list?output_mode=json HTTP/1.1" 200 322 - - - 1ms

Here XX.X.XX.XXX is the IP address of the CM.

Peer finally removes the bucket from its local storage.

05-21-2019 02:24:02.212 +0000 INFO  DatabaseDirectoryManager - cid="bid|_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC|" found to be on remote storage
05-21-2019 02:24:02.212 +0000 INFO  BucketMover - Will freeze bkt=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC path='/opt/splunk/var/lib/splunk/_internaldb/db/db_1555290488_1554859220_697_9EA978AA-B109-473E-A526-F09AEE391FCC'
05-21-2019 02:24:02.212 +0000 INFO  BucketMover - RemoteStorageAsyncFreezer trying to freeze bid=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC, freezeInitiatedByAnotherPeer=true
05-21-2019 02:24:02.212 +0000 INFO  CacheManager - Evicted cachedId=bid|_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC| freed_space=0 reason= manual earliest_time=1554859220 latest_time=1555290488
05-21-2019 02:24:02.212 +0000 INFO  CacheManager - will remove cacheId="bid|_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC|" removeRemote=0
05-21-2019 02:24:02.220 +0000 INFO  CMSlave - deleteBucket bid=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC, frozen=true
05-21-2019 02:24:02.220 +0000 INFO  BucketMover - RemoteStorageAsyncFreezer freeze completed succesfully for bid=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC

Because this peer only received the request from the CM to freeze/remove the bucket locally, its log shows freezeInitiatedByAnotherPeer=true and removeRemote=0.
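
If you need to find the initiating peer across many splunkd.log files, the freezeInitiatedByAnotherPeer field can be pulled out with a small script. The Python sketch below is one way to do it, with a regular expression written against the two BucketMover lines shown in this walkthrough; the role labels "initiator" and "follower" are my own wording, not Splunk terminology.

```python
import re

# Matches the BucketMover line that reports who initiated the freeze.
FREEZER_RE = re.compile(
    r"RemoteStorageAsyncFreezer trying to freeze bid=(?P<bid>\S+?),\s*"
    r"freezeInitiatedByAnotherPeer=(?P<other>true|false)"
)


def freeze_role(line):
    """Return (bid, role) for a matching log line, or None if it does not match."""
    m = FREEZER_RE.search(line)
    if m is None:
        return None
    role = "follower" if m.group("other") == "true" else "initiator"
    return m.group("bid"), role


lines = [
    "05-21-2019 02:24:01.047 +0000 INFO  BucketMover - RemoteStorageAsyncFreezer "
    "trying to freeze bid=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC, "
    "freezeInitiatedByAnotherPeer=false",
    "05-21-2019 02:24:02.212 +0000 INFO  BucketMover - RemoteStorageAsyncFreezer "
    "trying to freeze bid=_internal~697~9EA978AA-B109-473E-A526-F09AEE391FCC, "
    "freezeInitiatedByAnotherPeer=true",
]

for line in lines:
    print(freeze_role(line))
```

The peer whose line comes back as "initiator" is the one that also removed the bucket from remote storage (removeRemote=1 in its CacheManager entry).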

Path Finder

Great answer @rbal_splunk

Definitely took time to answer this in depth.

Appreciated,
