Getting Data In

What if sensitive data has to be aged out by time rather than by bucket size?

mathiask
Communicator

Hi, I know the linked question is quite similar, but it does not answer everything (I think), and something might have changed since then:
http://answers.splunk.com/answers/36861/indexes-tiering-age-based-not-depend-from-size.html

According to the documentation (index and cluster administration):

Hot bucket:
- Data is actively written to it, so don't back it up
- maxDataSize defines when the bucket rolls to warm and a new hot bucket is started; it also defines the general bucket size
- maxHotBuckets: default 3, min 2; number of concurrent hot buckets

Warm bucket:
- maxWarmDBCount: number of concurrent warm buckets; when exceeded, the oldest one rolls to cold

Cold bucket:
- generally like a warm bucket, but without a count limit

Frozen:
- deleted by default, can be archived instead, not searchable
- cold buckets get rolled to frozen when they hit either frozenTimePeriodInSecs or maxTotalDataSizeMB
- frozenTimePeriodInSecs: maximum age of index data, default ~6 years
- maxTotalDataSizeMB: maximum index size
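
If I read the docs correctly, these settings end up in indexes.conf roughly like this (the index name is just a placeholder and the values are the documented defaults as I understand them):

    [sensitive_data]
    homePath   = $SPLUNK_DB/sensitive_data/db
    coldPath   = $SPLUNK_DB/sensitive_data/colddb
    thawedPath = $SPLUNK_DB/sensitive_data/thaweddb
    # Hot: roll to warm at the default "auto" bucket size, up to 3 concurrent hot buckets
    maxDataSize = auto
    maxHotBuckets = 3
    # Warm: keep at most 300 warm buckets in homePath before the oldest rolls to cold
    maxWarmDBCount = 300
    # Frozen: delete (default) a bucket once all its data is older than ~6 years,
    # or once the whole index exceeds 500 GB
    frozenTimePeriodInSecs = 188697600
    maxTotalDataSizeMB = 500000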

So to the questions
1. Is there a difference between warm and cold besides the bucket limit? Or is it mainly for storage reasons, i.e. warm is kept on faster storage while cold is accessed less and can therefore be moved to cheaper storage?
2. Is the ageing of the buckets really defined mainly by maxDataSize? So for an index with a lot of data the buckets age faster, for one with fewer logs they age slower, and they age erratically when the data is "bursty"? Is there any possibility to roll buckets by time rather than by size and count?
3. What happens in the case of a low-volume index where the data ages slower, but frozenTimePeriodInSecs is very low, like 24h or even less? Does this also affect warm buckets if they contain data older than 24h? What happens if there is only a hot bucket?

The potential use case is to remove sensitive data automatically as soon as it is not needed anymore.

Thanks for your help

1 Solution

jrodman
Splunk Employee

Question 1 - Warm vs Cold is just to allow for split storage, nothing more. Typically fast vs cheap but it could be ad-hoc for "some storage" + "some more storage" as people grow.
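
For illustration only (index name and paths are made up), that split is just homePath and coldPath pointing at different filesystems in indexes.conf:

    [sensitive_data]
    # hot + warm buckets on fast storage
    homePath   = /fast_ssd/splunk/sensitive_data/db
    # cold buckets on cheaper storage
    coldPath   = /slow_nas/splunk/sensitive_data/colddb
    thawedPath = /slow_nas/splunk/sensitive_data/thaweddb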

Question 2 - maxDataSize is the ceiling size for a bucket. That's it. It thus influences transitioning from hot state to warm state, or "when we stop putting data into the bucket", which is really what that means.
If you're controlling warm by count, that indirectly influences when things leave warm for cold, but I think most users these days would rather control warm by size (homePath.maxDataSizeMB) or put the whole thing in a volume and have that control by size.
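
A rough sketch of controlling warm by size, added to the existing index stanza (the index name and the 50 GB figure are arbitrary):

    [sensitive_data]
    # cap hot + warm storage at ~50 GB; the oldest warm buckets roll to cold beyond that
    homePath.maxDataSizeMB = 51200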

There are no controls to roll from warm to cold (swap filesystems) based on time. I can see why you might want to, but usually volumes achieve the same goal, since a volume will roll out the oldest bucket of everything it contains based on age.
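
Roughly what that looks like, with made-up names, paths, and sizes:

    [volume:cold_store]
    path = /slow_nas/splunk
    # when the volume's total usage exceeds this, the oldest bucket it contains is rolled out
    maxVolumeDataSizeMB = 500000

    [sensitive_data]
    homePath   = $SPLUNK_DB/sensitive_data/db
    # cold buckets live on the shared volume
    coldPath   = volume:cold_store/sensitive_data/colddb
    thawedPath = $SPLUNK_DB/sensitive_data/thaweddb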

Question 3 - If an index contains a non-hot bucket where all data in the bucket is older than frozenTimePeriodInSecs, then it is frozen (by default, deleted; modify with coldToFrozenDir / coldToFrozenScript). Thus a low value like 24 hours (24 * 60 * 60) will typically cause a bucket to live for 24 hours after rolling to warm, assuming the data arriving in the bucket was data for "now". If the bucket was receiving only old data, it might vanish immediately on rolling. If the bucket is receiving future-dated data, it could live for quite a while, depending on how far in the future. Obviously future data is usually a misconfiguration or misbehavior.
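
As a sketch (index name, value, and path are only illustrative), that time-based expiry would be configured like this:

    [sensitive_data]
    # freeze (delete by default) a bucket once all its data is older than 24 hours
    frozenTimePeriodInSecs = 86400
    # optionally archive instead of deleting:
    # coldToFrozenDir = /archive/splunk/sensitive_data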


mathiask
Communicator

Thanks again for the answers.

I perfectly understand the reasoning behind the size-based approach and the potential implications.
For us it is very important that we can provide an automated process that ensures we stay compliant with certain agreements.

I implemented the config and it works as desired.

mathiask
Communicator

Thank you so much ... awesome answers
So the configs are mainly focused on data size, since from the administration point of view that is the main issue.
Time-based deletion is possible even if it is not the normal use case, and given answer 3, time-based rolling from warm to cold is not needed anyway.
Thank you also for pointing out the potential "misconfigurations".

jrodman
Splunk Employee

Time-based deletion is a common desire, and is handled first class, though as Herr Mueller says, it can be tricky if you have very exacting requirements for exactly when the data must be fully gone.

Moving from warm to cold based on time is a much less common goal. There are only so many gigabytes/terabytes/exabytes of fast storage, and giving priority to one data category over another based on size usually gives you clear behavior, while doing it based on time would leave it fairly unclear how the resource ends up allocated. If you need to do that, it requires futzing around or an ER (enhancement request).

martin_mueller
SplunkTrust

When dealing with sensitive, must-delete-soon data you probably want to set maxHotSpanSecs to a value smaller than your must-delete-by deadline. Using a 24-hour frozenTimePeriodInSecs as an example, forcing Splunk to roll a hot bucket to warm once it spans two hours means the data in it would be deleted after at most 26 hours, even if low volume hadn't triggered maxDataSize yet.
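
A sketch of that combination, assuming a hypothetical index with a 24-hour retention requirement:

    [sensitive_data]
    # force hot buckets to roll to warm once they span two hours of event time
    maxHotSpanSecs = 7200
    # freeze (delete by default) a bucket once all its data is older than 24 hours
    frozenTimePeriodInSecs = 86400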
