We want data in our indexes from the last week to be easily searchable, and data after that to be searchable with a little extra work. The best way to do this seems to be to make sure data less than a week old is in hot/warm buckets, and anything older than that is in cold. It looks like Splunk has a way to force data from hot to warm (maxHotSpanSecs) and cold to frozen (frozenTimePeriodInSecs), but nothing for warm to cold.
I've seen people recommend approximating the amount of data indexed per day, configuring maxDataSize to that amount, and then setting maxWarmDBCount to the number of days we want in hot/warm (7), but that seems too much like guesswork. If we get an unusually large amount of data one week, we won't have 7 days in hot/warm. And if our indexing volume changes over time (which seems inevitable), we'd constantly have to change the numbers.
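To make the concern concrete, here is a minimal sketch of how a size-based limit translates into days of retention. It assumes, for simplicity, one bucket's worth of data per day and treats hot/warm capacity as roughly maxDataSize times maxWarmDBCount; the function name and the sample volumes are made up for illustration, and real buckets will not line up this neatly with calendar days.

```python
def days_in_hot_warm(daily_mb, capacity_mb):
    """Count how many of the most recent days fit entirely in hot/warm.

    daily_mb: on-disk MB indexed per day, oldest day first (illustrative data).
    capacity_mb: approximate hot/warm capacity, e.g. maxDataSize * maxWarmDBCount.
    """
    used = 0
    days = 0
    # Walk backwards from the newest day, since the oldest data rolls first.
    for mb in reversed(daily_mb):
        if used + mb > capacity_mb:
            break
        used += mb
        days += 1
    return days


capacity = 7 * 1000                            # sized for 7 days at 1000 MB/day
steady = [1000] * 10                           # constant volume
spiky = [1000] * 6 + [3000] + [1000] * 3       # one 3x day in the last week

print(days_in_hot_warm(steady, capacity))      # 7
print(days_in_hot_warm(spiky, capacity))       # 5
```

A single heavy day within the window is enough to push two days of data out of hot/warm earlier than intended.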
Could I do this? (it would be scripted):
1. shut down splunk
2. move files older than 1 week from index/db/db_* to index/colddb/*
3. start splunk
Is there a better way?
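For what it's worth, step 2 could be sketched roughly as below. This only selects candidate buckets; it assumes warm bucket directories follow the db_<newest_epoch>_<oldest_epoch>_<id> naming used on a standalone indexer, and the helper name is hypothetical. It deliberately skips hot buckets and anything it does not recognize.

```python
import re
import time

# Assumed warm bucket directory naming: db_<newest_epoch>_<oldest_epoch>_<id>
BUCKET_RE = re.compile(r"^db_(\d+)_(\d+)_(\d+)$")


def buckets_older_than(names, cutoff_epoch):
    """Return bucket names whose newest event is older than cutoff_epoch."""
    old = []
    for name in names:
        match = BUCKET_RE.match(name)
        # Non-matching names (hot buckets, stray files) are left alone.
        if match and int(match.group(1)) < cutoff_epoch:
            old.append(name)
    return old


cutoff = int(time.time()) - 7 * 24 * 3600      # one week ago
sample = [
    "db_1300000000_1299900000_12",             # old warm bucket
    "hot_v1_13",                               # hot bucket, ignored
    "db_%d_%d_14" % (cutoff + 60, cutoff - 60),  # still has data newer than cutoff
]
print(buckets_older_than(sample, cutoff))
```

With Splunk stopped, the script would list the index's db directory (for example with os.listdir), pass the names to this function, and move each returned directory into colddb (for example with shutil.move). Whether Splunk picks up manually moved buckets cleanly on restart is something to verify on a test index first.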
We endeavor to do the same thing. We use volume storage and homePath.maxDataSizeMB on each of our indexes (we have about 20) to accomplish this. We do have to keep an eye on things and re-calculate periodically.
[volume:local]
path = /splunk_warm
maxVolumeDataSizeMB = 863411
[logsource1]
homePath = volume:local/logsource1/db
homePath.maxDataSizeMB = 2500
coldPath = volume:remote/logsource1/colddb
thawedPath = /splunk_warm/thaweddb
We have the volume limit so that we don't kill the disk, but also have all the individual index homePaths so that their total doesn't exceed the volume limit.
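The periodic re-calculation can be as simple as summing the per-index limits and comparing them to the volume limit. A rough sketch, where the function name and the second index (logsource2 and its 40000 MB limit) are hypothetical and only the logsource1 and volume figures come from the config above:

```python
def check_home_budget(volume_limit_mb, home_limits_mb):
    """Return (total of per-index homePath limits, remaining headroom) in MB."""
    total = sum(home_limits_mb.values())
    return total, volume_limit_mb - total


# homePath.maxDataSizeMB per index; logsource2 is a made-up example entry.
limits = {"logsource1": 2500, "logsource2": 40000}

total, headroom = check_home_budget(863411, limits)
print(total, headroom)   # 42500 820911
if headroom < 0:
    print("per-index homePath limits exceed maxVolumeDataSizeMB")
```

A negative headroom means the volume limit, not the per-index settings, would decide which buckets roll to cold.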
@gkanapathy, there is good reason for this. Our most voluminous logs are not the ones we search most often, while other logs are smaller but get searched daily, often over "All Time", for incident research. We like to roll the large, less-searched logs to cold more quickly, since cold is a slower storage medium. Likewise, we'd like to keep our most-searched log on warm storage longer than average. So, generally, we classify each log as searched "more frequently", average, or "less frequently" and try to adapt where its buckets are stored (warm versus cold) accordingly. It would be nice if there were an easier way to configure this (e.g. based on time,
I see this post is about 5 years old, but is there a way to do this now? I'd like to move the data between buckets based on time period.
There is really not much point in doing this. There is no difference at all in how "hard" it is to search for data in hot/warm vs cold. Functionally, they are identical. The only difference is that they may be stored on different volumes.
Presumably, hot/warm is faster than cold, so you would normally want to put as much as you can on that volume, and only when it is full do you move that data to cold. There's very little point to moving the data to cold just because it has reached a certain age, if there is still space on hot/warm.
Therefore, we have no facility for specifying the transition by time, only by the amount of space. The assumption is that you want to keep as much as possible on the hot/warm volume and only move to cold when its space is exhausted.
gkanapathy's point was that there is little reason to force this to happen before your filesystem fills up.
"There is really not much point in doing this. There is no difference at all in how "hard" it is to search for data in hot/warm vs cold."
I disagree. Since cold storage is most often slower I/O than warm, searching cold is slower, especially if you're returning results, as those results take longer to read from disk.
Well, with respect to the point about specifying a time range that includes them.
Ah lovely. No wonder I've been having so many problems...
The docs are wrong for version 4.x and up. They were correct for 3.x.
The Splunk docs say data in a cold bucket is searchable "only when the search specifies a time range included in these files", which seems to indicate there is a difference.
Storing them on a different volume is an additional advantage for us; the hot/warm buckets will be stored on better drives.
I do see your point, but we'd prefer to organize the data based on time. Since nearly every search specifies the time, we can know ahead of time relatively how responsive a search will be based on the time of the data we're searching for.
We move data from warm to cold based on quantity, oldest buckets first. You can get the behavior you need via this 'guesswork'. There's some effort towards doing this more explicitly in 4.2.