Monitoring Splunk

Put Data in Separate Index Based on Timestamp


We have Splunk as part of our default VM image, but we're having some bucket issues. Initially, the system clock isn't set and is generally very far in the past, so Splunk starts indexing data with incorrect timestamps. This creates buckets that span long time ranges, which kills search performance.

I found this entry on the wiki, but the link is dead and I can't find anything that shows how to use transforms based on timestamps. All I can see is a regex field, which would be very difficult to construct if, for example, I wanted everything older than 30 days to go to a separate index.

http://www.splunk.com/wiki/Deploy:UnderstandingBuckets

"If you have to load up a bunch of archive data, Splunk recommends that you create a separate index for it. Refer to this topic in the Administration Guide for information on doing this. You can specify a regex to force all data with timestamps older (or newer) than a given time range to be placed in an alternate index. "

The dead link in question here is http://www.splunk.com/base/Documentation/latest/admin/RouteEventToIndex. I have found plenty of related pages in the documentation, but none use timestamps, and I can't find anything in transforms.conf that takes a time parameter other than lookups.
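For reference, the only routing mechanism I can find keys on the raw event text via props.conf and transforms.conf, roughly along these lines (the sourcetype, transform, and index names here are placeholders I made up; only the REGEX/DEST_KEY/FORMAT keys are real settings):

```ini
# props.conf -- attach a routing transform to a sourcetype
[my_sourcetype]
TRANSFORMS-route_archive = route_archive_events

# transforms.conf -- route matching events to a different index
[route_archive_events]
# REGEX matches against the raw event text, not the parsed timestamp,
# which is why a rule like "older than 30 days" is hard to express here
REGEX = ^(Jan|Feb|Mar)\s+\d+\s+2008
DEST_KEY = _MetaData:Index
FORMAT = archive_index
```

The problem is exactly that REGEX runs against raw event text, so a relative-time condition like "older than 30 days" has no natural expression.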

1 Solution

Splunk Employee

This is not really possible, and in Splunk 4.0 it is not necessary, as this is handled automatically within an index.

In Splunk 4.0, each index can have multiple hot buckets, and data will be placed into a hot bucket according to where it fits in time. The exact time differential is determined by a combination of index settings (in indexes.conf) and automatic determinations by Splunk (which are in turn driven by the timestamp ranges of the incoming data).

In Splunk 4.0, the default hot bucket has a maximum time range of 90 days, but in most deployments Splunk will keep buckets to a much narrower range than that. Furthermore, there are quarantine buckets intended to capture all data older than, or further in the future than, certain configured thresholds, to keep it from polluting the index too badly (e.g., data from 7 years ago and data from 10 years ago will go into the same quarantine bucket, even though their timestamps are more than 90 days apart).

Generally, you do not need to adjust these settings, provided you use the same index settings as the main index, which can have up to 10 hot buckets. The effect is that your bucket spans stay narrow and you will not get overlapping buckets (except under extremely pathological conditions, such as your data arriving with timestamps from 11 different periods, each spaced more than 90 days apart -- this usually only happens when you're loading archived data, which calls for more control and a separate index anyway).
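If you do need to tune this behavior, the relevant knobs live in indexes.conf. A minimal sketch, assuming a dedicated archive index (the stanza name and paths are placeholders, and the values shown are illustrative -- check the defaults for your Splunk version before relying on them):

```ini
# indexes.conf -- per-index bucket and quarantine tuning
[archive]
homePath   = $SPLUNK_DB/archive/db
coldPath   = $SPLUNK_DB/archive/colddb
thawedPath = $SPLUNK_DB/archive/thaweddb

# Allow several concurrent hot buckets so data from different
# time ranges lands in different buckets instead of one wide one
maxHotBuckets = 10

# Maximum timespan (in seconds) a single hot bucket may cover;
# 7776000 seconds = 90 days, matching the default described above
maxHotSpanSecs = 7776000

# Events this far in the past or future (relative to now) are
# diverted into quarantine buckets instead of normal hot buckets
quarantinePastSecs   = 77760000
quarantineFutureSecs = 2592000
```

In practice, leaving these at their defaults and reserving a separate index for bulk archive loads is usually the better trade-off, as described above.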

