I am using a Splunk Cloud environment. I would like to know how many buckets are created for an index and what the default size of a bucket is.
Issue in my environment:
We onboarded some log files into Splunk a couple of months ago, but the timestamps of those events show an older date, from the year 2016. The log format contains the time but not the date. According to the link below, Splunk should automatically apply the date, and it should mostly match the system time. Having an event with a 4-year-old date is incorrect.
Is the bucket name created from the actual earliest timestamp present in the bucket, or from the first event that was written to the bucket?
Example: A new hot bucket is created on 5th Jan 2020, so its first event is 01/05/2020 01:00:00:345. But due to the incorrect timestamp assignment I explained above, it also has an event with 04/03/2016 01:00:00:211 (a 4-year-old timestamp). Now, when it rolls to a cold bucket, what will the name be?
Will it be db_Feb10th2020_Jan5th2020_ or db_Feb10th2020_Apr3rd2016_?
I don't believe Splunk Cloud gives you control over your bucket size. In Splunk Enterprise the bucket size is set by maxDataSize in indexes.conf: the default, auto, is 750MB, and auto_high_volume is 10GB on 64-bit systems (1GB on 32-bit). It's probably a good guess that Splunk Cloud is using similar values, although they may be doing some tweaking behind the scenes.
The number of buckets you have is therefore dictated (mainly) by how much data you index. If you index 40GB into 10GB buckets, you probably have around 4 buckets.
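As a rough back-of-the-envelope check, the estimate above can be sketched in Python. This is only a lower bound under the assumption that every bucket fills completely before rolling. Real buckets also roll for other reasons, and the function name here is made up for illustration.

```python
import math

def estimate_bucket_count(indexed_gb: float, max_bucket_gb: float) -> int:
    """Lower-bound estimate: assumes each bucket fills to max size before rolling."""
    if max_bucket_gb <= 0:
        raise ValueError("max_bucket_gb must be positive")
    return math.ceil(indexed_gb / max_bucket_gb)

# 40GB of indexed data with 10GB buckets
print(estimate_bucket_count(40, 10))    # 4
# The same data with 750MB (0.75GB) buckets
print(estimate_bucket_count(40, 0.75))  # 54
```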
Splunk uses a number of techniques to assign a date to an event where a log file only contains times (not dates) as your first link describes.
I am assuming (hoping) that Splunk was able to use one of those methods to correctly establish the year for your old data.
If Splunk can infer the date either from the filename or from the modification date of the file, it will set the date correctly.
If Splunk cannot figure out the date, it will assume the date is "today" (meaning the date it was indexed on).
It sounds like in your case it was successful?
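To make the fallback order concrete, here is a simplified Python sketch of the three cases described above (filename, then file modification time, then "today"). It is not Splunk's actual implementation, which has more steps. The function name and the filename date pattern are assumptions for illustration only.

```python
import re
from datetime import date, datetime
from typing import Optional

# Hypothetical pattern: a YYYY-MM-DD or YYYYMMDD date somewhere in the file name.
_FILENAME_DATE = re.compile(r"(\d{4})-?(\d{2})-?(\d{2})")

def infer_event_date(filename: str, file_mtime: Optional[datetime], today: date) -> date:
    """Pick a date for an event whose log line carries a time but no date."""
    match = _FILENAME_DATE.search(filename)
    if match:
        try:
            year, month, day = (int(g) for g in match.groups())
            return date(year, month, day)
        except ValueError:
            pass  # digits matched but were not a valid calendar date
    if file_mtime is not None:
        return file_mtime.date()  # fall back to the file's modification date
    return today  # last resort: the date the event is indexed

# A file last modified in 2016 yields 2016 dates even if it is indexed in 2020.
print(infer_event_date("app.log", datetime(2016, 4, 3, 1, 0), date(2020, 1, 5)))
```

If the source file really carries an old modification date, a fallback like this would explain events landing in 2016 even though they were onboarded recently.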
The date range of a bucket, and therefore its warm/cold/frozen name, is determined when it rolls from hot to warm.
The name of the bucket is determined by both the oldest and newest events that it contains.
This means that if you have a bucket that is created (rolled) on 05Jan2020 it is "probable" (but not certain) that the last event in the bucket is from that date.
To work out what the name needs to be, Splunk checks in that bucket for the oldest event (based on the event time) and the newest.
If you have a bucket that includes really old data as well as very recent data, it will have a long timespan, e.g. db_"newestTime"_"oldestTime"_"ID"
So (in human readable terms) db_05Jan2020_03Apr2016_ID
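On disk the two times are written as epoch seconds rather than readable dates. A minimal Python sketch of how such a name is put together and read back, assuming the db_newest_oldest_ID layout described above (the helper names and the ID value 12 are made up for illustration):

```python
from datetime import datetime, timezone

def bucket_dir_name(newest: datetime, oldest: datetime, local_id: int) -> str:
    """Build a db_<newest>_<oldest>_<id> name using epoch seconds."""
    if oldest > newest:
        raise ValueError("oldest event cannot be later than newest event")
    return f"db_{int(newest.timestamp())}_{int(oldest.timestamp())}_{local_id}"

def bucket_span_days(name: str) -> float:
    """Parse a bucket name and return the time span it covers, in days."""
    _, newest, oldest, _ = name.split("_")
    return (int(newest) - int(oldest)) / 86400

newest = datetime(2020, 1, 5, 1, 0, 0, tzinfo=timezone.utc)
oldest = datetime(2016, 4, 3, 1, 0, 0, tzinfo=timezone.utc)
name = bucket_dir_name(newest, oldest, 12)
print(name)                    # db_1578186000_1459645200_12
print(bucket_span_days(name))  # 1372.0
```

A span of well over a thousand days in a single bucket is a quick sign that some events were given the wrong date.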
This means that data in that bucket will not roll to frozen until the newest event in it has aged past the frozen period (frozenTimePeriodInSecs in indexes.conf).
For this reason it is possible that you can have data which is "older" than your configured frozen period still searchable.
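The freezing rule can be sketched as a simple comparison on the newest event time only. This is an illustration of the behaviour described above, not Splunk's code. The function name is hypothetical, and the 90-day retention is an assumed value for the example.

```python
def is_eligible_to_freeze(newest_event_epoch: int, now_epoch: int,
                          frozen_time_period_secs: int) -> bool:
    """A bucket may freeze only once its NEWEST event is older than the retention period."""
    return now_epoch - newest_event_epoch > frozen_time_period_secs

RETENTION_SECS = 90 * 86400      # assumed 90-day retention for this example
newest_event = 1578186000        # 2020-01-05 01:00:00 UTC
oldest_event = 1459645200        # 2016-04-03 01:00:00 UTC (not used by the check)
now = 1581292800                 # 2020-02-10 00:00:00 UTC

# The 2016 event is far past retention, but the bucket as a whole is not.
print(is_eligible_to_freeze(newest_event, now, RETENTION_SECS))  # False
```

The oldest event never enters the check, which is why the 2016 events stay searchable until the whole bucket ages out.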