We are currently performing a POC using Splunk 4.1.3 to index Blue Coat proxy data. Our test Splunk license is for 200gig a day. I have indexed around 800gig worth of raw data on our POC server.
The Splunk documentation at http://www.splunk.com/base/Documentation/latest/Installation/HowHowmuchspaceyouwillneed gives some commands to run to get an idea of how much storage you will need. I ran the commands and got the following numbers (I am only using the default DB):
[root@ssdatacrusher2 db]# du -shc hot_v*/rawdata
3.2G hot_v1_22/rawdata
31M hot_v1_25/rawdata
3.9G hot_v1_35/rawdata
2.7G hot_v1_38/rawdata
9.8G total
[root@ssdatacrusher2 db]# du -ch hot_v*
4.0K hot_v1_22/splunk_optimize_logs
4.0K hot_v1_22/rawdata/.compressedAddresses
3.2G hot_v1_22/rawdata
8.0G hot_v1_22
0 hot_v1_22.sentinel
4.0K hot_v1_25/splunk_optimize_logs
4.0K hot_v1_25/rawdata/.compressedAddresses
31M hot_v1_25/rawdata
62M hot_v1_25
0 hot_v1_25.sentinel
4.0K hot_v1_35/splunk_optimize_logs
4.0K hot_v1_35/rawdata/.compressedAddresses
3.9G hot_v1_35/rawdata
9.5G hot_v1_35
0 hot_v1_35.sentinel
4.0K hot_v1_38/splunk_optimize_logs
4.0K hot_v1_38/rawdata/.compressedAddresses
2.8G hot_v1_38/rawdata
7.3G hot_v1_38
0 hot_v1_38.sentinel
25G total
However, when I ran du -h on the $SPLUNK_HOME/var directory, I saw around 400gig of disk usage:
406G ./var
406G .
[root@ssdatacrusher2 splunk]# pwd
/opt/splunk
Most of the 400gig of disk usage is in directories whose names end with rawdata.
When the data is rolled from hot to warm, will these directories be deleted so that my disk usage goes down, or will it stay around the same since I will still be indexing more raw data?
Following on from gkanapathy -
When Splunk moves data from the Hot DB to the Warm DB, nothing is deleted; it is simply moved.
When Splunk moves data from the Warm DB to the Cold DB, nothing is deleted; it is simply moved.
When Splunk "retires" data from the Cold DB, it is deleted unless you have configured a coldToFrozenScript in indexes.conf. This is done as part of a larger exercise to configure your data retirement policy; the Splunk documentation on setting a retirement and archiving policy covers this subject.
A follow-on page in the same documentation explains how to set up your script, along with some other options you may want to consider.
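As a rough illustration of what a coldToFrozenScript could look like, here is a minimal Python sketch. It assumes Splunk invokes the script with the bucket directory as its first argument and only deletes the bucket once the script exits successfully; check the indexes.conf documentation for your version before relying on that. The archive path and the function names are hypothetical, not anything Splunk ships.

```python
"""Hypothetical coldToFrozenScript sketch: copy a bucket aside before it is deleted."""
import os
import shutil
import sys

# Assumption: an archive location with enough space; this path is made up.
ARCHIVE_DIR = "/mnt/archive/splunk_frozen"


def archive_bucket(bucket_path, archive_dir=ARCHIVE_DIR):
    """Copy the whole bucket directory into archive_dir and return the new path."""
    bucket_path = os.path.normpath(bucket_path)
    if not os.path.isdir(bucket_path):
        raise ValueError("not a bucket directory: %s" % bucket_path)
    os.makedirs(archive_dir, exist_ok=True)
    dest = os.path.join(archive_dir, os.path.basename(bucket_path))
    if os.path.exists(dest):
        # Refuse to overwrite an earlier archive copy of the same bucket.
        raise ValueError("archive copy already exists: %s" % dest)
    shutil.copytree(bucket_path, dest)
    return dest


def main(argv, archive_dir=ARCHIVE_DIR):
    """Return 0 on success, 1 on bad usage; a non-zero exit should signal failure."""
    if len(argv) != 2:
        sys.stderr.write("usage: %s <bucket_path>\n" % argv[0])
        return 1
    archive_bucket(argv[1], archive_dir)
    return 0
```

To use it as a real script you would end the file with `sys.exit(main(sys.argv))` and point coldToFrozenScript at it. Copying rather than moving is a deliberate choice here: if the copy fails partway, the original bucket is still intact.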
The main thing to understand here is the set of states your data moves through as Splunk indexes it and as it ages. You can't keep data in the warm DB forever unless you have a lot of space or are indexing very little, so you need to consider how much space you want to use and how often you want to access the data. If you usually run searches on data from the last week or two, then that's all you really need to keep in the hot and warm DBs, and you can move your cold DB off to a cheap NFS location somewhere. Searching data on an NFS location is slower than searching local disk, so if you're going to be regularly searching over data from the last 3 to 6 months and speed is important to you, you will want to size your Splunk server accordingly and give it a lot of local storage.
RAID 5 is not a good storage solution for Splunk: it is slow, particularly for writes because of the parity overhead, and Splunk can only work as fast as the storage volume allows. If you're interested in fast performance, don't go down that route.
Another consideration is that 100GB of data doesn't necessarily equate to 100GB of disk space when indexed. The compression ratio depends on the data, but you could find that 100GB equates to 50GB of space within $SPLUNK_DB. The only way to know for sure is to run some tests with known data volumes.
Also remember that cheap NFS storage is perfectly acceptable for older data you don't search very often.
Thanks for the answers. This will definitely help me size our Splunk environment a little better. I am being asked to keep 6 months' worth of this data. If my math is right (100gig a day after indexing * 180 days), I am looking at around 18TB of data.
With the servers that I am planning on purchasing, I would be able to get around 8.4TB of local disk using RAID 10. If I move those servers to RAID 5, I would be able to get around 14TB. What are your thoughts on using RAID 5 for Splunk index servers that are set up as a sinkhole, collecting and indexing files every hour?
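The arithmetic in this comment can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the figures quoted above; the function names are made up, it treats 1TB as 1000GB, and it ignores headroom, growth, and any split between local hot/warm storage and cold storage elsewhere.

```python
import math


def retention_tb(indexed_gb_per_day, days):
    """Total indexed size over the retention window, in decimal TB."""
    return indexed_gb_per_day * days / 1000.0


def servers_needed(total_tb, usable_tb_per_server):
    """Smallest whole number of servers whose combined disk holds total_tb."""
    return math.ceil(total_tb / usable_tb_per_server)


total = retention_tb(100, 180)          # 100 GB/day on disk, kept for 180 days
raid10 = servers_needed(total, 8.4)     # usable TB per server with RAID 10
raid5 = servers_needed(total, 14)       # usable TB per server with RAID 5
print(total, raid10, raid5)             # prints: 18.0 3 2
```

So on capacity alone the difference is three servers versus two; whether that saving is worth the slower storage is the question being asked.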
No, this data will not be deleted, and it is not meant to be deleted (unless it ages out by policy). The commands listed do not give you the actual size of the total data; they are intended to give you some idea of the ratio of raw data size to stored disk size.
I would also say that the commands in the documentation are not correct or useful the way you have used them. I'd pretty much ignore them, as the results they gave you above don't say much.
The best thing for you is to just find out how much space is taken up by /opt/splunk/var/lib/splunk and compare that to the raw amount of data you have indexed. If you started with an empty index and the 800GB of raw is the only thing put into that index, then that will give you an indication of the size ratio you can expect. My guess is that it's pretty close to 400GB/800GB = 50%, i.e., the index will need about 50% of the raw size, assuming the data sample is representative of all your data.
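If you want to script that comparison, something like the following would do it. This is a sketch, not a Splunk tool: the helper names are invented, and it sums file sizes as reported by the filesystem, so the result can differ slightly from what du reports in disk blocks.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # Files can disappear while Splunk is writing to hot buckets.
                pass
    return total


def index_ratio(index_bytes, raw_bytes):
    """Fraction of the raw input size that the index occupies on disk."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    return index_bytes / raw_bytes
```

For example, `index_ratio(dir_size_bytes("/opt/splunk/var/lib/splunk"), raw_bytes)` with `raw_bytes` set to the size of the raw data you fed in; with the rough numbers from this thread, `index_ratio(400, 800)` gives 0.5.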
According to http://www.splunk.com/base/Documentation/4.1.4/Admin/Backupindexeddata, it looks like hot buckets are renamed when they roll to warm. Warm buckets rolling to cold are either renamed or moved, depending on whether the two locations are on the same filesystem.