We are intending to input about 35GB/day into Splunk Enterprise. That can easily be handled by a single "reference" hardware indexer, even with some searching. However, our retention time is very long compared to most use cases I've read about. We intend to have 1 year of hot (12TB/yr) and 2-3 years of cold (24-36TB).
I've read that there is significant overhead in just indexing and holding large amounts of data, without even searching it.
Is there a formula for how many "reference" indexers are required (per TB) to just hold data?
Note that we do not intend to search it often; it is mainly intended for incident response. Of course we need to consider searching (and are looking at SSD for the hot tier), but that would be on top of what we need just to hold the data.
Overhead in indexing means the initial write to memory (hot) and then the roll from hot buckets to warm buckets to cold, along with the associated read activity once written to disk.
There is also overhead in searching that data, e.g., the IOPS required to read across your disks. Currently we recommend having 900 IOPS available for pure indexing environments; this includes bucket rolling, searching, etc. Three years isn't really that long. In certain spaces, 7 years is the de facto standard, but that's not live-searchable; it's archived and has to be thawed.
So here is a better question: what are your search requirements? Does your definition of 'hot' and 'cold' mean that you have to have 1 year of searchable data and then 2-3 years of archived data that can be searched on request? Or does it mean 3 years of searchable data from the beginning?
If you have a 1-year searchable requirement, then it's straightforward. You set a 365-day retention policy on your index(es), and after that you freeze your buckets and offload them to slow/cheap storage. Then, when someone needs to search those previous 2-3 years, the data would need to be thawed (restored), and then it's searchable.
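For reference, a 1-year retention with archive-on-freeze might look roughly like this in indexes.conf (the index name and archive path here are hypothetical placeholders, not from this thread):

```ini
# indexes.conf -- illustrative sketch; index name and paths are assumptions
[my_ir_index]
homePath   = $SPLUNK_DB/my_ir_index/db
coldPath   = $SPLUNK_DB/my_ir_index/colddb
thawedPath = $SPLUNK_DB/my_ir_index/thaweddb

# Freeze (age out) a bucket once its newest event is older than 365 days
frozenTimePeriodInSecs = 31536000

# Copy frozen raw data to cheap storage instead of deleting it at freeze time
coldToFrozenDir = /mnt/archive/my_ir_index
```

Without `coldToFrozenDir` (or a `coldToFrozenScript`), Splunk deletes buckets at freeze time, so one of those is what makes the later thaw possible.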
If you have a full 3-year search requirement, then it's better to figure out your primary search cases (are your searches really earliest=-1y@y, or are they 180/90/60 days?). You can keep the recent data on your hot/warm volumes, which are faster, and after that roll it to cold/slower volumes.
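A volume-based layout along those lines (hot/warm on fast disk, cold on slower storage) could be sketched in indexes.conf like this — all names, paths, and sizes below are illustrative assumptions:

```ini
# indexes.conf -- hypothetical volume layout, not from this thread
[volume:fast]
path = /ssd/splunk                # SSD or fast spinning disk for hot/warm
maxVolumeDataSizeMB = 7000000     # ~7TB

[volume:slow]
path = /nfs/splunk_cold           # slower/cheaper storage for cold
maxVolumeDataSizeMB = 20000000    # ~20TB

[my_ir_index]
homePath   = volume:fast/my_ir_index/db
coldPath   = volume:slow/my_ir_index/colddb
thawedPath = $SPLUNK_DB/my_ir_index/thaweddb   # thawedPath cannot use a volume
```

When the fast volume fills, Splunk rolls the oldest warm buckets to the cold path automatically, which is what gives you the fast-recent / slow-old split.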
If you do this, it is important to understand that slower spindles mean slower search results. Setting user expectations is important here. (A -15minute search and a -15month search will not take the same amount of time to finish.)
Thanks for the quick replies! Sorry I didn't supply all the details first.
The exact requirements are 1 year of quick(er) response to searches (warm) and another 2-3 years of slow(er) response to searches (cold), plus another 5-6 years of frozen data that has to be thawed before it can be used. So yes, we are expected to be able to search up to 4 years back without thawing.
The searchable raw data sets will be 12TB warm and 24-36TB cold. Using the sizing tool, the estimated physical space actually used will be 6-7TB warm and 2-3x that for the cold. I'm expecting to use either SSD or fast spinning disk for the warm tier, and dedicated NFS for the cold and frozen tiers ... assuming fast NFS is good enough for cold, given light searching.
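As a sanity check on those sizing-tool numbers, here is a rough back-of-envelope calculation. It assumes the common rule of thumb that on-disk usage is about 50% of raw ingest (compressed rawdata plus index files); that factor is an assumption for illustration, not a measured value:

```python
DAILY_RAW_GB = 35   # daily raw ingest from the thread
DISK_FACTOR = 0.5   # assumed: compressed rawdata + index files ~= 50% of raw

def storage_tb(days, daily_raw_gb=DAILY_RAW_GB, factor=DISK_FACTOR):
    """Estimated on-disk TB for a retention window of `days`."""
    return days * daily_raw_gb * factor / 1000

warm_tb = storage_tb(365)        # 1 year of warm
cold_tb = storage_tb(3 * 365)    # up to 3 further years of cold
print(f"warm ~= {warm_tb:.1f} TB, cold ~= {cold_tb:.1f} TB")
```

With these assumptions the warm estimate lands around 6.4TB and the cold around 19TB, consistent with the 6-7TB warm and ~20TB cold figures quoted in this thread.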
I've read that SSD (for the warm space) would be great for sparse searches; whether it's cost-effective depends on how much we search. It's not supposed to be an everyday tool, but one used in response to incidents. A few ad hoc searches (of the entire warm space) need to be completed in hours, not days. Searches in the cold space that take days (not weeks) are acceptable as well. (...Of course, that is until they discover the usefulness of the tool and change their minds about how much they use it... 😉)
So, if I'm clear on the overhead for indexing alone in our use case:
1) We have to write 35GB/day (raw) to memory (hot).
2) It is compressed and indexed to ~18GB/day and written to warm.
3) Data that is 1 year old (~18GB/day) has to be found, read and re-written/re-indexed to cold.
4) Data that is 4 years old (~18GB/day) has to be found, frozen and moved to frozen space.
So both the warm and the cold space have to write (including deletes) about 36GB/day each. This seems trivial in our case because it's a relatively small daily data set.
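A quick back-of-envelope check on that "trivial" claim: ~36GB/day of combined bucket writes and deletes per tier, spread over a day, works out to a very modest sustained throughput for any modern storage.

```python
# 36 GB/day per tier of bucket movement (writes + deletes), from the
# steps above, averaged over a day.
gb_per_day = 36

bytes_per_sec = gb_per_day * 1e9 / 86400  # 86400 seconds in a day
print(f"average sustained load ~= {bytes_per_sec / 1e6:.2f} MB/s")
```

That is well under half a megabyte per second on average (bucket rolls happen in bursts, of course, but the daily totals are small).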
I'm thinking the real overhead is searching the indexes for the oldest data to move? I'm assuming that happens multiple times a day; thus the bigger the space we scan, the more work is required. Will 900 IOPS cover it for either warm or cold, given a 20TB cold space and a 7TB warm space? Will a single "reference" hardware indexer handle the CPU requirements? Will it have enough left over to handle light/slow searches?
I have a side question: if someone focuses their search on 3-year-old data, it won't be moved into the warm space automatically, will it?
I'm hoping not, as that would bump the newer data to cold for a short-term gain.
Thanks again for all your thoughts.
To address a few points..
Side Question -- Answer
Splunk will not move data between Hot/Warm -> Cold -> Frozen based upon dispatched searches. This is done based on retention policies and, for frozen data, via a manual thawing process.
Regarding SSDs: in general testing they deliver better performance overall, and more specifically they deliver better performance for sparse searches.
Regarding NFS: typically this is fine for frozen and cold, but for anything else (hot/warm), you don't want to go there.
In regards to aging data out, remember that Splunk knows what's in the buckets (timestamps, sourcetypes, etc.), so rolling buckets based on retention time shouldn't be a huge CPU hit; it's more of a disk and controller load.
Other than that, it's hard to predict searchability without knowing the search types your users will be running. Typically in incident response you are limiting time ranges to within a few weeks or days of the known event, so historical searching shouldn't mean "all time" searches but rather smaller windows. If that's the case, performance should be pretty good, as long as the deliverable IOPS are there and the search head isn't overloaded with user-space knowledge objects and searches...
Mileage will vary..