Deployment Architecture

Evaluation of storage and retention policy for cluster with 6 indexers and 'hundreds' of indexes

BlueSparrow
New Member

I am working on setting up a third-party evaluation of a new network management and security monitoring installation for an enterprise network that uses Splunk for various log aggregation purposes. The environment has 6 indexers with replication across 3 sites, and hundreds of indexes set up and configured by the installers.

The question that I need to write a test for: "Is there sufficient storage available for compliance with data retention policies? (e.g. is there sufficient storage available to meet 5-year retention guidelines for audit logs?)" I would like to run simple search strings to produce the necessary data tables. I am no wizard at writing the appropriate queries, and I don't have access to an environment complicated enough to try these things out before my limited window on the production environment to run my reports. After reading through the forums for hours, it seems that answering this storage question may be harder than originally anticipated, as Splunk does not seem to have any default awareness of how much on-disk space it is actually consuming.

1. My research so far suggests that I need to make sure the age-off and size cap for each index are appropriately set with the frozenTimePeriodInSecs and maxTotalDataSizeMB settings in each indexes.conf file. Is there a search I can run that will provide a simple table of these two settings for all indexes across the environment? e.g. index name, server, frozenTimePeriodInSecs, maxTotalDataSizeMB
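
From my forum reading, a REST-based search might produce exactly that table. A sketch, untested on my side, and assuming my role is allowed to run the rest command against all search peers:

| rest /services/data/indexes splunk_server=*
| rename title AS index
| table index splunk_server frozenTimePeriodInSecs maxTotalDataSizeMB
| sort index splunk_server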

2. Is there any other configuration that determines the space allocated to an index and that can be returned with a search?
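
For example, from what I've read, per-tier caps like homePath.maxDataSizeMB and coldPath.maxDataSizeMB can also limit an index, and they appear to be exposed on the same REST endpoint, along with the current size (again untested, field names from my reading):

| rest /services/data/indexes splunk_server=*
| rename title AS index
| table index splunk_server homePath.maxDataSizeMB coldPath.maxDataSizeMB currentDBSizeMB
| sort index splunk_server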

3. Is there a search string I can run to show the current storage consumption (size on disk) for all indexes on all servers? I have seen some options here on the forums, and I think the answer for this one might be the following:

| dbinspect index=*
| eval sizeOnDiskGB=sizeOnDiskMB/1024
| eval rawSizeGB=rawSize/1024/1024/1024
| stats sum(rawSizeGB) AS rawTotalGB, sum(sizeOnDiskGB) AS sizeOnDiskTotalGB BY index, splunk_server
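
Note: from what I can tell from the dbinspect documentation, rawSize is reported in bytes (not MB), hence the triple division to get GB above.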

4. What is the best search string to determine the average daily ingest "size on disk" by index and server/indexer, to calculate the storage required for retention policy purposes? So far, I have found something like this:

index="_internal" source="*metrics.log" per_index_thruput source="/opt/splunk/var/log/splunk/metrics.log" 
| eval gb=kb/1024/1024
| timechart span=1d sum(gb) as "Total Per Day" by series useother=f 
| fields - VALUE_*

I'm not quite sure what is happening above with the useother=f or the last line of the search. The thread I found it on is dead enough that I don't expect a reply.

I would need any/all results from these three searches in table format, sorted by index and server, to match up with each other for simple compilation.

Any help that can be provided is greatly appreciated.


PickleRick
SplunkTrust

From the end 😉

useother=f is an option to the timechart command which, along with the limit parameter, changes how timechart splits your data into series. The timechart command has a default limit of 10 series. So if you do a timechart by some field, it will generate separate time series only up to that limit. All remaining data will (or will not, depending on whether the useother parameter is set to true or false) be aggregated into a single series called "other". Setting limit to 0 causes timechart to generate a separate series for each value of the field you're splitting your data by, regardless of its cardinality.
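
So with hundreds of indexes you'll most likely want limit=0, which also makes useother moot; roughly like this:

index=_internal source=*metrics.log group=per_index_thruput
| eval gb=kb/1024/1024
| timechart span=1d limit=0 sum(gb) BY series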

And your main problem is, on the one hand, relatively easy, because your Splunk instance has some license limitations (I assume that if you have 3 separate sites they didn't go for workload-based licensing), so you have an upper limit on daily ingested data.
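
You can check how close you actually get to that limit from the license logs. A sketch, assuming the license manager forwards its _internal logs so they're searchable from your search head (the b field should be the ingested volume in bytes):

index=_internal source=*license_usage.log type=Usage
| eval gb=b/1024/1024/1024
| timechart span=1d sum(gb) AS dailyLicenseGB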

Unfortunately, it's never that simple.

1. No one said that your ingestion will be symmetrical across sites. At least you didn't say so. So the distribution across indexers might be skewed, depending on the replication settings.

2. Raw data is one thing, but there might be additional factors. How well does your data compress? (You can calculate that from the dbinspect output; see the sketch below.) Do you use many indexed fields? Do you use datamodel acceleration? If so, are the accelerated summaries stored on the same storage as the raw data buckets or on another volume? Do you use volumes? Do the volumes have size limits which could be reached?
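
A sketch for the compression ratio, assuming replicated copies of a bucket share the same bucketId (dedup it so each bucket counts once) and that rawSize is in bytes while sizeOnDiskMB is in MB:

| dbinspect index=*
| dedup bucketId
| stats sum(rawSize) AS rawBytes, sum(eval(sizeOnDiskMB*1024*1024)) AS diskBytes BY index
| eval compressionRatio=round(diskBytes/rawBytes, 2)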

Generally speaking, you'd need to process the dbinspect output and/or the output of the introspection REST endpoints. Doing that over a cluster is generally not that different from running it on your all-in-one lab under your desk, with the exception of possible multiple copies of the same bucket spread across indexers, and possible replicated-but-not-searchable buckets.
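
For the plain disk side of it, there's also a REST endpoint for partition usage; a sketch (field names and MB units from memory, so verify against your version's docs):

| rest /services/server/status/partitions-space splunk_server=*
| eval usedMB=capacity-free
| table splunk_server mount_point capacity usedMB free
| sort splunk_server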

Oh, and BTW, if I see that there are "hundreds" of indexes, I begin to wonder what the reason is. The two typical reasons for splitting data into separate indexes are access control (you grant access on a per-index basis) and retention settings. There is also the issue of cardinality of your data, so you might want to separate sources that log 10 events per day from those logging several million daily. But that's it.
