
Estimating index storage requirements?



rturk (Builder) · 09-02-2012 06:28 AM

Hi Splunkers,

I've been doing some design documents for a fairly large distributed deployment of Splunk:

- 100GB/day license

- 2 geographically separated sets of 2 VM indexers (w/ direct attached storage)

- 90 days retention required

I'm now up to the point of estimating the amount of storage I need to give to each of the indexers (assuming the load is shared evenly among them). However, I've come upon a bit of a contradiction in the doco:

From Hardware Capacity Planning for your Splunk Deployment:

> At a high level, total storage is calculated as follows:
>
> daily average rate x retention policy x 1/2

So, given my specs above:

```
100GB x 90 days X 1/2 = 4.5TB total storage required between 4 indexers = 1.125TB/Indexer
```

BUT, from Estimate your storage requirements:

> Typically, the compressed rawdata file is approximately 10% the size of the incoming, pre-indexed raw data. The associated index files range in size from approximately 10% to 110% of the rawdata file.

So, given my same specs:

```
100GB/day x 90 days = 9TB total raw data to be indexed
(9TB x 10%) + ((9TB x 10%) x 110%) = 1,890GB total storage between 4 indexers = 472.5GB/Indexer
```

Have I missed something here? Which recommendation am I meant to run with?

1 Solution


_d_ (Splunk Employee) · 09-02-2012 08:21 AM

Yes, you have 🙂 and the use of the word "index" is the reason you're being misled in this case.

When raw data is indexed, for each bucket, at a minimum, we store:

- an index structure that is associated with it (think of the index at the end of each book)
- a compressed file which contains the actual raw data (this is where your events are stored).

So the math goes like this:

index size = (index structure) + (compressed raw data) = 1/2 (size of uncompressed raw data)

Given your specs, this is what you should use to calculate:

`100GB x 90 days X 1/2 = 4.5TB total storage required between 4 indexers = 1.125TB/Indexer`
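The rule of thumb above can be sketched in Python (a minimal illustration; the function name and the even split across indexers are assumptions, and the 1/2 factor is the docs' high-level estimate, not a measurement):

```python
def storage_per_indexer(daily_gb, retention_days, num_indexers, compression=0.5):
    """Rule-of-thumb estimate (hypothetical helper): total storage is
    daily rate x retention x 1/2, split evenly across indexers. Returns GB."""
    total_gb = daily_gb * retention_days * compression
    return total_gb / num_indexers

# 100 GB/day, 90-day retention, 4 indexers
print(storage_per_indexer(100, 90, 4))  # 1125.0 GB, i.e. ~1.125 TB per indexer
```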

Hope this helps,

d.


Re: Estimating index storage requirements?

rturk (Builder) · 09-02-2012 05:30 PM

*d*, I was probably going to err on the side of caution anyway, but this is the answer I was looking for. Cheers 🙂


Re: Estimating index storage requirements?

lzhang_soliton (Path Finder) · 03-14-2013 11:04 PM

*d*,

I am looking for the math describing how the data size changes between before and after indexing. Could you point me to how you derived it?

I calculated the sizes according to the document, as R.Turk wrote:

raw data size: 9TB

"rawdata file size": 9TB x 10%

Minimum index size: (9TB x 10%) + ((9TB x 10%) x 10%)

Maximum index size: (9TB x 10%) + ((9TB x 10%) x 110%)
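The minimum/maximum arithmetic above can be spelled out in Python (a hypothetical helper, using the 10% rawdata and 10%-110% index-file figures quoted from the docs earlier in the thread):

```python
def index_size_range(raw_gb, rawdata_ratio=0.10, idx_min=0.10, idx_max=1.10):
    """Return (min, max) on-disk size in GB for raw_gb of pre-indexed data:
    compressed rawdata plus associated index files at 10% and 110% of it."""
    rawdata = raw_gb * rawdata_ratio
    return rawdata * (1 + idx_min), rawdata * (1 + idx_max)

lo, hi = index_size_range(9000)  # 9 TB of raw data
print(lo, hi)  # ~990 GB minimum, ~1890 GB maximum
```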

Thank you in advance.


Re: Estimating index storage requirements?

theunf (Path Finder) · 07-24-2014 07:43 PM

I think you all missed the point of which replication_factor was used here and, maybe, whether multisite cluster replication was used.

If multisite replication was not used and the replication factor was 4, each log will reside on every node, so each node will need 4.5TB of disk.

Same scenario with a replication factor of 2: if only one node receives logs, 2 nodes will use 4.5TB and the other 2 nodes will receive nothing. The replication pair selection is automatic, and you can see it by searching _audit or under Settings -> Indexes on each node.
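The effect of the replication factor can be sketched like this (a hypothetical helper; the per-node figure assumes replicated buckets end up evenly balanced across the peers, which, as noted above, is not guaranteed):

```python
def clustered_storage_gb(daily_gb, retention_days, replication_factor,
                         num_indexers, compression=0.5):
    """Rough clustered estimate (assumption: even bucket balancing):
    the rule-of-thumb total multiplied by the replication factor,
    spread across the peer nodes. Returns GB per node."""
    total = daily_gb * retention_days * compression * replication_factor
    return total / num_indexers

# RF=4 on 4 nodes: every node holds a full copy -> 4.5 TB each
print(clustered_storage_gb(100, 90, 4, 4))  # 4500.0
# RF=2 on 4 nodes, if evenly balanced
print(clustered_storage_gb(100, 90, 2, 4))  # 2250.0
```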