How much RAM do I need to avoid I/O problems?

gabriel_vasseur · ‎06-01-2017

I have noticed a correlation between RAM usage and I/O on my indexers. Whenever RAM usage goes around or above 50%, I/O usage goes crazy. My understanding is it's because spare RAM is used by the OS as I/O cache. We want to add more RAM to the indexers to help but I have two questions:

1) how much RAM is too much? We're thinking of doubling it, but would there be much value for performance and/or future-proofing in quadrupling it instead? (note: we're using Enterprise Security quite extensively)

2) can anybody give me recent official-looking references backing this up? This is required to justify the cost to people who might say "why do you need more memory, you're only using half at the moment?". All I have so far is this and that ancient articles, as well as somebody saying "I'm not from splunk so I can say it: you need as much RAM as you can afford" in the "It Seemed Like a Good Idea at the Time...Architectural Anti-Patterns" talk from conf 2016.

Thank you!

Bselberg · ‎10-04-2019

First, I'll state that nothing I have here is scientific, and I can't publicly give exact details on any specific use case. However, I can say we have looked into this extensively. I'll give short answers for those looking at this, as the answer should hold until Splunk completely reworks the way searching and indexing are handled.

1. How much RAM is too much? The RAM that Splunk the service needs to run can reach 128 GB or more if searches perform enough in-memory operations or if you have enlarged the indexing pipeline queues to smooth out indexing bursts.

Realistically, though, it seems to soft-cap around 45-80 GB (it grows with server uptime beyond 50 days; you should patch and reboot more frequently than that). On top of that, active search RAM is at most around 100-300 MB per search, so 100 concurrent searches can take 30+ GB of RAM (make sure you have the core count to back this up). I assume this also scales with ingestion volume and with the use of more than one indexer pipeline.
Now the more important one is OS caching of paged files. This is RAM that isn't active but is held by the OS as file cache (think Windows Prefetch/SuperFetch and the file mappings shown in RAMMap). There is no limit to this if I/O is a concern. The larger this space is, the more the OS will hold onto, and the less disk I/O for writes will be interrupted by search reads.
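To make the arithmetic above concrete, here is a minimal Python sketch of the "service baseline plus per-search RAM" estimate. The function name and defaults are my own; the defaults simply take the upper ends of the ranges quoted above (80 GB baseline, 300 MB per search), and it uses decimal GB to match the round figures in this post. Treat it as a back-of-the-envelope aid, not a sizing formula from Splunk.

```python
def estimate_splunk_ram_gb(concurrent_searches, mb_per_search=300, base_service_gb=80):
    """Rough estimate of active RAM (GB) for the Splunk service plus running searches.

    Defaults are assumptions taken from the upper ends of the ranges in this post.
    """
    # Decimal GB (1 GB = 1000 MB) to match the round numbers used in the thread.
    search_gb = concurrent_searches * mb_per_search / 1000
    return base_service_gb + search_gb


if __name__ == "__main__":
    # 100 concurrent searches at the upper per-search figure.
    print(estimate_splunk_ram_gb(100))
    # Same concurrency at the lower ends of both ranges.
    print(estimate_splunk_ram_gb(100, mb_per_search=100, base_service_gb=45))
```

Whatever this returns is only the active side; the OS file cache discussed above comes on top of it.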

2. As for official, publicly available documentation: I cannot provide this, but I can show how the answer to 1 holds up for those who read this in the future.

Using reference https://docs.splunk.com/Documentation/Splunk/7.3.1/Capacity/HowsearchtypesaffectSplunkEnterpriseperf... you can note that “Super-sparse” & “Rare” searches are throttled by I/O. 
Next, evaluate your searches. The key thing you are looking for is the _time range of the 80-90% of your searches that are sparse. This matters on an index-by-index basis. What is the average lookback distance for these sparse searches? Is it 1 hour, 1 day, 1 week or 1 month? For this example we will use 1 week as the lookback time for the majority of searches/alerts. We will assume all index 
  This means all bucket data with “_time” values within 1 week of now should at minimum be sitting in your “Hot”/“Warm” storage if your cold storage tier is slower. If all storage tiers have the same I/O speed, then hot/warm/cold doesn’t matter, only OS RAM caching.

Using a query for internal index since everyone has this index:

    | dbinspect index=_internal timeformat=%s
    | eval lookback=relative_time(now(), "-1w@w")
    | where endEpoch >= lookback
    | stats count as totalBuckets min(startEpoch) as oldestEvent sum(rawSize) as rawSize sum(sizeOnDiskMB) as diskSizeMB by index, state
    | eval diskSizeGB=round(diskSizeMB/1024, 2)
    | convert ctime(oldestEvent)

The values you are looking for are those for your lookback time range, e.g. about 30 days. This will then tell you the "diskSizeGB" in hot, warm and cold.
This number plus about 20%, summed across all indexes, is what you want available for the OS to cache. RAM beyond this number is wasted, as you are not running seek/read operations against anything else.
This is only helpful if you have sparse searches. If your alerting is all dense searches, your CPU will hit 100% long before the I/O subsystem becomes the bottleneck. This complex calculation is why they don't publish exact sizing guides: with one alteration to a search pattern, a user can drastically alter the RAM usage profile for OS cache storage.
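The "disk size plus about 20%" rule above can be sketched in a few lines of Python. The function name is mine and the per-state numbers in the example are made up purely for illustration; plug in the diskSizeGB values your own dbinspect search returns, summed across the indexes you care about.

```python
def cache_target_gb(disk_size_gb_by_state, headroom=0.20):
    """OS file-cache target: on-disk size of the search window plus ~20% headroom."""
    total = sum(disk_size_gb_by_state.values())
    return round(total * (1 + headroom), 1)


if __name__ == "__main__":
    # Hypothetical diskSizeGB per bucket state for one indexer (not real data).
    example = {"hot": 40.0, "warm": 160.0, "cold": 0.0}
    print(cache_target_gb(example))
```

This is the cache side only; the RAM the Splunk service itself needs is in addition to it.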

You can test this on a single Windows indexer using RAMMap: clear the standby list after it has cached your search window for sparse data, then run the search and monitor the host I/O profile. Run the same search 1-5 times. Check under the mapped files to ensure they are cached, then run the search again and watch: there should be 5-15% of the original I/O profile. In an I/O-bound system bound by sparse searches, additional RAM can cut search times by 40+%. But the level of tuning required to get these search times is fleeting, since you're not in control of the OS caching. On *nix, use htop and free to watch similar usage patterns.
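If you would rather log the Linux numbers than eyeball them, a small Python sketch like the one below reads the same counters that free reports. It assumes a Linux host with /proc/meminfo, where values are in kB and the "Cached" field is the page cache; the helper names are my own. Run it before and after repeating a sparse search to see whether the cache actually grew.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Most fields look like "Cached:  123456 kB"; a few have no unit.
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def page_cache_gib(meminfo):
    """Size of the OS page cache in GiB, from the 'Cached' field (reported in kB)."""
    return meminfo.get("Cached", 0) / (1024 * 1024)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            info = parse_meminfo(fh.read())
        total_gib = info.get("MemTotal", 0) / (1024 * 1024)
        print(f"page cache: {page_cache_gib(info):.1f} GiB of {total_gib:.1f} GiB")
    except FileNotFoundError:
        print("no /proc/meminfo on this platform")
```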

ddrillic · ‎06-04-2017

Btw, setting the indexers' queues with the proper amount of memory is crucial.

lguinn2 · ‎06-04-2017

Take a look at the documentation, especially if you need a source to quote: Reference Hardware in the Capacity Planning manual is a good starting point. Reading (or at least browsing) the whole manual is a good idea...

In the recommendations for indexers, you will find the following memory sizes:

Basic: 12 GB RAM
Mid-range: 64 GB RAM
High performance: 128 GB RAM

Finally, I contend that caching is a perfectly valid use of memory if it improves overall performance!

DalJeanis · ‎06-04-2017

See there, you told him exactly which M to RTF.

DalJeanis · ‎06-04-2017

You should contact your splunk customer service team, ie your sales rep. You aren't trying to buy anything right now, but they are there to make sure you love the product, and I would bet that they have MILES of use cases and Excel spreadsheets and whatnot to accurately answer your questions. More importantly, they will be able to ask YOU the questions that are significant to ensuring the answer be well-fit to your particular situation and projections.

(And, by the way, if they help you do this right, then your users will be served well, usage of splunk will rise in your business and the splunk company will eventually make more money: win win win.)

I would bet they would ask you a few usage questions, like, How many simultaneous users do you expect: total, how many on each search head? At what rate will this grow? How many servers, what architecture, clustered, replicated, etc? What version of splunk?
What kind of data are you searching through? What percentage of the searches are of underlying detail? What percentage of the searches are accelerated?

And then there are more advanced nudges, like, How well have you reviewed your data models against the usage? How are you making sure that your users know they are not supposed to boil the ocean? How are you making sure they know HOW not to boil the ocean?

Anyway, call splunk for support, and they'll give you the ammo you need to make your decision right.

DalJeanis · ‎06-04-2017

By the way, "using half your RAM" isn't a thing. It's all being used for something.

Also, "too much RAM" isn't a thing either. I remember back in the dark ages, when the Internet was made up of two tin cans and a piece of string, and a PC came out with 256K RAM. I wondered and joked with my programmer friends why the PC designers were being so ridiculous and how they could ever use up all that space...

However, with regard to right-sizing your RAM, chat with customer support and compare the cost/benefit of more RAM with the cost/benefit of more servers. I'd expect if there's any doubt which one to do, you should do the servers.
