First, I'll state up front that nothing I have here is scientific, and I can't publicly share exact details of any specific use case. However, I can say we have looked into this extensively. I'll keep the answers short for future readers, since they should hold until Splunk completely reworks the way searching and indexing are handled.
How much RAM is too much? The amount of RAM the Splunk service itself needs can reach 128 GB+ if your searches do enough in-memory operations, or if you have enlarged the indexing pipeline queues to smooth out ingestion bursts.
Realistically, though, it seems to soft-cap around 45-80 GB (it grows with server uptime past 50+ days; you should be patching and rebooting more often than that). On top of that, active search RAM runs at most around 100-300 MB per search, so 100 concurrent searches can consume 30+ GB of RAM (make sure you have the core counts to back this up; a sketch for measuring your own concurrency and per-search memory is just below). I assume this also scales with ingestion volume and with the use of more than one indexer pipeline.
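If you want to sanity-check those two numbers on your own deployment rather than trust mine, something like the following should get you close. The field names (active_hist_searches, data.mem_used, data.search_props.sid) come from the metrics.log and _introspection schemas and can vary by version, so treat this as a sketch, not gospel.
Peak historical search concurrency over the last day:
index=_internal source=*metrics.log* group=search_concurrency "system total" earliest=-24h
| timechart span=10m max(active_hist_searches) as peak_concurrent_searches
Typical peak memory per search process (data.mem_used is in MB):
index=_introspection component=PerProcess data.search_props.sid=* earliest=-24h
| stats max(data.mem_used) as peak_mem_mb by data.search_props.sid
| stats avg(peak_mem_mb) as avg_mem_mb perc95(peak_mem_mb) as p95_mem_mb
Multiply your peak concurrency by the p95 per-search memory and you have a working estimate for active search RAM.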
Now the more important one: OS caching of mapped files. This is RAM that isn't active but is held by the OS as file cache (think Windows Prefetch/Superfetch, or the file mappings visible in RAMmap). If IO is a concern, there is effectively no such thing as too much of this. The larger this space, the more bucket data the OS holds onto, and the less your disk writes get interrupted by search reads.
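For a rough view of how much physical RAM each host has left over for the OS file cache, _introspection can approximate it. This assumes the Hostwide component's data.mem and data.mem_used fields (both in MB), and be aware that on Linux mem_used may already count some cache, so treat this as a rough indicator only:
index=_introspection component=Hostwide earliest=-4h
| eval cache_candidate_mb='data.mem'-'data.mem_used'
| timechart span=5m avg(cache_candidate_mb) by host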
2. As for official, publicly available documentation, I can't provide any, but I can show future readers how to validate 1.a for themselves.
Using the reference https://docs.splunk.com/Documentation/Splunk/7.3.1/Capacity/HowsearchtypesaffectSplunkEnterpriseperformance you can note that “Super-sparse” and “Rare” searches are throttled by I/O.
Next, evaluate your searches. The key thing you are looking for is the _time range covering 80-90% of your sparse searches, and this matters on an index-by-index basis. What is the average lookback distance for these sparse searches? Is it 1 hour, 1 day, 1 week, or 1 month? For this example we will use 1 week as the majority lookback for searches/alerts, and we will assume all indexes share that pattern. A rough way to measure your actual lookback distribution follows.
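Rather than guessing, the audit index records the earliest epoch of each completed search. This sketch assumes the api_et field (it reads "N/A" for all-time searches, which are excluded here); exact field names can differ slightly by version:
index=_audit action=search info=completed api_et=* api_et!="N/A" earliest=-7d
| eval lookback_days=(_time - api_et)/86400
| where lookback_days >= 0
| stats count median(lookback_days) as median_days perc90(lookback_days) as p90_days by savedsearch_name
| sort - count
If p90_days sits around 7, the 1-week window used in this example is the right target for you.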
This means all bucket data with _time values within 1 week of now should, at a minimum, be sitting in your hot/warm storage if your cold tier is slower. If all storage tiers have the same IO speed, then hot/warm/cold placement doesn't matter; only OS RAM caching does.
Here is a query against the _internal index, since everyone has that index:
| dbinspect index=_internal timeformat=%s
| eval lookback=relative_time(now(), "-1w@w")
| where endEpoch > lookback
| stats count as totalBuckets min(startEpoch) as oldestEvent sum(rawSize) as rawSize sum(sizeOnDiskMB) as diskSizeMB by index, state
| eval diskSizeGB=round(diskSizeMB/1024,1)
| convert ctime(oldestEvent)
Because buckets span wide time ranges, a 1-week event window will usually pull in buckets covering a larger span, often up to about 30 days, which the oldestEvent column will show. The output then tells you "diskSizeGB" broken out by index and bucket state (Hot, Warm, Cold).
This number plus about 20% headroom, totaled across all indexes, is what you are looking to have available for the OS to cache. RAM beyond that number is a waste, as you are not running seek/read operations against data outside your search window. A one-shot rollup across all indexes follows.
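To get that total in one shot, the same dbinspect approach works with index=* (careful: this can be slow on large deployments). The 1.2 multiplier is the ~20% headroom mentioned above:
| dbinspect index=* timeformat=%s
| eval lookback=relative_time(now(), "-1w@w")
| where endEpoch > lookback
| stats sum(sizeOnDiskMB) as diskSizeMB
| eval cacheTargetGB=round(diskSizeMB/1024*1.2, 1)
The result, cacheTargetGB, is roughly the RAM you want free for OS cache across the tier; divide by your indexer count (assuming even data distribution) for a per-host target.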
This is only helpful if you have sparse searches. If your alerting is all dense searches, your CPU will hit 100% long before the IO subsystem becomes the bottleneck. This messy calculation is why Splunk doesn't publish exact sizing guides: one alteration to a search pattern by one user can drastically change the RAM profile needed for OS cache. If you aren't sure how dense your searches are, see the rough classifier below.
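The audit log's scanned-vs-matched counts give a rough density classifier. The cut-offs below loosely follow the definitions in the capacity doc linked above (dense: roughly 1 match per 100 scanned events or better; sparse: down to roughly 1 per million), so adjust them to taste:
index=_audit action=search info=completed earliest=-7d
| where scan_count > 0
| eval match_ratio=event_count/scan_count
| eval density=case(match_ratio>=0.01, "dense", match_ratio>=0.000001, "sparse", true(), "super-sparse/rare")
| stats count by density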
You can test this on a single Windows indexer using RAMmap: clear the standby list once it holds your sparse-search window, then run the search while monitoring the host IO profile. Run the same search 1-5 times, check under Mapped Files to confirm the buckets are cached, then run the search once more and watch the IO drop to roughly 5-15% of the original profile. On a system that is IO-bound by sparse searches, additional RAM can cut search times by 40%+. But the level of tuning required to hold those search times is fleeting, since you are not in control of the OS caching. On *nix, use htop and free to watch similar usage patterns. You can also watch the same effect from Splunk's own instrumentation, as in the sketch below.
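This assumes the _introspection IOStats component and its data.reads_ps field (exact field names vary by version), with idx01 standing in for your test indexer:
index=_introspection component=IOStats host=idx01 earliest=-1h
| timechart span=1m avg(data.reads_ps) by data.mount_point
The first run should show a read spike; once the buckets sit in the standby list or page cache, the repeat runs should barely register.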