On the indexers we have 64 GB of RAM.
We have the following configurations -
[queue=AEQ]
maxSize = 200MB
[queue=parsingQueue]
maxSize = 3600MB
[queue=indexQueue]
maxSize = 4000MB
[queue=typingQueue]
maxSize = 2100MB
[queue=aggQueue]
maxSize = 3500MB
So, the processing queues together can consume up to 13.4 GB, and currently we are at 100% for all the queues. We wonder how high we can set them while leaving enough RAM for the Splunk processes.
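(That 13.4 GB is simply the sum of the maxSize values above: 200 + 3600 + 4000 + 2100 + 3500 = 13,400 MB.)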
The servers are fully dedicated to Splunk...
Make your parsing more efficient by explicitly setting the timestamp extraction and line-breaking behavior in props.conf on the indexers.
example:
TIME_FORMAT = %d.%m.%Y %H:%M:%S.%3N
TIME_PREFIX = ^
MAX_TIMESTAMP_LOOKAHEAD = 23
LINE_BREAKER = ([\r\n]+)(?:\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2}:\d{2}\.\d{3}\s*\w+)
SHOULD_LINEMERGE = false
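For reference, these settings assume events that begin with a timestamp in exactly this layout; a hypothetical matching line would be:

05.11.2017 14:23:05.123 INFO connection established from 10.1.2.3

TIME_PREFIX = ^ reads the timestamp from the very start of the event, MAX_TIMESTAMP_LOOKAHEAD = 23 covers exactly the 23 characters of dd.mm.yyyy HH:MM:SS.mmm, and SHOULD_LINEMERGE = false with an explicit LINE_BREAKER skips the expensive line-merging pass. If your events can legitimately span multiple lines, adjust the LINE_BREAKER rather than turning line merging back on.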
That you're at 100% for all queues suggests a fundamental problem with your architecture. It implies that you're not able to write to storage as fast as data is coming in.
You need either faster storage, or better distribution of your inputs, but not bigger queues. Queues are best used for dealing with unpredictable ingestion rates (they can handle volume spikes for you), but they cannot help you if your overall rate is overwhelming your throughput capacity.
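One way to see where the backpressure actually starts (a generic metrics.log search, adjust the filters to your own indexers) is to chart the queue fill ratios that the indexers log about themselves:

index=_internal sourcetype=splunkd source=*metrics.log* group=queue
| eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
| timechart span=1m perc95(fill_pct) by name

The lowest queue in the pipeline that is still pegged at 100% (typically indexqueue when disk is the bottleneck) is where the real constraint sits; everything upstream of it fills up as a consequence.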
-- ... best used for dealing with unpredictable ingestion rates...
That's pretty much it, but really, it's not so much unpredictable ingestion rates as ingestion rates that vary greatly throughout the business day. Increasing the queues has been helping out over the past year or so to handle the peak usage time.
So, the indexers have 64 GB of RAM and the queues, at the moment, add up to 13.4 GB. How high can they go?
I'm declining to answer your specific question because I think it's the wrong question to be asking in this case. You should really be looking at balancing that load across more indexers.
no worries ;-)
Just curious: is this based on a lab or production environment? Your queue size and fill ratio implies indexing latency of several minutes which I would already consider excessive. How much incoming data is each indexer handling and what problem are you trying to solve with more queue?
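As a rough back-of-the-envelope (the 20 MB/s per indexer is a purely assumed figure, substitute your own rate): 13,400 MB of full queue divided by 20 MB/s is about 670 seconds, i.e. roughly 11 minutes between an event arriving and it being written to disk and becoming searchable.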
-- indexing latency of several minutes
This is just fine. We can handle indexing latency of several minutes - nobody will get hurt ...
We "just" want to pass the peak usage time safely.
I'm intrigued by your environment! Seems safe to say that you're getting your money's worth out of your servers 🙂
I would just point out that I've seen apps run indexers completely out of memory; I'm guessing you aren't using useACK at these volumes, so I'd be concerned about potential data loss. I was also going to comment that you're sacrificing your file system cache for queue, but with so much churn I wonder how long you can keep the cache around anyway.
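(If you ever do enable it, useACK is set on the forwarders in outputs.conf; the group name and servers below are just placeholders:

[tcpout:primary_indexers]
server = indexer1.example.com:9997,indexer2.example.com:9997
useACK = true

Bear in mind it keeps a wait queue on the forwarder, so it adds some memory and throughput overhead there.)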
If you're prioritizing indexing over search performance (considering that the latter benefits from large vfs cache), why not go to a nice round number like 50% of system memory, or 32 GB? We run default queue sizes and the largest splunkd indexer process I see at the moment uses less than 2 GB of physical memory. Most of these indexers have only 32 GB of total memory and they're solid. If something's going to burn you I think it will be a runaway search and/or excessive search concurrency, not the indexing process itself.
Thanks for bringing up this thought-provoking question!!
Interesting, so you are saying that the 13.4 GB for the queues can grow all the way to 32 GB! Wow. I wonder if any of this is documented... meaning, the proper use of memory on the indexers.
Right, we don't use useACK, as we didn't want to add to the load, and we are truly not that worried about data loss, at least for now.
We simply can't write fast enough to disk at peak usage time -
I'm thinking about doubling the size of the index queue to 8 GB. Not sure about the proportions across the queues...
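For example, keeping the current proportions and simply doubling everything would look something like this (about 26.8 GB in total; the numbers are just my current values times two):

[queue=AEQ]
maxSize = 400MB
[queue=parsingQueue]
maxSize = 7200MB
[queue=indexQueue]
maxSize = 8000MB
[queue=typingQueue]
maxSize = 4200MB
[queue=aggQueue]
maxSize = 7000MB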
Oh yeah, on 64-bit I have no reason to believe that you're going to be arbitrarily limited on queue size. In my mind, there are three good reasons to keep large amounts of free memory:
1. 32 GB is plenty of system memory to leave for Splunk itself; again, most of our indexers have that much total memory. Consider page 46 of the following presentation, where memory utilization is measured during high-load indexing and search; Splunk indexing just doesn't seem to take a lot of memory:
https://conf.splunk.com/files/2016/slides/harnessing-performance-and-scalability-with-parallelizatio...
2. I probably sound like a broken record, but under "normal" conditions the best place for your system memory is the vfs cache. Consider this presentation starting at 29:30 for a discussion of how total system memory affects IOPS in a production environment:
http://conf.splunk.com/files/2016/recordings/it-seemed-like-a-good-idea-at-the-time-architectural-an...
3. If you're certain that you're I/O-bound, you have a non-zero search workload, and you're giving all your memory to queues, you might end up hitting your storage even harder than you are already (requiring more queue and ultimately not solving any problems). Are you certain that you're not CPU-bound and wouldn't benefit from additional indexing pipelines (see the sketch below)?
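For reference, additional pipelines are enabled in server.conf on each indexer; they roughly multiply the CPU and memory used by the ingestion path, so they only help if you have idle cores and your storage can keep up (the value below is just an example):

[general]
parallelIngestionPipelines = 2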
This entire conversation assumes that there's a technical reason that you can't just let your forwarders block. Is your indexing latency that much worse if you just leave the default queue size in place? Is there data on the forwarders that will be lost if you don't forward it fast enough? Still fascinated by your situation 🙂
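As an aside, if you do experiment with letting the forwarders block, the forwarder-side output queue can also be enlarged in outputs.conf so that brief indexer-side stalls are absorbed there instead (the value here is purely illustrative):

[tcpout]
maxQueueSize = 512MB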