We're currently running Splunk Enterprise on AWS EC2 as a single-instance deployment. We have roughly 10,000 forwarders pushing about 90 GB of logs per day to the instance, of which we index about 13 GB. We are experiencing a problem in the indexing pipeline that pushes Splunk outside our acceptable latency thresholds, and it never recovers.
We're clearly dropping over three times the log volume that we index, and we've identified the bottleneck as the pre-indexing phase of the pipeline. I'm looking for a recommendation on an expanded infrastructure that would handle this kind of volume indefinitely and reduce our single points of failure, if possible.
I can't change the configuration on the forwarders; it's pre-baked on the systems. I can restart them if absolutely necessary.
This is a production system used by our on-call engineers and downtime for the system leaves us blind to some customer problems.
My research suggests the recommended architecture is multiple heavy forwarders behind a round-robin DNS entry, handling the data transformations and pushing the transformed but not yet indexed logs to the current instance. We could then convert the indexer to a clustered setup and break out the search head. My research also suggests that the forwarders keep the IP they're forwarding to in memory, so they'd need to be restarted or they'll continue to send logs directly to the indexer.
Can anyone confirm/deny this approach will work? Are there other options I should consider?
As a quick measure you could configure a second indexing pipeline and, if necessary to sustain the second pipeline, add more cores to your EC2 box.
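For reference, a second pipeline set is switched on in server.conf on the indexer. This is a minimal sketch, assuming a version that supports multiple pipeline sets (6.6.x does) and enough spare CPU to feed the extra pipeline; the file path shown is just one common location.

```ini
# $SPLUNK_HOME/etc/system/local/server.conf on the indexer
[general]
# Default is 1. Each pipeline set runs its own parsing, merging,
# typing and index pipelines, so it needs its own share of cores.
parallelIngestionPipelines = 2
```

The setting takes effect after a splunkd restart, so plan it for a quiet window.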
In your monitoring console, indexing performance view, cpu per splunk processor panel - what processor is using the most?
What version are you on?
Do you have a deployment server the forwarders are talking to?
If not, did you bake an IP or a hostname into the forwarders' outputs.conf?
13G/day indexed, 90G/day incoming is usually handleable with one indexer. 10k UFs can present an issue wrt handling all those connections.
You mentioned SPOF - in splunk terms, that means indexer cluster.
I'd stand up two or three 24+core 64GB indexers in a replicating cluster, with a smaller-sized master to orchestrate them and a 12-24-core 32-64GB search head, depending on what you do with the data.
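A rough sketch of what that cluster looks like in server.conf, assuming 6.6.x-era setting names (mode = master / slave). The hostname, replication port, and factors are placeholders for illustration, not recommendations for your data.

```ini
# --- server.conf on the cluster master ---
[clustering]
mode = master
replication_factor = 2
search_factor = 2
pass4SymmKey = <shared-secret>

# --- server.conf on each peer indexer ---
[replication_port://9887]

[clustering]
mode = slave
# cm.example.internal is a hypothetical hostname for the master
master_uri = https://cm.example.internal:8089
pass4SymmKey = <shared-secret>

# --- server.conf on the search head ---
[clustering]
mode = searchhead
master_uri = https://cm.example.internal:8089
pass4SymmKey = <shared-secret>
```

The three sections belong to three different machines; they are shown together only to make the relationship between the roles visible.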
I wouldn't introduce an HF layer; spend the resources on getting more and/or bigger indexers instead.
All numbers are guesstimates because I don't know your data, use cases, requirements, etc.
Since you're already in AWS, have you considered splunk cloud? Let cloud ops worry about handling your 10k forwarders 😄
The aggregator can, in most cases, be optimized away entirely.
Instead of using SHOULD_LINEMERGE=true for your sourcetypes and letting Splunk merge based on timestamps, set a LINE_BREAKER that breaks your events correctly in the first place and then turn off line merging. That way your aggregator doesn't need to do anything at all.
If you're not using the punct field, you can turn off ANNOTATE_PUNCT to take some load off the annotator.
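As a sketch of both suggestions in props.conf: the sourcetype name is a placeholder, and the LINE_BREAKER regex assumes every event starts with a date like 2017-10-02, which may not match your data.

```ini
# props.conf on the indexer
# "my_app_logs" is a hypothetical sourcetype name
[my_app_logs]
# Break events directly at the line breaker, no timestamp-based merging
SHOULD_LINEMERGE = false
# The first capture group is consumed as the event boundary;
# the lookahead assumes each event begins with YYYY-MM-DD
LINE_BREAKER = ([\r\n]+)(?=\d{4}-\d{2}-\d{2})
# Skip generating the punct field if nothing searches on it
ANNOTATE_PUNCT = false
```

Index-time settings like these apply only to data indexed after the change and a restart, so test them against a sample of each sourcetype first.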
Regexreplacement is what does your nullQueue filtering, so it's expected to take some significant load. Depending on your data and the regexes you use, there may or may not be room for improvement.
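For anyone following along, nullQueue filtering is usually a props.conf / transforms.conf pair like the sketch below. The sourcetype, stanza names, and regex are made up for illustration; the general point is that a regex anchored at the start of the event tends to be cheaper than an unanchored one with leading wildcards.

```ini
# props.conf
# "my_app_logs" and "drop_debug_events" are hypothetical names
[my_app_logs]
TRANSFORMS-drop_debug = drop_debug_events

# transforms.conf
[drop_debug_events]
# Anchored at the start of the event; assumes "<date> <time> DEBUG ..." lines
REGEX = ^\d{4}-\d{2}-\d{2}\s\S+\sDEBUG\s
DEST_KEY = queue
FORMAT = nullQueue
```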
What do the processors, not the queues, say?
How many indexing pipelines are you using?
Why are you restricted to only 32 cores?
Are you using all 32 cores already?
You can upgrade your Splunk server without upgrading the forwarders, but 6.6.x is pretty new already - for example, 6.6.x includes multiple pipeline sets.
We're running one indexing pipeline, so adding a few more looks like a way we can maximize throughput and provide a fast lane for time-sensitive data.
For seconds spent per index processor activity, we're spending significantly more time writing to disk than indexing.
With regard to CPU usage per Processor, the aggregator is by far the biggest, second would be the annotator and regexreplacement, and then occasional (once a week?) heavy usage by the header.
I've got a stop-gap measure in place that keeps things within our latency thresholds for now, and we're at the max number of cores available to us in EC2, running 32 cores and 60 GB of memory.
As for version, we're on 6.6.3 and can't upgrade to 7 because we won't upgrade the old forwarders on existing systems (lack of willingness, not lack of technical means).
We do not have a deployment server and are baking the hostname in the outputs.conf.
Before the stop-gap, the Parsing, Merging, and Typing pipelines were at 100% and indexing was at 0%.