Splunk Search

Slow Indexer Problems

Jarohnimo
Builder

I have a 2 TB indexer with 12 CPUs and 12 GB of memory. We didn't get a chance to have a say in the storage tier, and I imagine we have the slowest storage imaginable. Definitely not the 800 IOPS Splunk requires.

I'm noticing 20-second spikes all the time in my indexQueue, and spikes at times in my parsingQueue as well.

Can someone tell me whether these problems could cause data loss, or things such as alerts not firing? I'm noticing now that some of the results I look for in Splunk are not there, even though Splunk is monitoring those logs, and it takes a LONG time for Splunk to show that the info is in there.

I'm pretty sure the answer is to get Splunk on faster storage... I would just like someone to explain the queues. Thanks.
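As a rough way to quantify that delay, you can compare each event's index time with its event time. This is a minimal SPL sketch, where your_index and the 60-minute window are placeholders rather than values from this setup:

index=your_index earliest=-60m
| eval latency_sec = _indextime - _time
| stats avg(latency_sec) AS avg_latency max(latency_sec) AS max_latency perc95(latency_sec) AS p95_latency by sourcetype

A consistently large max_latency for monitored sourcetypes suggests the events are arriving but sitting in queues before they become searchable.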

0 Karma

Jarohnimo
Builder

Roughly 160 GB a day. I definitely don't want to adjust the queues if fixing the storage is the real issue.

I'd imagine the current queues would fare well with 3,000 IOPS versus the 80 IOPS we have now, according to Splunk.

0 Karma

somesoni2
SplunkTrust
SplunkTrust

Quick facts: when data comes to Splunk from forwarders, it passes through various queues that handle the different index-time operations (great explanation here: http://docs.splunk.com/Documentation/Splunk/6.6.1/Indexer/Howindexingworks). A spike in a queue means the operation at that stage is not keeping up.
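One way to see how full those queues are getting is to read the queue metrics that splunkd writes to metrics.log. A sketch, assuming you have access to the _internal index on the indexer:

index=_internal source=*metrics.log* group=queue (name=parsingqueue OR name=aggqueue OR name=typingqueue OR name=indexqueue)
| eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
| timechart span=5m perc95(fill_pct) by name

A queue that regularly sits near 100% is the bottleneck; everything upstream of it backs up, which is why a slow indexqueue (disk writes) can eventually show up as parsingqueue spikes too.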

0 Karma

Jarohnimo
Builder

I ran a report to show me the queues, and I see some queues go up to 90. I assume this is 90 seconds of wait time... Horrid!!

Can anyone with experience tell me whether this level of disk latency can cause alerts to never fire?

I noticed that after I reboot my box, it will fire off alerts... for a little while...

0 Karma

somesoni2
SplunkTrust
SplunkTrust

With slower disk, there will be latency before data becomes searchable. Until it becomes searchable, it isn't picked up by your alert searches, so they don't fire (or they fire falsely, depending on your alert conditions). Since things work again after you restart Splunk/the server, I would look into more efficient event-parsing configurations, and also make sure people are writing efficient searches. With bad searches you have fewer resources available for alerts, which, by the way, have lower priority than ad-hoc searches and will get skipped.
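If you want to see whether scheduled searches are actually being skipped, the scheduler logs this in _internal. A minimal sketch (the field names below come from the standard scheduler log; check them against your own data):

index=_internal sourcetype=scheduler status=skipped
| stats count by savedsearch_name reason
| sort - count

The reason field usually tells you whether the skip was due to hitting the concurrent-search limit, which is the "fewer resources available for alerts" situation described above.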

0 Karma

Jarohnimo
Builder

Thanks

Definitely makes sense. For that reason, I guess it would make sense to isolate alerts and scripts on one box and use another box for ad-hoc searches?
To me, the overarching issue is the index queue. It's very high: the parsing queue sometimes, but always the index queue. We have very slow storage, it seems. At one point it worked great, but the more alerts and searches we set up, the more stuff doesn't work, and now this index queue seems to be the biggest issue.

I have a 3-server setup:

1 indexer: 12 CPUs, 24 GB of RAM, 2 TB of slow disk

2 search heads: 16 CPUs, 12 GB of RAM... disk, whatever... lol

This is my setup. We only have 5 or 6 people searching; most of our load is from ITSI or scheduled searches and alerts. Perhaps we are stressing the system too much with the number of searches or their efficiency, but I truly believe that if we had good IOPS on our indexer, it could process these requests much faster and alerts would fire easily.

0 Karma

ddrillic
Ultra Champion

Please keep in mind that the default queue sizes are tiny. If we look at part of the default server.conf, we see very small queues -

[queue=WEVT]
maxSize = 5MB
# look back time in minutes
cntr_1_lookback_time = 60s
cntr_2_lookback_time = 600s
cntr_3_lookback_time = 900s
# sampling frequency is the same for all the counters of a particular queue
# and defaults to 1 sec
sampling_interval = 1s

[queue=aggQueue]
maxSize = 1MB
# look back time in minutes
cntr_1_lookback_time = 60s
cntr_2_lookback_time = 600s
cntr_3_lookback_time = 900s
# sampling frequency is the same for all the counters of a particular queue
# and defaults to 1 sec
sampling_interval = 1s

[queue=parsingQueue]
maxSize = 6MB
# look back time in minutes
cntr_1_lookback_time = 60s
cntr_2_lookback_time = 600s
cntr_3_lookback_time = 900s
# sampling frequency is the same for all the counters of a particular queue
# and defaults to 1 sec
sampling_interval = 1s

[queue=vixQueue]
maxSize = 8MB

After many iterations of indexer crashes, we ended up with the following in local -

[queue=AEQ]
maxSize = 200MB

[queue=parsingQueue]
# Default maxSize = 6MB
maxSize = 3600MB

[queue=indexQueue]
maxSize = 4000MB

[queue=typingQueue]
maxSize = 2100MB

[queue=aggQueue]
# Default maxSize = 1MB
maxSize = 3500MB

[diskUsage]
minFreeSpace = 2000

So, parsingQueue moved from 6 MB to 3600 MB!!!! Interestingly, it's our responsibility to make that change.
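For anyone comparing before and after a change like this, one way to check whether the queues are actually blocking is to count the blocked events that splunkd writes to metrics.log. A sketch, assuming access to the _internal index:

index=_internal source=*metrics.log* group=queue blocked=true
| timechart span=10m count by name

If the blocked counts disappear after the resize (or after a storage upgrade), the change did its job; if they just shift to a different queue name, the bottleneck has moved rather than gone away.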

0 Karma

gjanders
SplunkTrust
SplunkTrust

Did you find much difference pre/post tuning?
I've seen minimal difference with queue size changes.

In terms of what the original question was about, how much data per day is being ingested?

0 Karma

ddrillic
Ultra Champion

In our case, the difference was between a system which constantly crashed and one that was perfectly stable.

0 Karma