Splunk Search

Slow Indexer Problems

Jarohnimo
Builder

I have a 2 TB indexer with 12 CPUs and 12 GB of memory. We didn't get a chance to have a say in the storage tier, and I imagine we have the slowest storage imaginable. Definitely not the 800 IOPS Splunk requires.

I'm noticing 20-second spikes all the time in my indexQueue, and spikes at times in my parsingQueue as well.

Can someone tell me whether these problems could cause data loss, or things such as alerts not firing? I'm noticing now that some of the results I look for in Splunk are not there, even though Splunk is monitoring those logs, and it takes a LONG time for Splunk to show that the info is in there.

I'm pretty sure the answer is to get Splunk on faster storage... I would just like someone to explain the queues. Thanks.
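As a rough way to quantify that delay, you can compare each event's index time with its event time. This is a minimal SPL sketch, where your_index and the 60-minute window are placeholders rather than values from this setup:

index=your_index earliest=-60m
| eval latency_sec = _indextime - _time
| stats avg(latency_sec) AS avg_latency max(latency_sec) AS max_latency perc95(latency_sec) AS p95_latency by sourcetype

A consistently large max_latency for monitored sourcetypes suggests the events are arriving but sitting in queues before they become searchable.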

0 Karma

Jarohnimo
Builder

Roughly 160 GB a day. I definitely don't want to adjust the queues if fixing the storage is the real issue.

I'd imagine the current queues would fare well with 3,000 IOPS versus the 80 IOPS we have now, according to Splunk.

0 Karma

somesoni2
SplunkTrust
SplunkTrust

Quick facts: when data comes to Splunk from forwarders, it passes through various queues that handle the different index-time operations (great explanation here: http://docs.splunk.com/Documentation/Splunk/6.6.1/Indexer/Howindexingworks). A spike in a queue means the operation at that stage is not keeping up.
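One way to see how full those queues are getting is to read the queue metrics that splunkd writes to metrics.log. A sketch, assuming you have access to the _internal index on the indexer:

index=_internal source=*metrics.log* group=queue (name=parsingqueue OR name=aggqueue OR name=typingqueue OR name=indexqueue)
| eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
| timechart span=5m perc95(fill_pct) by name

A queue that regularly sits near 100% is the bottleneck; everything upstream of it backs up, which is why a slow indexqueue (disk writes) can eventually show up as parsingqueue spikes too.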

0 Karma

Jarohnimo
Builder

I ran a report to show me the queues, and I see some queues go up to 90. I assume this is 90 seconds of wait time... Horrid!!

Can anyone with experience tell me whether this level of disk latency can cause alerts to never fire?

I noticed that after I reboot my box, it will fire off alerts... for a little while...

0 Karma

somesoni2
SplunkTrust
SplunkTrust

With slower disk, there will be latency before data becomes searchable. Until it becomes searchable, it isn't picked up by your alert searches, so they don't fire (or they fire falsely, depending on your alert conditions). Since things work again after you restart Splunk/the server, I would look into more efficient event-parsing configurations, and also make sure people are writing efficient searches. With bad searches you have fewer resources available for alerts, which, by the way, have lower priority than ad-hoc searches and will get skipped.
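If you want to see whether scheduled searches are actually being skipped, the scheduler logs this in _internal. A minimal sketch (the field names below come from the standard scheduler log; check them against your own data):

index=_internal sourcetype=scheduler status=skipped
| stats count by savedsearch_name reason
| sort - count

The reason field usually tells you whether the skip was due to hitting the concurrent-search limit, which is the "fewer resources available for alerts" situation described above.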

0 Karma

Jarohnimo
Builder

Thanks

Definitely makes sense. For that reason, I guess it would make sense to isolate alerts and scripts on one box and use another box for ad-hoc searches?
To me, the overarching issue is the index queue. It's very high: the parsing queue sometimes, but always the index queue. We have very slow storage, it seems. At one point it worked great, but the more alerts and searches we set up, the more stuff doesn't work, and now this index queue seems to be the biggest issue.

I have a 3-server setup:

1 indexer: 12 CPUs, 24 GB of RAM, 2 TB of slow disk

2 search heads: 16 CPUs, 12 GB of RAM... disk, whatever... lol

This is my setup. We only have 5 or 6 people searching; most of our load is from ITSI or scheduled searches and alerts. Perhaps we are stressing the system too much with the number of searches or their efficiency, but I truly believe that if we had good IOPS on our indexer, it could process these requests much faster and alerts would fire easily.

0 Karma

ddrillic
Ultra Champion

Please keep in mind that the default queue sizes are tiny. If we look at part of the default server.conf, we see very small queues -

[queue=WEVT]
maxSize = 5MB
# look back time in minutes
cntr_1_lookback_time = 60s
cntr_2_lookback_time = 600s
cntr_3_lookback_time = 900s
# sampling frequency is the same for all the counters of a particular queue
# and defaults to 1 sec
sampling_interval = 1s

[queue=aggQueue]
maxSize = 1MB
# look back time in minutes
cntr_1_lookback_time = 60s
cntr_2_lookback_time = 600s
cntr_3_lookback_time = 900s
# sampling frequency is the same for all the counters of a particular queue
# and defaults to 1 sec
sampling_interval = 1s

[queue=parsingQueue]
maxSize = 6MB
# look back time in minutes
cntr_1_lookback_time = 60s
cntr_2_lookback_time = 600s
cntr_3_lookback_time = 900s
# sampling frequency is the same for all the counters of a particular queue
# and defaults to 1 sec
sampling_interval = 1s

[queue=vixQueue]
maxSize = 8MB

After many iterations of indexer crashes, we ended up with the following in local -

[queue=AEQ]
maxSize = 200MB

[queue=parsingQueue]
# Default maxSize = 6MB
maxSize = 3600MB

[queue=indexQueue]
maxSize = 4000MB

[queue=typingQueue]
maxSize = 2100MB

[queue=aggQueue]
# Default maxSize = 1MB
maxSize = 3500MB

[diskUsage]
minFreeSpace = 2000

So, parsingQueue moved from 6 MB to 3600 MB!!!! Interestingly, it's our responsibility to make that change.
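For anyone comparing before and after a change like this, one way to check whether the queues are actually blocking is to count the blocked events that splunkd writes to metrics.log. A sketch, assuming access to the _internal index:

index=_internal source=*metrics.log* group=queue blocked=true
| timechart span=10m count by name

If the blocked counts disappear after the resize (or after a storage upgrade), the change did its job; if they just shift to a different queue name, the bottleneck has moved rather than gone away.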

0 Karma

gjanders
SplunkTrust
SplunkTrust

Did you find much difference pre/post tuning?
I've seen minimal difference with queue size changes.

In terms of what the original question was about, how much data per day is being ingested?

0 Karma

ddrillic
Ultra Champion

In our case, the difference was between a system which constantly crashed and one that was perfectly stable.

0 Karma