
Why is the parsingQueue blocking on only one server?

sochsenbein
Communicator

Out of 19 Windows servers running the same services, there is one server that keeps blocking at the parsingQueue. I have increased its queue size to 30MB while the others remain under 10MB, but it keeps blocking.
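To confirm the 30MB change actually took effect (and to see what the other 18 are set to), I believe the configured maximum can be pulled out of the same metrics events with something like this (assuming max_size_kb is the right field):

index=_internal host=<server_name>* group=queue name=parsingqueue | stats latest(max_size_kb) as max_size_kb by host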

I ran the following search to check how many events hit each server and found that the counts are even across all of them:

index=_internal host=<server_name>* group=queue name=parsingqueue | timechart span=60m limit=0 count by host

Next, I ran this search to check the size of the queue and found that while the rest of the servers are at around 1,000, the server that is blocking is above 70K!

index=_internal host=<server_name>* group=queue name=parsingqueue | timechart span=60m limit=0 sum(current_size) by host

They are all running with the same system specs: 64-bit, 3.07GHz, 12 cores (6 per CPU), and 96 GB of RAM. They all have plenty of disk space as well. Is there a way to check the fill/drain rate for the queue? Also, I am not sure how far I can increase the queue size before it becomes dangerous. Is there anything else I have missed?
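For what it's worth, the closest I have gotten to a fill rate myself is charting how full the queue is relative to its max, using the current_size_kb and max_size_kb fields from the same metrics events (assuming I am reading those fields correctly):

index=_internal host=<server_name>* group=queue name=parsingqueue | eval fill_pct=round(100*current_size_kb/max_size_kb,1) | timechart span=60m avg(fill_pct) by host

Swapping avg(fill_pct) for count(eval(blocked="true")) as blocked_count also shows how often each host reports the queue as blocked.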

TIA,

Skyler


DalJeanis
SplunkTrust

To use a metaphor someone else used recently on Slack: when your toilet is backing up, you do not increase the size of the bathroom... you find out why it is plugged. Glad you're here to check on that, and please accept my appreciation of your excellent write-up, which covered the first three things that I'd look at.

In fact, two of the suggestions I'm about to make are basically saying, look one level deeper at the same place you've already looked...

Here are some triage steps that I would be taking ...

1) Verify that the problem indexer has the EXACT same configuration as all the others. (You've listed the specs, but literally think of every setting you can and check them all.) Specifically, don't look at just the entire disk drive; look at the volumes specifically allocated to Splunk.
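For the spec and config side of that, something like this from a search head can line the boxes up side by side (aircode; those are the field names I remember from the server/info REST endpoint, so verify them before trusting the output):

 | rest splunk_server=* /services/server/info
 | table splunk_server version os_name cpu_arch numberOfCores physicalMemoryMB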

2) Search and see if there is some source, host, sourcetype, etc. that is aiming at that indexer and not at any other indexer (this is the FIRST thing I'd check). So we're not talking about just how many events, but what KIND of events...

3) Check the total number of bytes being indexed by each indexer, and see if that one is way off from the others. If it is NOT, then number 2 above MUST apply... something is sending preferentially to that indexer. If it IS, then you know it is something regarding the indexer itself. Figuring out what it is may be as simple as this (aircode):

 | tstats count as countbyserver where index=* by sourcetype splunk_server 
 | eventstats sum(countbyserver) as countbysourcetype  by sourcetype
 | where splunk_server="mybadboy"
 | eval ratio=(100.00*countbyserver)/countbysourcetype
 | where ratio > 10 

That should give you a list of sourcetypes where that bad boy is getting more than 10% of the traffic for the sourcetype. If none of those stand out, then do the same thing but calculate the percentage of traffic coming from each sending host, as in the sketch below.
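The per-host version is just the same skeleton with the sending host swapped in for sourcetype (aircode, same caveats):

 | tstats count as countbyserver where index=* by host splunk_server
 | eventstats sum(countbyserver) as countbyhost by host
 | where splunk_server="mybadboy"
 | eval ratio=(100.00*countbyserver)/countbyhost
 | where ratio > 10

Any sending host that shows up there is pushing a disproportionate share of its traffic to that one indexer.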

4) If all else fails, take that indexer offline and see if any other indexer suddenly exhibits the same behavior.
