
UniversalForwarder ParsingQueue filling up

oliverj
Communicator

I have been troubleshooting blocked queues and have been gradually eliminating them. My last step was to switch from a heavy forwarder to a universal forwarder, eliminating all processing activity on the forwarder. This helped a lot, but now my universal forwarder is logging blocked=true messages for its parsing queue. (In a ten-minute period, about 75% of the parsingqueue messages in metrics.log are "blocked".)
Log flow: [~150 UniversalForwarders] -> [Central UniversalForwarder] -> [Indexer], with the "Central UF" being the problem child.
My indexer is showing no issues (all queues at 0).
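
(For reference, that 75% figure came from a quick search over this forwarder's metrics.log, roughly like the sketch below; <central_uf> is just a placeholder for the host name.)

    index=_internal host=<central_uf> source=*metrics.log* group=queue name=parsingqueue
    | eval is_blocked=if(blocked=="true", 1, 0)
    | stats sum(is_blocked) as blocked_count, count as total
    | eval pct_blocked=round(blocked_count/total*100, 1)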

My network is a 10 Mb/s connection, and throughput is showing ~10% used, so it doesn't seem to be the network.
The central UF is passing about 15 GB of data a day at a steady rate.
I boosted the queue size up to 30MB and I still get the same issue. (I confirmed the 30MB setting actually took effect.)
I have set limits.conf [thruput] to maxKBps = 0.
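
For reference, the relevant settings on the central UF look roughly like this (stanza names as I remember them, so treat it as a sketch rather than a copy/paste):

    # server.conf -- enlarge the parsing queue
    [queue=parsingQueue]
    maxSize = 30MB

    # limits.conf -- remove the default forwarding thruput cap
    [thruput]
    maxKBps = 0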

Second question that I was unable to find an answer for: why is there a parsing queue on the universal forwarder, if only heavy forwarders actually do any parsing?


ddrillic
Ultra Champion

-- Second question that I was unable to find an answer for: why is there a parsing queue on the universal forwarder, if only heavy forwarders actually do any parsing?

It refers to the parsing queue of the indexer.


oliverj
Communicator

That part throws me off, because my indexer has shown no blocked queues for the past week.
The indexing performance view in the DMC shows all pipelines at 0% on my indexer.


micahkemp
Champion

Universal forwarders have limited throughput out of the box. From the documentation:

Universal and lightweight forwarders have a default thruput limit of 256KBps. This default can be configured in limits.conf. The default value is appropriate for a low-profile forwarder, indexing up to ~920 MB/hour. But in the case of higher indexing volumes, or when the forwarder has to collect historical logs after its first start, the default might be too low. This can delay recent events.
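
One way to confirm which limit is actually in effect on the forwarder is to ask btool for the merged configuration, something along these lines (path assumes a typical install):

    $SPLUNK_HOME/bin/splunk btool limits list thruput --debug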

oliverj
Communicator

Oh, I forgot to add that in my post.
I have also changed the thruput limit from 256 to 0.


micahkemp
Champion

Are any of the other queues on this forwarder filled as well? If not, the output throttling wouldn't seem to be the issue, as that would result in those downstream queues filling.

If only your parsing queue is filled, it could just be insufficient resources on the forwarder. What does your CPU and memory usage look like?
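
If it helps, a rough search against that forwarder's metrics.log to watch each queue's fill level over time might look something like this (<central_uf> is a placeholder):

    index=_internal host=<central_uf> source=*metrics.log* group=queue
    | eval pct_full=round(current_size_kb/max_size_kb*100, 1)
    | timechart span=10m max(pct_full) by name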


oliverj
Communicator

The only other item filling is my "splunktcpinput", and I assume that is a direct result of my parsing queue filling.

As far as resources are concerned:
my forwarder is using 1 GB of its 4 GB of RAM, and 1-2% of CPU on the single core in use.


schandrasekar
Loves-to-Learn

I know this is a very old post, but I am seeing the same problem. Only execprocessorinternalq and parsingqueue are blocked, and only for one forwarder; the others are working fine. The deployment is UF -> HFs -> IDXs.
05-20-2020 19:16:10.812 +1000 INFO Metrics - group=queue, name=execprocessorinternalq, blocked=true, max_size_kb=500, current_size_kb=499, current_size=162, largest_size=162, smallest_size=162
05-20-2020 19:16:10.812 +1000 INFO Metrics - group=queue, name=fschangemanager_queue, max_size_kb=5120, current_size_kb=1, current_size=7, largest_size=7, smallest_size=7
05-20-2020 19:16:10.812 +1000 INFO Metrics - group=queue, name=httpinputq, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
05-20-2020 19:16:10.812 +1000 INFO Metrics - group=queue, name=indexqueue, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
05-20-2020 19:16:10.812 +1000 INFO Metrics - group=queue, name=nullqueue, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
05-20-2020 19:16:10.812 +1000 INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=10240, current_size_kb=10239, current_size=308, largest_size=308, smallest_size=308
05-20-2020 19:16:10.812 +1000 INFO Metrics - group=queue, name=splunktcpin, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
05-20-2020 19:16:10.812 +1000 INFO Metrics - group=queue, name=structuredparsingqueue, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
05-20-2020 19:16:10.812 +1000 INFO Metrics - group=queue, name=tcpin_queue, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
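
To check whether any other forwarders are hitting this, I have been using a rough search like the one below over the last hour (just a sketch):

    index=_internal source=*metrics.log* group=queue blocked=true earliest=-1h
    | stats count by host, name
    | sort - count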


oliverj
Communicator

I try to remember to post my resolutions after things get fixed, but unfortunately this one slipped through the cracks. It was a long time ago, and if I remember correctly, the fix was indirect. (It's been a while, but I THINK this is what happened.)

  1. Main problem: our indexer's storage was MUCH too slow (networked RAID 5, probably 300 IOPS shared with multiple VMs).
  2. Our network was inconsistent and latency was high (look up "long fat pipe"). Basically, it didn't matter how "fast" the pipe was; the TCP round-trip time was an artificial throttle.

To resolve:

  1. We worked with our users to reduce our logs (we had to kick someone off for a while; they were accounting for 10 of the 15 GB). This let us continue collecting the critical logs while we came up with a path forward.
  2. Our system was originally designed to handle ~5 GB tops (about 5x more than the requirement!), and peaking at 15 GB was way outside scope. We got our hands on some money and purchased two dedicated servers (indexers) with SSD storage for the warm buckets (see the indexes.conf sketch below).
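
The relevant part of the new indexers' indexes.conf looked roughly like this (paths and sizes are made up for illustration):

    # indexes.conf -- hot/warm buckets on the SSD volume, cold on slower disk
    [volume:ssd]
    path = /data/ssd/splunk
    maxVolumeDataSizeMB = 400000

    [volume:slow]
    path = /data/slow/splunk

    [main]
    homePath = volume:ssd/main/db
    coldPath = volume:slow/main/colddb
    thawedPath = $SPLUNK_DB/main/thaweddb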

I feel that the main issue causing this whole thing was the slow storage on the indexers. Nothing was really reporting full queues except the forwarder, but reducing the incoming logs fixed it immediately, and after upgrading the hardware we are able to push 20-30 GB a day with no issues.

The slow network has since been upgraded (the latency is the same, the pipe is fatter), but I am not certain that was ever the issue.
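
For a sense of why the latency mattered more than the raw bandwidth: a single TCP connection's throughput is roughly capped at window size divided by round-trip time. With a purely illustrative 64 KB window and 100 ms RTT:

    64 KB / 0.1 s = 640 KB/s ≈ 5 Mb/s

so even the "10 Mb" pipe could only be about half used by one connection unless the window scaled up.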
