Security

Splunk forwarder suddenly stopped

Nawab
Communicator

We have an environment where Splunk UFs send logs to a HF. The UFs often get stuck even though the HF and indexers are up, and we need to restart the UFs to get them sending logs again. Why do the UFs stay stuck even after the indexers and HF are available again? CPU and RAM utilization on the servers is normal.

1 Solution

kiran_panchavat
Influencer

@Nawab 

These are the four main scenarios I would imagine in a simple forwarder-receiver topology:

A. The forwarder crashes while it is unable to forward data to the receiver (whether due to an unreachable receiver, network issues, an incorrect or missing outputs.conf, or the like): in-memory data will not be moved into the persistent queue, even if the persistent queue still has enough space to accommodate the in-memory queue data.
B. The forwarder is gracefully shut down while it is unable to forward data to the receiver (same causes as above): in-memory data will likewise not be moved into the persistent queue, even if the persistent queue has space for it.
C. The forwarder crashes, but has been able to forward data to the receiver so far: persistent queue data will be preserved on disk; however, in-memory data is very likely to be lost.
D. The forwarder is gracefully shut down, but has been able to forward data to the receiver so far: both persistent queue and in-memory data will be forwarded (and indexed) before the forwarder is fully shut down.
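Note that the scenarios above hinge on a persistent queue being configured at all: on a forwarder, persistent queues are opt-in and apply only to network, scripted, and FIFO inputs, not to file monitor inputs. A minimal illustrative inputs.conf stanza (the port and sizes here are examples, adjust for your environment):

```
# inputs.conf on the forwarder -- persistent queues are opt-in and only
# supported on network/scripted/FIFO inputs, not monitor inputs
[udp://514]
# in-memory queue for this input
queueSize = 1MB
# on-disk spillover used when the in-memory queue fills
persistentQueueSize = 100MB
```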

 

I hope this helps, if any reply helps you, you could add your upvote/karma points to that reply, thanks.


Nawab
Communicator

I have identified that the aggqueue and the tcpout_Default_autolb_group queue have the most issues, and that the aggregator processor and one sourcetype have the highest CPU utilization. Now, how can I fix this?


Nawab
Communicator

@kiran_panchavat , I checked this and my queues are full. But my question is: once the queues are back to normal, why do some UFs not recover on their own, so that we need to restart the service?



kiran_panchavat
Influencer

@Nawab 

It probably contains something that broke the data pipeline. You should start with the following documents to understand what can cause this issue:

https://docs.splunk.com/Documentation/Splunk/latest/Deploy/Datapipeline 

https://conf.splunk.com/files/2019/slides/FN1570.pdf 

https://docs.splunk.com/Documentation/Splunk/latest/DMC/IndexingDeployment 


kiran_panchavat
Influencer

@Nawab 

Useful pipeline searches with metrics.log:

How much time is Splunk spending within each pipeline?

index=_internal source=*metrics.log* group=pipeline | timechart sum(cpu_seconds) by name

How much time is Splunk spending within each processor?

index=_internal source=*metrics.log* group=pipeline | timechart sum(cpu_seconds) by processor

What is the 95th percentile of measured queue size?

index=_internal source=*metrics.log* group=queue | timechart perc95(current_size) by name
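When a stuck UF's _internal data is not reaching the indexers, the same per-processor numbers can be pulled straight from the forwarder's local metrics.log. A rough shell equivalent of the per-processor search above (the commented log path is the default Linux UF install location and is an assumption; adjust for your install):

```shell
# sum_pipeline_cpu FILE
# Sum cpu_seconds per processor across group=pipeline events in a
# Splunk metrics.log, printed busiest-first.
sum_pipeline_cpu() {
  awk -F', ' '
    /group=pipeline/ {
      proc = ""; cpu = 0
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^processor=/)   { sub(/^processor=/, "", $i); proc = $i }
        if ($i ~ /^cpu_seconds=/) { sub(/^cpu_seconds=/, "", $i); cpu = $i }
      }
      if (proc != "") total[proc] += cpu
    }
    END { for (p in total) printf "%s %.4f\n", p, total[p] }
  ' "$1" | sort -k2 -rn
}

# Typical UF log location (adjust for your install):
# sum_pipeline_cpu /opt/splunkforwarder/var/log/splunk/metrics.log
```

This only aggregates what is still in the current metrics.log on disk, so it is a spot check, not a replacement for the timechart searches.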

 

 


kiran_panchavat
Influencer

@Nawab 

In metrics.log:

group=queue entries show the data waiting to be processed
current_size identifies which queues are the bottlenecks
blocked=true indicates a busy pipeline

Checking metrics.log across the topology reveals the whole picture. An occasional queue filling up does not indicate an issue. It becomes an issue when a queue remains full and starts to block other queues.

index=_internal source=*metrics.log* host=<your-hostname> group IN (pipeline, queue)


02-23-2019 01:08:43.802 +0000 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=968, largest_size=968, smallest_size=968

02-23-2019 01:10:39.802 +0000 INFO Metrics - group=pipeline, name=typing, processor=sendout, cpu_seconds=0.05710199999999998, executes=134716, cumulative_hits=1180897
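For a quick per-queue tally of blocked events straight from a forwarder's own metrics.log, a small shell sketch (the field layout is assumed to match the sample lines above; the commented path is the default Linux UF location):

```shell
# blocked_queues FILE
# Count blocked=true events per queue name in a Splunk metrics.log,
# most-blocked queue first.
blocked_queues() {
  grep 'group=queue' "$1" | grep 'blocked=true' \
    | sed -n 's/.*name=\([^,]*\).*/\1/p' \
    | sort | uniq -c | sort -rn
}

# blocked_queues /opt/splunkforwarder/var/log/splunk/metrics.log
```

A queue that dominates this tally over a long window is the one to chase; a handful of scattered hits is normal back-pressure.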

 


kiran_panchavat
Influencer

@Nawab

Ensure there are no network connectivity problems between the UFs and the HFs; intermittent network issues can cause the UFs to get stuck. Check the queue size on the UFs: if a queue is full, the UF might stop processing new logs until space is available. Even though you mentioned that CPU and RAM utilization is normal, it is worth checking for spikes or unusual patterns in resource usage. If the HF is overloaded, it might not be able to process logs from the UFs efficiently.

Please check the queues on the UF and Heavy Forwarder (HF), as they are likely reaching capacity. Consider increasing the queue sizes or adding an ingestion pipeline. Verify metrics.log on the UF and HF to see if any queues are getting blocked. You can find blocked events with:

grep -i "blocked=true" /opt/splunk/var/log/splunk/metrics.log
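If UFs still wedge and a restart is the only cure while you root-cause, a crude watchdog sketch can automate the workaround (paths are the default Linux locations and the threshold is an arbitrary assumption; restarting is a stopgap, not a fix):

```shell
# stuck_check FILE THRESHOLD
# Succeed (exit 0) when the last 200 metrics.log lines contain more than
# THRESHOLD blocked=true events -- a crude "forwarder looks wedged" signal.
stuck_check() {
  n=$(tail -n 200 "$1" | grep -c 'blocked=true')
  [ "$n" -gt "$2" ]
}

# Watchdog use, e.g. from cron (restart only as a stopgap):
# stuck_check /opt/splunkforwarder/var/log/splunk/metrics.log 50 \
#   && /opt/splunkforwarder/bin/splunk restart
```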

 
