Splunk Fw suddenly stopped

Nawab
Communicator

We have an environment where Splunk UFs send logs to an HF. The UFs often get stuck even though the HF and indexers are up, and we need to restart the UFs to get them sending logs again. Why do the UFs stay stuck even once the indexer or HF is available again? CPU and RAM utilization on the servers is normal.

0 Karma
1 Solution

kiran_panchavat
SplunkTrust

@Nawab 

These are the 4 main scenarios I would imagine in a simple forwarder-receiver topology (a persistent-queue configuration sketch follows the list):

A. The forwarder crashes while it is unable to forward data to the receiver (whether due to an unreachable receiver, network issues, an incorrect/missing outputs.conf, or the like): in-memory data will not be moved into the persistent queue, even if the persistent queue still has enough space to accommodate the in-memory queue data.
B. The forwarder is gracefully shut down while it is unable to forward data to the receiver (whether due to an unreachable receiver, network issues, an incorrect/missing outputs.conf, or the like): in-memory data will not be moved into the persistent queue, even if the persistent queue still has enough space to accommodate the in-memory queue data.
C. The forwarder crashes, but has been able to forward data to the receiver so far: persistent queue data will be preserved on disk, but in-memory data is very likely to be lost.
D. The forwarder is gracefully shut down, and has been able to forward data to the receiver so far: both persistent queue and in-memory data will be forwarded (and indexed) before the forwarder is fully shut down.
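
As a rough sketch only (the output group name, receiver host, input port, and sizes below are made-up examples, not values from your environment), persistent queuing and indexer acknowledgement on a forwarder look roughly like this:

# outputs.conf on the forwarder (illustrative group name, host, and size)
[tcpout:primary_indexers]
server = hf.example.com:9997
# keep data in the forwarder's wait queue until the receiver acknowledges it
useACK = true
maxQueueSize = 7MB

# inputs.conf on the forwarder - persistent queues apply to network/scripted inputs only
[udp://514]
queueSize = 1MB
# on-disk queue used once the in-memory input queue is full
persistentQueueSize = 100MB

Note that monitor:// file inputs do not use persistent queues at all; on restart the forwarder simply resumes reading the files from its last checkpoint, which is usually why file-based data survives an outage anyway.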

 

Did this help? If yes, please consider giving kudos, marking it as the solution, or commenting for clarification — your feedback keeps the community going!


0 Karma

Nawab
Communicator

I have identified that the aggqueue and the tcpout_Default_autolb_group queue have the most issues, and that the aggregator processor and one sourcetype account for most of the CPU utilization. Now, how can I fix this?

0 Karma

Nawab
Communicator

@kiran_panchavat , I checked this and my queues are full, but my question is: when the queues are back to normal, why do some UFs not come back on their own, so that we have to restart the service?

0 Karma


kiran_panchavat
SplunkTrust

@Nawab 

It probably contains something that broke the data pipeline. Start with the following documents to understand what can cause this kind of issue:

https://docs.splunk.com/Documentation/Splunk/latest/Deploy/Datapipeline 

https://conf.splunk.com/files/2019/slides/FN1570.pdf 

https://docs.splunk.com/Documentation/Splunk/latest/DMC/IndexingDeployment 
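
As a quick starting point (assuming the forwarders are still sending their _internal logs), a search along these lines shows which hosts and queues are reporting blocked events; adjust the time range as needed:

index=_internal source=*metrics.log* group=queue blocked=true | stats count as blocked_events by host, name | sort - blocked_events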

Did this help? If yes, please consider giving kudos, marking it as the solution, or commenting for clarification — your feedback keeps the community going!
0 Karma

kiran_panchavat
SplunkTrust

@Nawab 

Useful pipeline searches with metrics.log:

How much time is Splunk spending within each pipeline?

index=_internal source=*metrics.log* group=pipeline | timechart sum(cpu_seconds) by name

How much time is Splunk spending within each processor?

index=_internal source=*metrics.log* group=pipeline | timechart sum(cpu_seconds) by processor

What is the 95th percentile of measured queue size?

index=_internal source=*metrics.log* group=queue | timechart perc95(current_size) by name
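
If it helps, the same data can also show how full each queue is relative to its maximum, using the current_size_kb and max_size_kb fields from the queue metrics (a sketch, assuming those fields are extracted as usual from _internal):

index=_internal source=*metrics.log* group=queue | eval fill_pct=round(current_size_kb/max_size_kb*100,1) | timechart avg(fill_pct) by name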

 

 

Did this help? If yes, please consider giving kudos, marking it as the solution, or commenting for clarification — your feedback keeps the community going!
0 Karma

kiran_panchavat
SplunkTrust

@Nawab 

In metrics.log:

group=queue entries show the data waiting to be processed
current_size identifies which queues are the bottlenecks
blocked=true indicates a busy pipeline

Checking metrics.log across the topology reveals the whole picture. An occasional queue filling up does
not indicate an issue. It becomes an issue when it remains full and starts to block other queues.

index=_internal source=*metrics.log host=<your-hostname> group IN(pipeline, queue)


02-23-2019 01:08:43.802 +0000 INFO Metrics - group=queue, name=indexqueue, blocked=true,
max_size_kb=500, current_size_kb=499, current_size=968, largest_size=968, smallest_size=968


02-23-2019 01:10:39.802 +0000 INFO Metrics - group=pipeline, name=typing, processor=sendout,
cpu_seconds=0.05710199999999998, executes=134716, cumulative_hits=1180897
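
To put a number on "remains full", a search along these lines (same standard metrics.log fields as above) gives the share of measurements in which each queue reported blocked=true, per host:

index=_internal source=*metrics.log* group=queue | stats count(eval(blocked="true")) as blocked, count as total by host, name | eval blocked_pct=round(blocked/total*100,1) | sort - blocked_pct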

 

Did this help? If yes, please consider giving kudos, marking it as the solution, or commenting for clarification — your feedback keeps the community going!
0 Karma

kiran_panchavat
SplunkTrust

@Nawab

Ensure there are no network connectivity problems between the UFs and the HFs; sometimes intermittent network issues can cause the UFs to get stuck. Check the queue sizes on the UFs: if a queue is full, the UF might stop processing new logs until space becomes available. Even though you mentioned that CPU and RAM utilization is normal, it is worth checking for spikes or unusual patterns in resource usage. If the HF is overloaded, it might not be able to process logs from the UFs efficiently.

Please check the queues on the UF and the Heavy Forwarder (HF), as they are likely reaching capacity. Consider increasing the queue sizes or adding a parallel ingestion pipeline. Verify metrics.log on the UF and the Heavy Forwarder to see if any queues are getting blocked. You can check it with:

cat /opt/splunk/var/log/splunk/metrics.log | grep -i "blocked=true"
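
If the queues do turn out to be the bottleneck, the usual knobs look roughly like this; all values below are illustrative assumptions, not recommendations for your environment, and on a UF the install path is typically /opt/splunkforwarder rather than /opt/splunk:

# server.conf (illustrative values)
[general]
parallelIngestionPipelines = 2

[queue=parsingQueue]
maxSize = 6MB

# outputs.conf (illustrative value)
[tcpout]
maxQueueSize = 7MB

Adding a pipeline set roughly doubles the ingestion resources the instance can use, so only do it if the host has CPU headroom; and if the HF or indexers are the real bottleneck, bigger queues on the UF will only delay the blockage rather than remove it.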

 

Did this help? If yes, please consider giving kudos, marking it as the solution, or commenting for clarification — your feedback keeps the community going!
0 Karma