We have an environment where Splunk UFs send logs to an HF, and the UFs frequently get stuck even though the HF and indexers are up; we have to restart the UFs to get them sending logs again. Why do the UFs remain stuck even after the indexer or HF becomes available again? CPU and RAM utilization on the servers is normal.
These are the 4 main scenarios I would imagine in a simple forwarder-receiver topology:
A. The forwarder crashes while it is unable to forward data to the receiver (whether due to an unreachable receiver, network issues, an incorrect/missing outputs.conf, or the like): in-memory data will not be moved into the persistent queue, even if the persistent queue still has enough space to accommodate the in-memory queue data.
B. The forwarder is gracefully shut down while it is unable to forward data to the receiver (same causes as above): in-memory data will likewise not be moved into the persistent queue, even if space is available.
C. The forwarder crashes but has been able to forward data to the receiver so far: persistent queue data will be preserved on disk, but in-memory data is very likely to be lost.
D. The forwarder is gracefully shut down and has been able to forward data to the receiver so far: both persistent queue and in-memory data will be forwarded (and indexed) before the forwarder fully shuts down.
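For completeness: the persistent queue referenced in these scenarios is not enabled by default; it is configured per input on the forwarder. A minimal, illustrative inputs.conf sketch (the port and sizes here are placeholders, and note that persistent queues apply to network and scripted inputs, not to monitor inputs):

```
# inputs.conf on the forwarder -- illustrative values only
[tcp://:5514]
# in-memory queue sitting in front of the on-disk queue
queueSize = 1MB
# on-disk persistent queue used once the in-memory queue fills
persistentQueueSize = 100MB
```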
I have identified that the aggqueue and tcpout_Default_autolb_group queues have the most issues: the aggregator processor and one sourcetype show the highest CPU utilization. How can I fix this?
@kiran_panchavat, I checked this and my queues are indeed full, but my question is: once the queues are back to normal, why do some UFs not recover on their own, so that we need to restart the service?
That sourcetype probably contains something which is breaking the data pipeline. You should start with the following documents to understand what can cause this issue:
https://docs.splunk.com/Documentation/Splunk/latest/Deploy/Datapipeline
https://conf.splunk.com/files/2019/slides/FN1570.pdf
https://docs.splunk.com/Documentation/Splunk/latest/DMC/IndexingDeployment
Useful pipeline searches with metrics.log:
How much time is Splunk spending within each pipeline?
index=_internal source=*metrics.log* group=pipeline | timechart sum(cpu_seconds) by name
How much time is Splunk spending within each processor?
index=_internal source=*metrics.log* group=pipeline | timechart sum(cpu_seconds) by processor
What is the 95th percentile of measured queue size?
index=_internal source=*metrics.log* group=queue | timechart perc95(current_size) by name
metrics.log:
group=queue displays the data to be processed
current_size can identify which queues are the bottlenecks
blocked=true indicates a busy pipeline
Checking metrics.log across the topology reveals the whole picture. An occasional queue filling up does
not indicate an issue. It becomes an issue when it remains full and starts to block other queues.
index=_internal source=*metrics.log* host=<your-hostname> group IN(pipeline, queue)
02-23-2019 01:08:43.802 +0000 INFO Metrics - group=queue, name=indexqueue, blocked=true,
max_size_kb=500, current_size_kb=499, current_size=968, largest_size=968, smallest_size=968
02-23-2019 01:10:39.802 +0000 INFO Metrics - group=pipeline, name=typing, processor=sendout,
cpu_seconds=0.05710199999999998, executes=134716, cumulative_hits=1180897
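To turn raw lines like those into a quick per-queue fill summary, a small awk sketch can help. This is a minimal sketch: the sample lines below stand in for the real /opt/splunk/var/log/splunk/metrics.log on the affected forwarder, and the percentages are simply current_size_kb over max_size_kb.

```shell
# Sample metrics.log lines; in practice point awk at
# /opt/splunk/var/log/splunk/metrics.log on the affected forwarder.
cat > /tmp/metrics_queue.log <<'EOF'
02-23-2019 01:08:43.802 +0000 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=968, largest_size=968, smallest_size=968
02-23-2019 01:10:39.802 +0000 INFO Metrics - group=queue, name=parsingqueue, max_size_kb=512, current_size_kb=51, current_size=100, largest_size=120, smallest_size=10
EOF

# For each group=queue line, report how full the queue is (current/max, in %).
queue_fill=$(awk '/group=queue/ {
  name = ""; max = 0; cur = 0
  for (i = 1; i <= NF; i++) {
    if ($i ~ /^name=/)            { name = substr($i, 6); sub(/,$/, "", name) }
    if ($i ~ /^max_size_kb=/)     { max  = substr($i, 13) + 0 }
    if ($i ~ /^current_size_kb=/) { cur  = substr($i, 17) + 0 }
  }
  if (max > 0) printf "%s %.0f%%\n", name, 100 * cur / max
}' /tmp/metrics_queue.log)
echo "$queue_fill"
```

A queue that sits near 100% across many samples, while the queue downstream of it stays low, is the one worth investigating first.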
Ensure there are no network connectivity problems between the UFs and the HFs; intermittent network issues can cause the UFs to get stuck. Check the queue sizes on the UFs: if a queue is full, the UF might stop processing new logs until space becomes available. Even though you mentioned that CPU and RAM utilization is normal, it is worth checking for spikes or unusual patterns in resource usage. If the HF is overloaded, it might not be able to process logs from the UFs efficiently.
Please check the queues on the UF and Heavy Forwarder (HF), as they are likely reaching capacity. Consider increasing the number of ingestion pipelines. Verify metrics.log on the UF and HF to see if any queues are getting blocked. You can check the log with:
grep -i "blocked=true" /opt/splunk/var/log/splunk/metrics.log
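Building on that grep, a short pipeline can count which queue blocks most often. This is a sketch using sample lines for illustration; run the same pipeline against the real /opt/splunk/var/log/splunk/metrics.log on the UF or HF.

```shell
# Sample metrics.log excerpt; substitute the real file on the UF/HF.
cat > /tmp/metrics_blocked.log <<'EOF'
02-23-2019 01:08:43.802 +0000 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023
02-23-2019 01:08:43.802 +0000 INFO Metrics - group=queue, name=tcpout_default-autolb-group, blocked=true, max_size_kb=512, current_size_kb=511
02-23-2019 01:09:43.802 +0000 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023
02-23-2019 01:09:43.802 +0000 INFO Metrics - group=queue, name=indexqueue, max_size_kb=500, current_size_kb=10
EOF

# Count blocked=true events per queue name, most frequent first.
blocked_counts=$(grep 'blocked=true' /tmp/metrics_blocked.log \
  | grep -o 'name=[^,]*' \
  | sort | uniq -c | sort -rn)
echo "$blocked_counts"
```

As a rough rule of thumb: if the tcpout queue dominates, the bottleneck is typically downstream (receiver or network); if aggqueue or typingqueue dominate, heavy parsing on the forwarder itself (e.g. one expensive sourcetype) is more likely, and adding an ingestion pipeline (parallelIngestionPipelines in server.conf) trades extra CPU for throughput.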
