Our data flow: a syslog server sends a large volume of data to one heavy forwarder (HF1), which routes it to an indexer cluster as well as to a second heavy forwarder (HF2). HF2 then routes the data to SyslogNG and to another indexer cluster located in a different environment.
Due to the high volume of data from the syslog server, we were facing backpressure on the syslog server and both HFs.
The vendor recommended increasing the pipeline queue size to 2500MB under server.conf on both HFs and on the syslog server, which we did.
The issue now is that HF2 has been consuming its full memory (92GB) since a recent server reboot. After reaching 100% memory usage, HF2 hangs. If we decrease the parallel pipelines from 2 to 1 on HF2, it creates backpressure on the syslog server and HF1, and the pipelines overflow.
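For reference, the vendor-recommended change looks roughly like this in server.conf (a paraphrased sketch, not our exact files; the queue size was applied on the syslog server and both HFs, and the two parallel pipelines are configured on HF2):

# server.conf (sketch of the current settings)
[queue]
maxSize = 2500MB

[general]
parallelIngestionPipelines = 2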
Before the HF2 reboot, memory consumption was under 10GB with the same 2500MB pipeline queue size, and the splunkd process was normal.
Note: so far HF1, which sits between the syslog server and HF2, is not facing any memory issue.
In this situation, will increasing the memory on HF2 help? Or what would be the best solution to avoid this scenario in the future?
Hi @Raghavsri
Increasing memory on HF2 may provide temporary relief but does not address the root cause. Excessive pipeline queue size (2500MB) can cause splunkd to consume large amounts of memory, especially if data flow is uneven or downstream components are slow. You also risk losing larger volumes of data if Splunk/system crashes because all the data in the queues will be lost.
Queues should really be used as a buffer, not to expand throughput.
I would suggest reducing the queue sizes back towards the defaults and looking at why the downstream destinations (SyslogNG and the second indexer cluster) cannot keep up, rather than buffering ever more data in memory.
Ultimately, a large pipeline queue can mask underlying issues and lead to memory exhaustion. Memory upgrades alone will not prevent future hangs if the queues are oversized or downstream issues persist.
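As a first check, a search along these lines should show whether any queue on HF2 actually fills up before the hang (this assumes the standard group=queue fields from metrics.log; adjust the host filter for your HF2):

index=_internal sourcetype=splunkd source=*metrics.log* group=queue host=<your_HF2>
| eval pct_full=round(current_size_kb/max_size_kb*100,1)
| timechart span=1m max(pct_full) by name

The same group=queue events also carry blocked=true when a queue is blocked, which is a quicker signal of where the backpressure starts.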
🌟 Did this answer help you? If so, please consider marking it as the solution or giving it kudos; your feedback encourages the volunteers in this community to continue contributing.
In splunkd.log on HF2 I can see these entries:
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=per_host_thruput, ingest_pipe=2, series="lmpsplablr001", kbps=4.247, eps=24.613, kb=131.668, ev=763, avg_age=2.279, max_age=3
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=per_index_thruput, ingest_pipe=2, series="_internal", kbps=2.206, eps=13.032, kb=68.396, ev=404, avg_age=2.233, max_age=3
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=per_index_thruput, ingest_pipe=2, series="_metrics", kbps=2.041, eps=11.581, kb=63.272, ev=359, avg_age=2.331, max_age=3
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=per_source_thruput, ingest_pipe=2, series="/mnt/splunk/splunk/var/log/splunk/audit.log", kbps=0.000, eps=0.032, kb=0.000, ev=1, avg_age=0.000, max_age=0
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=per_source_thruput, ingest_pipe=2, series="/mnt/splunk/splunk/var/log/splunk/metrics.log", kbps=4.082, eps=23.355, kb=126.545, ev=724, avg_age=2.312, max_age=3
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=per_source_thruput, ingest_pipe=2, series="/mnt/splunk/splunk/var/log/splunk/splunkd_access.log", kbps=0.165, eps=1.226, kb=5.123, ev=38, avg_age=1.711, max_age=3
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=per_sourcetype_thruput, ingest_pipe=2, series="splunk_audit", kbps=0.000, eps=0.032, kb=0.000, ev=1, avg_age=0.000, max_age=0
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=per_sourcetype_thruput, ingest_pipe=2, series="splunk_metrics_log", kbps=2.041, eps=11.677, kb=63.272, ev=362, avg_age=2.312, max_age=3
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=per_sourcetype_thruput, ingest_pipe=2, series="splunkd", kbps=2.041, eps=11.677, kb=63.272, ev=362, avg_age=2.312, max_age=3
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=per_sourcetype_thruput, ingest_pipe=2, series="splunkd_access", kbps=0.165, eps=1.226, kb=5.123, ev=38, avg_age=1.711, max_age=3
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=queue, ingest_pipe=2, name=tcpout_my_syslog_group, max_size=512000, current_size=0, largest_size=0, smallest_size=0
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=queue, ingest_pipe=2, name=tcpout_primary_indexers, max_size=512000, current_size=0, largest_size=219966, smallest_size=0
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=queue, ingest_pipe=2, name=aggqueue, max_size_kb=2560000, current_size_kb=0, current_size=0, largest_size=260, smallest_size=0
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=queue, ingest_pipe=2, name=indexqueue, max_size_kb=2560000, current_size_kb=0, current_size=0, largest_size=8, smallest_size=0
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=queue, ingest_pipe=2, name=nullqueue, max_size_kb=2560000, current_size_kb=0, current_size=0, largest_size=1, smallest_size=0
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=queue, ingest_pipe=2, name=parsingqueue, max_size_kb=2560000, current_size_kb=0, current_size=0, largest_size=2, smallest_size=0
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=queue, ingest_pipe=2, name=syslog_system, max_size_kb=97, current_size_kb=0, current_size=0, largest_size=1, smallest_size=0
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=queue, ingest_pipe=2, name=syslog_system2, max_size_kb=97, current_size_kb=0, current_size=0, largest_size=1, smallest_size=0
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=queue, ingest_pipe=2, name=typingqueue, max_size_kb=2560000, current_size_kb=0, current_size=0, largest_size=257, smallest_size=0
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=syslog_connections, ingest_pipe=2, syslog_system2:x.x.x.x:514:x.x.x.x:514, sourcePort=8089, destIp=x.x.x.x, destPort=514, _tcp_Bps=0.00, _tcp_KBps=0.00, _tcp_avg_thruput=3.71, _tcp_Kprocessed=2266, _tcp_eps=0.00
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=syslog_connections, ingest_pipe=2, syslog_system:y.y.y.y:514:y.y.y.y:514, sourcePort=8089, destIp=y.y.y.y, destPort=514, _tcp_Bps=0.00, _tcp_KBps=0.00, _tcp_avg_thruput=0.35, _tcp_Kprocessed=213, _tcp_eps=0.00
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=tcpout_connections, ingest_pipe=2, name=primary_indexers:z.z.z.z0:9997:0:0, sourcePort=8089, destIp=z.z.z.z0, destPort=9997, _tcp_Bps=1545.33, _tcp_KBps=1.51, _tcp_avg_thruput=1.51, _tcp_Kprocessed=45, _tcp_eps=0.50, kb=44.94
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=thruput, ingest_pipe=2, name=index_thruput, instantaneous_kbps=0.000, instantaneous_eps=0.000, average_kbps=0.000, total_k_processed=0.000, kb=0.000, ev=0
06-13-2025 12:34:19.086 +0800 INFO Metrics - group=thruput, ingest_pipe=2, name=thruput, instantaneous_kbps=4.505, instantaneous_eps=26.000, average_kbps=10.456, total_k_processed=6705.000, kb=139.648, ev=806, load_average=1.500
Any abnormalities in these entries? I suspect the issue is with HF2 only: when HF2 is stopped, everything else works fine. Once the HF2 service is started, it climbs from about 1GB to 50GB of memory usage out of 130GB, then HF1 also starts consuming memory and log ingestion stops, especially from the syslog server (the largest-volume input); the index fed by that input is affected first.
Hence increasing memory on HF2 does not seem helpful here.
Hi @Raghavsri ,
I had a similar issue in a past project.
Check the parsing rules: there may be some unoptimized regexes that require too much memory, especially regexes that start with ".*".
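As a purely hypothetical illustration (the stanza names and pattern are invented, not taken from your configuration), the difference is roughly this in a null-queue filtering transform:

# transforms.conf (hypothetical example)
# The leading and trailing ".*" force the regex engine to scan and backtrack across every long syslog line:
[slow_filter]
REGEX = .*debug.*
DEST_KEY = queue
FORMAT = nullQueue

# An unanchored plain pattern matches the same events far more cheaply:
[fast_filter]
REGEX = debug
DEST_KEY = queue
FORMAT = nullQueue

The same idea applies to SEDCMD and TRANSFORMS rules in props.conf on the heavy forwarder.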
Ciao.
Giuseppe
@Raghavsri
What's the version of Splunk you are running?
Also, to start with, check a few things:
Review logs: Look for errors, warnings, or abnormal behavior in splunkd.log
Check destination health: Ensure that SyslogNG and the second indexer cluster are healthy and accepting data efficiently
Forwarding bottleneck: if HF2 is not able to forward data fast enough (due to network, destination, or performance issues), the queues fill up and consume memory
Memory upgrade: Increasing memory on HF2 may help if the issue is due to legitimate high data volume and not a leak or misconfiguration. However, if the problem is a memory leak/bandwidth issue, increasing memory will only delay the inevitable crash
Load Balancing: Consider load balancing across multiple HFs if possible, to distribute the data load
Monitor memory usage: Set up alerts for high memory usage to detect issues early.
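For example, a scheduled alert against the introspection data could look like this (the host name is a placeholder and the 80% threshold is just a starting point; this assumes the standard Hostwide fields data.mem and data.mem_used):

index=_introspection sourcetype=splunk_resource_usage component=Hostwide host=<HF2_host>
| eval mem_used_pct=round('data.mem_used'/'data.mem'*100,1)
| where mem_used_pct > 80
| stats max(mem_used_pct) AS max_mem_used_pct by host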
Regards,
Prewin
Splunk Enthusiast | Always happy to help! If this answer helped you, please consider marking it as the solution or giving a kudos/Karma. Thanks!
We are running version 9.2.2.
So the queue fill-up and memory consumption on HF2 may be due to outgoing traffic? It would not be caused by the large volume of incoming data routed from HF1?
Yes, we plan to add one more HF in the HF2 layer for load balancing, but that will take some time; meanwhile we need to fix the current ongoing issue.
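When the additional HF is ready, the plan is simply to add it to the existing tcpout group on HF1, roughly like this (the group name and hostnames are placeholders, not our real configuration):

# outputs.conf on HF1 (sketch)
[tcpout:hf2_layer]
server = hf2a.example.com:9997, hf2b.example.com:9997
autoLBFrequency = 30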