Recently I worked on an issue where a Splunk Universal Forwarder with useACK=true was reporting memory usage over 24GB. Normal usage is around 2-3GB. In this post I have decided to share the steps taken to resolve it.
In order to debug such cases, here are some pointers.
1) On the forwarder, set useACK=false in outputs.conf and confirm that the issue does not happen.
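For reference, a minimal outputs.conf for this test would look like the following (the group name and indexer addresses match the configuration shown later in this post; the forwarder needs a restart for the change to take effect):
###Universal Forwarder : outputs.conf (test only)###
[tcpout:xx]
server = INDX1:9997, INDX2:9997
# disable indexer acknowledgement just for the test
useACK = false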
2) Run splunkd with jemalloc memory profiling enabled, as shown below.
– mkdir /tmp/memprofile
– MALLOC_CONF="prof:true,prof_accum:true,prof_leak:true,lg_prof_interval:28,prof_prefix:/tmp/memprofile/heap_data" splunk start splunkd
Once Splunk starts, wait for the issue to re-appear and gather the heap files from /tmp/memprofile (they are written with the heap_data prefix).
Use jeprof to create a visualization of the heap dump (the splunkd binary we used is unstripped, so symbols resolve):
jeprof --lib_prefix=$SPLUNK_HOME/lib --svg $SPLUNK_HOME/bin/splunkd /tmp/memprofile/heap_data.*.heap > output.svg
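If graphviz is not available on the host (the --svg output relies on its dot tool to render), jeprof can also print a plain-text summary of the same heap files:
– jeprof --text --lib_prefix=$SPLUNK_HOME/lib $SPLUNK_HOME/bin/splunkd /tmp/memprofile/heap_data.*.heap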
In our case the heap profile showed UTF8Processor using most of the memory.
3) Reviewed the data that was ingested over time and found that memory utilization on the Universal Forwarder increased when the ingested log files were binary logs.
4) What seems to be occurring is that Splunk is monitoring a file that contains a large event that Splunk does not know how to break.
When the UF reads this, it needs to send the full event to the same indexer for indexing. With useACK enabled, the forwarder must stay with that indexer until the full event is ACK'ed. Since the event is large and throughput is throttled to 512KBps, the event takes a while to transfer. In the meantime the forwarder is parsing the input and, if it is non-UTF8, trying to convert it to UTF8. This causes the memory bloat on the forwarder while the event is waiting to be sent to the indexer and ACK'ed. This is consistent with the memory growth we see and with why it only happens with particular files (and, my guess is, particular events in these files) where events are large and Splunk is unable to break them at the forwarder.
This also explains scenarios where we saw memory grow to 24GB and then "heal" itself: once the full file was sent over and acknowledged, the forwarder was able to clean up the event and its associated memory.
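For context, the forwarder-side throughput cap mentioned above is the maxKBps setting under the [thruput] stanza in limits.conf; on this forwarder it was effectively 512KBps. It is shown here only for reference, raising it was not the fix:
###Universal Forwarder : limits.conf###
[thruput]
# maximum KB per second the forwarder will send (512KBps in our case)
maxKBps = 512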
Looked at the input log file test.log that caused the issue. It is basically an 18GB file filled with mostly NULL values. So this huge (likely garbage) event is what the forwarder was sending to the indexer, which caused the memory bloat. We also saw that the event contained a large amount of binary characters, which again points to an extremely large event being queued.
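A quick way to confirm what such a file contains is a couple of standard shell commands (a generic check, nothing Splunk-specific; od prints NULL bytes as \0):
– file test.log
– ls -lh test.log
– head -c 256 test.log | od -c | head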
To remedy this we can use event breaking in props.conf on the forwarder (see the Event breaking section here: http://docs.splunk.com/Documentation/Splunk/7.1.3/Data/Resolvedataqualityissues), followed by null-queueing the NULL characters on the indexer.
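As a side note, if you want the break to happen on the forwarder itself as the docs describe, the Universal Forwarder supports EVENT_BREAKER settings in props.conf. This is only a sketch, not part of the configuration we actually deployed (which breaks and filters on the indexer, shown below):
###Universal Forwarder : props.conf (sketch only)###
[test]
# let the UF break the stream on NULLs and newlines instead of queueing one huge event
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = (\x00|\r|\n)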
The following configuration was implemented to filter out this NULL data.
###Universal Forwarder : inputs.conf###
[monitor:///home/rbal/gc.log]
index = main
sourcetype = test
###Universal Forwarder : outputs.conf###
[tcpout]
defaultGroup = esfidxcluster_search_peers
[tcpout:xx]
server = INDX1:9997, INDX2:9997
useACK = true
My indexer has the following configuration:
###Indexer : props.conf###
[test]
LINE_BREAKER = (\x00|\\x00|\\\\x00|\r|\n)
TRANSFORMS-colorchange = yellow
###Indexer : transforms.conf###
[yellow]
DEST_KEY = queue
FORMAT = nullQueue
REGEX = (\x00|\\x00|\\\\x00|\r|\n)
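Once these are in place, a quick sanity check (a generic verification step, not from the original troubleshooting notes) is to confirm the effective settings with btool and restart the indexer:
– $SPLUNK_HOME/bin/splunk btool props list test --debug
– $SPLUNK_HOME/bin/splunk btool transforms list yellow --debug
– $SPLUNK_HOME/bin/splunk restart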