During the day one specific type of logging seems to lag behind quite a lot. From 10 minutes behind at the start of the day to 90m towards the end. It's logging from a heavy forwarder (Checkpoint logging) where two hosts are coming from the same heavyfowarder, but one lags behind and the other one isn't:
11-23-2017 16:07:02.058 +0100 INFO Metrics - group=per_host_thruput, series="host1", kbps=160.766222, eps=154.383526, kb=4983.868164, ev=4786, avg_age=0.820727, max_age=2 11-23-2017 16:07:02.058 +0100 INFO Metrics - group=per_host_thruput, series="host2", kbps=519.186108, eps=509.020485, kb=16095.141602, ev=15780, avg_age=4667.346071, max_age=5139
What can we do to determine what's causing the lag? The Heavy Forwarder is running fine and the Indexers aren't the issue either.
I have the same issue. I was planning to refresh the opsec-lea connection with the latest Splunk Checkpoint addon on a new heavy forwarder with no other ingestion/parsing (isolated) and then go from there. I will likely end up doing that and opening a case.
If you haven't setup the Checkpoint addon and know how the configuration works, then it is hard to give guidance for this issue. This is a scripted input which uses an opsec-lea connection/pull logs from a queue method.
All logs come from the Checkpoint management server. I suspect the issue is that there is more logs to ingest in the queue than the script is able to pull. I think of the issue like trying to bucket the water out of the sinking Titanic.
I also understand that they have a syslog option now so that may be the answer. I just need to verify it gives all the needed info.
I will re-post if I find the answer.
Hi, thanks for the input. We've tried several different things (new heavy forwarder included) and the issue persists.
We're trying the syslog route instead now since it offers more redundancy.
Do not forward compressed files if many are created simiultaneously:
Perhaps you are monitoring rolled files instead of the "hot" file?
Make sure there is enough headroom in the
thruput limit in
limits.conf for peak times; the default setting is way too s/low (50MB/min).
You can examine current state of the tailing processor as described here:
How "deep" is the directory that you are monitoring and do you have a
How many files are in the directories that you are monitoring and do you ever delete them?
If you have too many files (
stat is not free), then Splunk will run slower and slower and slower as more files accumulate; something must clean house periodically.
If the files are written atomically (not appended continuously), consider using
sinkhole which does the housekeeping for you inside of Splunk:
I have also done this:
Or it could be even be something crazy like this:
It's checkpoint logging on a heavy forwarder which grabs the logs through a python script called lealoggrabber, which makes it hard to tail the logging.
Thruput in limits.conf is set to 10MB and the specific queues have been increased as well (were on 500kbs, were increased to 20MB since they showed blocked=true).
11-27-2017 15:35:49.756 +0100 INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=20480, current_size_kb=20479, current_size=32086, largest_size=32086, smallest_size=0 11-27-2017 15:35:49.756 +0100 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=20480, current_size_kb=20479, current_size=19334, largest_size=19334, smallest_size=0
The machine itself shows use of 40-50% CPU and memory use is about 30% on it.
If you chart out the delay it shows a delay of almost 2 hours during the day and it catches up during the evening / weekends. It's about 200 GB / day worth of logging.
How do the daily entire-day stats compare against each other? Which one has more events each day? Are the two hosts configured identically? On the lagging host, does the lag always increase, or does it sometimes drop as well?
Your data seems to show that host2 chunks at the end of the day contain events that vary by almost 1000 seconds from each other (2*(5139-4667)=944 seconds, so they are only being forwarded every 15 minutes. Check the average age.
index=_internal log_level="INFO" component="Metrics" group="per_host_thruput" (series="host1" OR series="host2") | bin _time span=1h | stats count as chunks avg(avg_age) as avg_age by _time series | rename COMMENT as "Add zero records for each hour, each day, each host just in case" | appendpipe [ | bin _time as day span=1d | stats values(day) as day values(series) as series | eval hour=mvrange(-1,24) | mvexpand hour | mvexpand day | eval _time=day+if(hour<0,hour,3600*hour) | table _time series | mvexpand series | eval chunks=0, avg_age=0 ] | rename COMMENT as "Sum them up with the zero records, then chart them" | stats sum(chunks) as chunks, max(avg_age) as avg_age by series _time | timechart span=1h avg(avg_age) as age by series
You can also check what is happening with the average chunk size across the day...
index=_internal log_level="INFO" component="Metrics" group="per_host_thruput" (series="host1" OR series="host2") | bin _time span=1h | stats count as chunks avg(kb) as avg_kb by _time series | rename COMMENT as "Add zero records for each hour, each day, each host just in case" | appendpipe [ | bin _time as day span=1d | stats values(day) as day values(series) as series | eval hour=mvrange(-1,24) | mvexpand hour | mvexpand day | eval _time=day+if(hour<0,hour,3600*hour) | table _time series | mvexpand series | eval chunks=0, avg_kb=0 ] | rename COMMENT as "Sum them up with the zero records, then chart them" | stats sum(chunks) as chunks, max(avg_kb) as avg_kb by series _time | timechart span=1h avg(avg_kb) as avg_kb by series
How does the data get to the heavy forwarder? is it over syslog? is there a universal forwarder in the mix? Also check the following on the heavy forwarder
index=_internal blocked OR throttled