Getting Data In

Logging lagging behind

mmoermans
Path Finder

During the day, one specific type of logging lags behind quite a lot: from about 10 minutes behind at the start of the day to roughly 90 minutes towards the end. It's Checkpoint logging coming in via a heavy forwarder; two hosts come from the same heavy forwarder, but one lags behind and the other doesn't:

11-23-2017 16:07:02.058 +0100 INFO  Metrics - group=per_host_thruput, series="host1", kbps=160.766222, eps=154.383526, kb=4983.868164, ev=4786, avg_age=0.820727, max_age=2

11-23-2017 16:07:02.058 +0100 INFO  Metrics - group=per_host_thruput, series="host2", kbps=519.186108, eps=509.020485, kb=16095.141602, ev=15780, avg_age=4667.346071, max_age=5139

What can we do to determine what's causing the lag? The Heavy Forwarder is running fine and the Indexers aren't the issue either.
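One way to quantify the lag, assuming the Checkpoint events carry usable timestamps, is to compare index time against event time per host (the index name below is a placeholder; swap in your own):

index=checkpoint (host=host1 OR host=host2)
| eval lag_sec = _indextime - _time
| bin _time span=15m
| stats avg(lag_sec) as avg_lag_sec max(lag_sec) as max_lag_sec by _time host

If the lag climbs through the day for only one host and drains overnight, the bottleneck is most likely on the collection side for that host rather than on the indexers.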


jgbricker
Contributor

I have the same issue. I was planning to refresh the opsec-lea connection with the latest Splunk Checkpoint add-on on a new heavy forwarder with no other ingestion/parsing (isolated) and go from there. I will likely end up doing that and opening a case.

Unless you have set up the Checkpoint add-on and know how its configuration works, it is hard to give guidance on this issue. It is a scripted input that uses an opsec-lea connection to pull logs from a queue.

All logs come from the Checkpoint management server. I suspect the issue is that there are more logs in the queue than the script is able to pull. I think of it like trying to bail water out of the sinking Titanic.

I also understand that they have a syslog option now, so that may be the answer; I just need to verify it provides all the needed info.

I will re-post if I find the answer.


mmoermans
Path Finder

Hi, thanks for the input. We've tried several different things (new heavy forwarder included) and the issue persists.
We're trying the syslog route instead now since it offers more redundancy.


jbullough
Path Finder

I have this exact issue. Did you ever find a solution?

woodcock
Esteemed Legend

Do not forward compressed files if many are created simultaneously:
https://answers.splunk.com/answers/542428/indexing-compressed-vs-raw-logs.html

Perhaps you are monitoring rolled files instead of the "hot" file?
Make sure there is enough headroom in the thruput limit in limits.conf for peak times; the default setting is far too low (50MB/min).
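For reference, that cap is the maxKBps setting in the [thruput] stanza of limits.conf on the forwarder; a minimal sketch (the value shown is illustrative, not a recommendation):

[thruput]
# 0 removes the cap; any other value is the maximum KB per second the forwarder will send
maxKBps = 0

A restart of the forwarder is needed for the change to take effect.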

You can examine the current state of the tailing processor as described here:
http://blogs.splunk.com/2011/01/02/did-i-miss-christmas-2/

How "deep" is the directory that you are monitoring and do you have a depth set?
How many files are in the directories that you are monitoring and do you ever delete them?
If you have too many files (stat is not free), Splunk will run slower and slower as more files accumulate; something must clean house periodically.
If the files are written atomically (not appended to continuously), consider using a batch input with the sinkhole move policy, which does the housekeeping for you inside Splunk:
http://docs.splunk.com/Documentation/Splunk/latest/admin/Inputsconf
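A minimal inputs.conf sketch of that batch/sinkhole approach, assuming a hypothetical drop directory (the path, index, and sourcetype are placeholders):

[batch:///var/log/checkpoint/export]
# sinkhole deletes each file once it has been indexed, so the directory never piles up
move_policy = sinkhole
index = checkpoint
sourcetype = opsec:lea
disabled = false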
I have also done this:
https://answers.splunk.com/answers/309910/how-to-monitor-a-folder-for-newest-files-only-file.html

Or it could even be something crazy like this:
https://answers.splunk.com/answers/51399/forwarder-stops-forwarding.html


mmoermans
Path Finder

It's Checkpoint logging on a heavy forwarder, which grabs the logs through a Python script called lealoggrabber, so tailing the log files isn't really an option.

Thruput in limits.conf is set to 10MB, and the specific queues have been increased as well (they were at 500KB and were raised to 20MB after showing blocked=true).

11-27-2017 15:35:49.756 +0100 INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=20480, current_size_kb=20479, current_size=32086, largest_size=32086, smallest_size=0

11-27-2017 15:35:49.756 +0100 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=20480, current_size_kb=20479, current_size=19334, largest_size=19334, smallest_size=0

The machine itself shows 40-50% CPU usage and about 30% memory usage.
Charting the delay shows it growing to almost 2 hours during the day, then catching up during evenings and weekends. It's about 200 GB/day worth of logging.
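For anyone tuning the same thing: queue sizes like those in the metrics lines above are set in server.conf on the heavy forwarder. A minimal sketch matching the 20MB mentioned (sizes are examples, not recommendations):

[queue=parsingQueue]
maxSize = 20MB

[queue=indexQueue]
maxSize = 20MB

If the queues still report blocked=true after an increase, the constraint is usually further down the pipeline rather than the queue size itself.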


woodcock
Esteemed Legend

How many indexers? If you have only a single indexer, you are at about the limit of what it can handle in a timely manner on this one input alone (and presumably you have others, too).


woodcock
Esteemed Legend

If the lag is consistent, it is probably due to the timestamp being misinterpreted, for example data that is really CDT being parsed as PDT. This is an exceedingly common problem.
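If that turns out to be the cause, the fix is a TZ override in props.conf on the heavy forwarder (the first full Splunk instance that parses the data); a minimal sketch with an illustrative sourcetype and timezone:

[opsec:lea]
# force timestamps to be read as US Central time; use whatever zone the data is actually written in
TZ = America/Chicago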


DalJeanis
SplunkTrust
SplunkTrust

How do the daily entire-day stats compare against each other? Which one has more events each day? Are the two hosts configured identically? On the lagging host, does the lag always increase, or does it sometimes drop as well?

Your data seems to show that host2 chunks at the end of the day contain events that vary by almost 1000 seconds from each other (2*(5139-4667)=944 seconds), so they are only being forwarded roughly every 15 minutes. Check the average age.

index=_internal log_level="INFO" component="Metrics" group="per_host_thruput" 
(series="host1" OR series="host2") 
| bin _time span=1h 
| stats count as chunks avg(avg_age) as avg_age by _time series

| rename COMMENT as "Add zero records for each hour, each day, each host just in case"
| appendpipe [
    | bin _time as day span=1d 
    | stats values(day) as day values(series) as series 
    | eval hour=mvrange(-1,24) 
    | mvexpand hour 
    | mvexpand day 
    | eval _time=day+if(hour<0,hour,3600*hour) 
    | table _time series 
    | mvexpand series 
    | eval chunks=0, avg_age=0
    ]

| rename COMMENT as "Sum them up with the zero records, then chart them"
| stats sum(chunks) as chunks, max(avg_age) as avg_age by series _time
| timechart span=1h avg(avg_age) as age by series

You can also check what is happening with the average chunk size across the day...

index=_internal log_level="INFO" component="Metrics" group="per_host_thruput"
(series="host1" OR series="host2") 
| bin _time span=1h 
| stats count as chunks avg(kb) as avg_kb by _time series

| rename COMMENT as "Add zero records for each hour, each day, each host just in case"
| appendpipe [
    | bin _time as day span=1d 
    | stats values(day) as day values(series) as series 
    | eval hour=mvrange(-1,24) 
    | mvexpand hour 
    | mvexpand day 
    | eval _time=day+if(hour<0,hour,3600*hour) 
    | table _time series 
    | mvexpand series 
    | eval chunks=0, avg_kb=0
    ]

| rename COMMENT as "Sum them up with the zero records, then chart them"
| stats sum(chunks) as chunks, max(avg_kb) as avg_kb by series _time
| timechart span=1h avg(avg_kb) as avg_kb by series

jplumsdaine22
Influencer

How does the data get to the heavy forwarder? Is it over syslog? Is there a universal forwarder in the mix? Also check the following on the heavy forwarder:
index=_internal blocked OR throttled
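A slightly narrower version of that check, charting which pipeline queues report blocked=true over time (a sketch using the standard metrics.log fields):

index=_internal source=*metrics.log* group=queue blocked=true
| timechart span=10m count by name

A sustained count for parsingqueue or indexqueue points at the forwarder's own pipeline; sustained blocking on a tcpout_* queue points at the link to the indexers.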
