Hello,
I have a case where the logs from 4 hosts are lagging behind. The reason I call it inconsistent is that the lag varies from 5 to 30 minutes, and sometimes there is no lag at all. When the logs don't show up for 30 minutes or more, I go to Forwarder Management, disable/enable the apps, and restart splunkd; the logs then continue with only 1-2 seconds of lag.
The other hosts also lag behind at peak hours, but only by 1 or 2 minutes (at most 5 minutes for sources with a large volume of logs).
I admit that our indexer cluster is not up to par on the IOPS requirements, but for 4 particular hosts to visibly underperform is quite concerning.
Can someone show me the steps to debug and solve this problem?
Are these UFs? Did you change the default thruput limit?
Yes, they're UFs. I already set
[thruput]
maxKBps = 0
in limits.conf in the app.
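(For reference, maxKBps = 0 removes the thruput cap entirely. To double-check that the setting is actually taking effect on the forwarder, btool shows the effective value and which file it comes from; the path below assumes a default install location.)

$SPLUNK_HOME/bin/splunk btool limits list thruput --debug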
1) Share which OS version, which UF version, and roughly how many inputs on those hosts
2) Search _internal for your hostname (or IP) for error codes
2.1) Is the UF generating errors?
2.2) Does the UF get indexing paused/congested reports back from the IDX tier?
2.3) Does the UF show round robin to all IDX elements or is there a discrepancy in outputs.conf?
Let's start with these; a few example searches are sketched below.
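For 2.1 to 2.3, searches along these lines should work (<uf_hostname> is a placeholder; the group and field names are from memory, so adjust them if your extractions differ):

index=_internal host=<uf_hostname> source=*splunkd.log* (log_level=ERROR OR log_level=WARN) | stats count by component
index=_internal host=<uf_hostname> source=*metrics.log* group=queue blocked=true | stats count by name
index=_internal host=<uf_hostname> source=*metrics.log* group=tcpout_connections | stats count by destIp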
After some investigation, here are the answers:
1) The OS is Red Hat 8 and the Splunk UF version is 9.1.1. We have 2 Splunk deployments, Splunk Enterprise and Splunk Security. On my end (Splunk Enterprise) there are only 2 inputs, but on the Security end there are a lot: 2 apps, HG_TA_Splunk_Nix and TA_nmon (roughly 40 inputs each), across the 4 hosts.
2.1) There are some ERRORs, but nothing noteworthy. They are below:
+700 ERROR TcpoutputQ [11073 TcpOutEloop] - Unexpected event id=<eventid> -> benign ERROR as per Splunk dev
+700 ERROR ExecProcessor [32056 ExecProcessor] - message from "$SPLUNKHOME/HG_TA_Splunk_Nix/bin/update.sh" https://repo.napas.local/centos/7/updates/x84_64/repodata/repomd.xml: [Errorno14] curl#7 - "Failed to connect to repo.napas.local:80; No route to host"
2.2) The HealthReporter shows an ingestion latency gap of 26684 seconds (roughly 7.4 hours) against a 210-second red threshold:
+700 INFO PeriodHealthReporter - feature="Ingestion latency" color=red/yellow indicator="ingestion_latency_gap_multiplier" due_to_threshold_value=1 measured_value=26684 reason="Events from tracker.log have not been seen for the last 26684 seconds, which is more than the red threshold ( 210 seconds ). This typically occurs when indexing or forwarding are falling behind or are blocked." node_type=indicator node_path=splunkd.file_monitor_input.ingestion_latency.ingestion_latency_gap_multiplier.
2.3) Searching _internal with | stats count by destIP shows:
idx1: 14248
idx2: 8014
idx3: 7963
idx4: 7809
This is more skewed than I expected; idx1 receives almost twice as many events as each of the other indexers (see the follow-up checks at the end of this post).
2.4) Another finding: the logs are now lagging 1 hour behind but are still being pulled and ingested. The internal logs, however, have stopped; the time now is 9:08, but the last internal log is from 8:19, with no errors. The last entry is:
+700 Metrics - group=thruput, name=uncooked_output, instantaneous_kbps=0.000, instantaneous_eps=0.000, average_kbps=0.000, total_k_processed=0.000, kb=0.000, ev=0, interval_sec=60
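A few follow-up checks these findings suggest (sketches only; <uf_hostname> is a placeholder and paths assume a default install):
To see whether the skew towards idx1 comes from the forwarders' output settings, btool shows the effective tcpout configuration (server list, autoLBFrequency/autoLBVolume):
$SPLUNK_HOME/bin/splunk btool outputs list tcpout --debug
To tell whether the forwarder has stopped logging locally or just cannot send, check if metrics.log on the host itself is still being written:
tail -n 5 $SPLUNK_HOME/var/log/splunk/metrics.log
To see when the ingestion latency gap from 2.2 started growing (assuming the key=value pairs are extracted as usual):
index=_internal host=<uf_hostname> feature="Ingestion latency" | timechart span=10m max(measured_value)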
Here is an excellent .conf presentation on how to find the reason for this lag: https://conf.splunk.com/files/2019/slides/FN1570.pdf
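While working through it, a simple indexing-delay search along these lines helps quantify the lag per host (<your_index> and <lagging_host> are placeholders):

index=<your_index> host=<lagging_host> earliest=-4h
| eval lag_seconds = _indextime - _time
| timechart span=5m avg(lag_seconds) max(lag_seconds)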