Getting Data In

Logs inconsistently lagging behind

tungpx
Explorer

Hello,

I have a case where the logs from 4 hosts are lagging behind. I say inconsistent because the lag varies from 5 to 30 minutes, and sometimes there is no lag at all. When the logs don't show up for 30 minutes or more, I go to Forwarder Management, disable and re-enable the apps, and restart splunkd; after that the logs continue with a lag of 1 or 2 seconds.

The other hosts also lag behind at peak hours, but only by 1 or 2 minutes (at most 5 minutes for sources with a large volume of logs).

I admit that our indexer cluster does not meet the IOPS requirement, but for 4 particular hosts to visibly underperform is quite concerning.

Can someone show me the steps to debug and solve the problem?

PickleRick
SplunkTrust

Are these UFs? Did you change the default thruput limit?

tungpx
Explorer

Yes, they're UFs. I have already set the following in limits.conf in the app:

[thruput]
maxKBps = 0

dural_yyz
Builder

1) Share which OS version, which UF version, and roughly how many inputs are on those hosts.

2) Search _internal for your hostname (IP) for error codes.

2.1) Is the UF generating errors?

2.2) Does the UF get indexing paused/congested reports back from the IDX tier?

2.3) Does the UF show round robin to all IDX elements, or is there a discrepancy in outputs.conf?

Let's start with these.

tungpx
Explorer

After some investigation, here are the answers:

1) The OS is Red Hat Linux 8 and the Splunk UF version is 9.1.1. We have 2 Splunk deployments, Splunk Enterprise and Splunk Security. On my end (Splunk Enterprise) there are only 2 inputs, but on the Security end there are a lot, with 2 apps, HG_TA_Splunk_Nix and TA_nmon (roughly 40 inputs each), across the 4 hosts.

2.1) There are some ERROR entries, but none noteworthy. The errors are below:

+700 ERROR TcpoutputQ [11073 TcpOutEloop] - Unexpected event id=<eventid>  -> benign ERROR as per Splunk dev

+700 ERROR ExecProcessor [32056 ExecProcessor] - message from "$SPLUNKHOME/HG_TA_Splunk_Nix/bin/update.sh" https://repo.napas.local/centos/7/updates/x84_64/repodata/repomd.xml: [Errorno14] curl#7 - "Failed to connect to repo.napas.local:80; No route to host"

2.2) HealthReporter shows:

+700 INFO PeriodHealthReporter - feature="Ingestion latency" color=red/yellow indicator="ingestion_latency_gap_multiplier" due_to_threshold_value=1 measured_value=26684 reason=Events from tracker.log have not been seen for the last 26684 seconds, which is more than the red threshold ( 210 seconds ). This typically occurs when indexing or forwarding are falling behind or are blocked." node_type=indicator node_path=splunkd.file_monitor_input.ingestion_latency.ingestion_latency_gap_multiplier.
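
For scale, the measured_value in that line can be converted into hours and compared with the red threshold. This is a minimal sketch in plain Python, not anything Splunk provides; the variable names are made up for illustration and the two numbers are copied from the HealthReporter line above.

```python
from datetime import timedelta

# Both values are taken from the HealthReporter line above.
measured_seconds = 26684      # measured_value
red_threshold_seconds = 210   # red threshold

# Convert the gap to h:mm:ss and express it as a multiple of the threshold.
gap = timedelta(seconds=measured_seconds)
ratio = measured_seconds / red_threshold_seconds

print(f"gap = {gap}, about {ratio:.0f}x the red threshold")
```

So tracker.log events had not been seen for roughly 7 hours 25 minutes, far longer than the 5 to 30 minute lag seen on the data itself.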

2.3) Searching the _internal log with | stats count by destIP shows:

idx1: 14248

idx2: 8014

idx3: 7963

idx4: 7809

Which is more concerning than I thought it would be. 
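
To quantify the skew, the counts above can be turned into shares and compared with an even 25% split across 4 indexers. This is a minimal sketch in plain Python; the function name share_by_indexer is hypothetical and the counts are copied from the stats output above.

```python
# Event counts per indexer, copied from the stats output above.
counts = {"idx1": 14248, "idx2": 8014, "idx3": 7963, "idx4": 7809}


def share_by_indexer(counts):
    """Return each indexer's fraction of the total event count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = share_by_indexer(counts)
even_share = 1 / len(counts)  # share each indexer would get if perfectly balanced

for name, share in shares.items():
    print(f"{name}: {share:.1%} of events ({share / even_share:.2f}x an even share)")
```

With these numbers idx1 receives about 37% of the events, roughly 1.5 times an even share, while the other three each sit near 21%.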

2.4) Another finding: the logs are now lagging 1 hour behind but are still being pulled and ingested. The internal logs, however, have stopped. The time now is 9:08, but the last internal log entry is from 8:19, with no error. That last entry is:

+700 Metrics - group=thruput, name=uncooked_output, instantaneous_kbps=0.000, instantaneous_eps=0.000, average_kbps=0.000, total_k_processed=0.000, kb=0.000, ev=0, interval_sec=60

isoutamo
SplunkTrust

Here is an excellent .conf presentation on how to find the reason for this lag: https://conf.splunk.com/files/2019/slides/FN1570.pdf
