Getting Data In

Logs inconsistently lagging behind

tungpx
Explorer

Hello,

I have a case where the logs from 4 hosts are lagging behind. I say "inconsistent" because the lag varies from 5 to 30 minutes, and sometimes there is no lag at all. When the logs don't show up for 30 minutes or more, I go to Forwarder Management, disable/enable the apps, and restart splunkd; the logs then continue with a 1-2 second lag.

The other hosts also lag behind at peak hours, but only by 1 or 2 minutes (5 minutes at most, for sources with a large volume of logs).

I admit that our indexer cluster is not up to par on the IOPS requirements, but for 4 particular hosts to visibly underperform is quite concerning.

Can someone show me steps to debug and solve this problem?


PickleRick
SplunkTrust

Are these UFs? Did you change the default thruput limit?


tungpx
Explorer

Yes, they're UFs. I have already set

[thruput]

maxKBps = 0

in limits.conf in the app.
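
(To double-check that this value actually wins configuration precedence over any other app, btool can print the effective stanza and the file each setting came from. This is a standard Splunk CLI command, run on the forwarder itself:)

```
# Shows the effective [thruput] stanza and which limits.conf
# file each setting was read from
$SPLUNK_HOME/bin/splunk btool limits list thruput --debug
```

If maxKBps is reported from a different app than the one you edited, that app is overriding your setting.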


dural_yyz
Motivator

1) Share which OS version, which UF version, and roughly how many inputs are on those hosts.

2) Search _internal for your hostname (IP) for error codes:

2.1) Is the UF generating errors?

2.2) Does the UF get indexing paused/congested reports back from the IDX tier?

2.3) Does the UF show round robin to all IDX elements, or is there a discrepancy in outputs.conf?

Let's start with these.
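
As a starting point, searches along these lines cover 2.1-2.3 (the host value is a placeholder for your forwarder; field names follow the standard metrics.log conventions):

```
index=_internal host=<uf_host> log_level=ERROR
| stats count by component

index=_internal host=<uf_host> source=*metrics.log* group=queue blocked=true
| stats count by name

index=_internal host=<uf_host> source=*metrics.log* group=tcpout_connections
| stats count by name
```

The first shows which components are erroring, the second shows which pipeline queues (if any) are blocking, and the third shows whether output connections are spread across all indexers.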


tungpx
Explorer

After some investigation, here are the answers:

1) The OS is Red Hat Linux 8 and the Splunk UF version is 9.1.1. We have 2 Splunk deployments, Splunk Enterprise and Splunk Security. On my end (Splunk Enterprise) there are only 2 inputs, but on the Security end there are a lot, with 2 apps, HG_TA_Splunk_Nix and TA_nmon (roughly 40 inputs each), across the 4 hosts.

2.1) There are some ERRORs, but nothing noteworthy. The errors are below:

+700 ERROR TcpoutputQ [11073 TcpOutEloop] - Unexpected event id=<eventid>  -> benign ERROR as per Splunk dev

+700 ERROR ExecProcessor [32056 ExecProcessor] - message from "$SPLUNKHOME/HG_TA_Splunk_Nix/bin/update.sh" https://repo.napas.local/centos/7/updates/x84_64/repodata/repomd.xml: [Errorno14] curl#7 - "Failed to connect to repo.napas.local:80; No route to host"

2.2) HealthReporter shows

+700 INFO PeriodHealthReporter - feature="Ingestion latency" color=red/yellow indicator="ingestion_latency_gap_multiplier" due_to_threshold_value=1 measured_value=26684 reason="Events from tracker.log have not been seen for the last 26684 seconds, which is more than the red threshold (210 seconds). This typically occurs when indexing or forwarding are falling behind or are blocked." node_type=indicator node_path=splunkd.file_monitor_input.ingestion_latency.ingestion_latency_gap_multiplier.
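
(26684 seconds is roughly 7.4 hours of ingestion gap against a 210-second red threshold, which usually points to a blocked pipeline rather than slow file reading. Assuming the standard metrics.log queue fields, queue fill on the UF can be charted like this, with the host value as a placeholder:)

```
index=_internal host=<uf_host> source=*metrics.log* group=queue
| eval pct_full = round(current_size_kb / max_size_kb * 100, 1)
| timechart max(pct_full) by name
```

A queue pinned at or near 100% identifies where the pipeline is backing up.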

2.3) Searching _internal with | stats count by destIP shows:

idx1: 14248

idx2: 8014

idx3: 7963

idx4: 7809

That skew is more concerning than I thought it would be.

2.4) Another finding. The logs are now lagging 1 hour behind but are still being pulled/ingested. The internal logs, however, have stopped: the time now is 9:08, but the last internal log entry is from 8:19, with no errors. The last entry is

+700 Metrics - group=thruput, name=uncooked_output, instantaneous_kbps=0.000, instantaneous_eps=0.000, average_kbps=0.000, total_k_processed=0.000, kb=0.000, ev=0, interval_sec=60
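
(Zero uncooked_output thruput combined with _internal logs stopping is consistent with the forwarder's tcpout queue being blocked, so nothing, including _internal, leaves the host. As a sketch of a next check, the standard Splunk CLI on the forwarder lists which configured indexers currently have active connections:)

```
# Shows configured receiving indexers, split into
# active and inactive forwards
$SPLUNK_HOME/bin/splunk list forward-server
```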

 


isoutamo
SplunkTrust

Here is an excellent .conf presentation on how to find the reason for this lag: https://conf.splunk.com/files/2019/slides/FN1570.pdf
