Hello,
I have a case where the logs from 4 hosts are lagging behind. The reason I call it inconsistent is that the lag varies from 5 to 30 minutes, and sometimes there is no lag at all. When the logs don't show up for 30 minutes or more, I go to Forwarder Management, disable/enable the apps, and restart splunkd; the logs then continue with only 1-2 seconds of lag.
The other hosts also lag behind at peak hours, but only by 1 or 2 minutes (at most 5 minutes for sources with a large volume of logs).
I admit that our indexer cluster is not up to par on the IOPS requirements, but for 4 particular hosts to visibly underperform is quite concerning.
Can someone show me the steps to debug and solve this problem?
Are these UFs? Did you change the default thruput limit?
Yes, they're UFs. I already set
[thruput]
maxKBps = 0
in limits.conf in the app.
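(For reference, maxKBps = 0 removes the thruput cap entirely. To double-check that the setting is actually taking effect on the forwarder, btool shows the effective value and which file it comes from; the path below assumes a default install location.)

$SPLUNK_HOME/bin/splunk btool limits list thruput --debug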
1) Share which OS version, which UF version, and roughly how many inputs on those hosts
2) Search _internal for your hostname (or IP) for error codes
2.1) Is the UF generating errors?
2.2) Does the UF get indexing paused/congested reports back from the IDX tier?
2.3) Does the UF show round robin to all IDX elements or is there a discrepancy in outputs.conf?
Let's start with these; a few example searches are sketched below.
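For 2.1 to 2.3, searches along these lines should work (<uf_hostname> is a placeholder; the group and field names are from memory, so adjust them if your extractions differ):

index=_internal host=<uf_hostname> source=*splunkd.log* (log_level=ERROR OR log_level=WARN) | stats count by component
index=_internal host=<uf_hostname> source=*metrics.log* group=queue blocked=true | stats count by name
index=_internal host=<uf_hostname> source=*metrics.log* group=tcpout_connections | stats count by destIp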
After some investigation, here are the answers:
1) The OS is Red Hat 8 and the Splunk UF version is 9.1.1. We have 2 Splunk deployments, Splunk Enterprise and Splunk Security. On my end (Splunk Enterprise) there are only 2 inputs, but on the Security end there are a lot: 2 apps, HG_TA_Splunk_Nix and TA_nmon (roughly 40 inputs each), across the 4 hosts.
2.1) There are some ERRORs, but nothing noteworthy. They are below:
+700 ERROR TcpoutputQ [11073 TcpOutEloop] - Unexpected event id=<eventid> -> benign ERROR as per Splunk dev
+700 ERROR ExecProcessor [32056 ExecProcessor] - message from "$SPLUNKHOME/HG_TA_Splunk_Nix/bin/update.sh" https://repo.napas.local/centos/7/updates/x84_64/repodata/repomd.xml: [Errorno14] curl#7 - "Failed to connect to repo.napas.local:80; No route to host"
2.2) The HealthReporter shows an ingestion latency gap of 26684 seconds (roughly 7.4 hours) against a 210-second red threshold:
+700 INFO PeriodHealthReporter - feature="Ingestion latency" color=red/yellow indicator="ingestion_latency_gap_multiplier" due_to_threshold_value=1 measured_value=26684 reason="Events from tracker.log have not been seen for the last 26684 seconds, which is more than the red threshold ( 210 seconds ). This typically occurs when indexing or forwarding are falling behind or are blocked." node_type=indicator node_path=splunkd.file_monitor_input.ingestion_latency.ingestion_latency_gap_multiplier.
2.3) Searching _internal with | stats count by destIP shows:
idx1: 14248
idx2: 8014
idx3: 7963
idx4: 7809
This is more skewed than I expected; idx1 receives almost twice as many events as each of the other indexers (see the follow-up checks at the end of this post).
2.4) Another finding: the logs are now lagging 1 hour behind but are still being pulled and ingested. The internal logs, however, have stopped; the time now is 9:08, but the last internal log is from 8:19, with no errors. The last entry is:
+700 Metrics - group=thruput, name=uncooked_output, instantaneous_kbps=0.000, instantaneous_eps=0.000, average_kbps=0.000, total_k_processed=0.000, kb=0.000, ev=0, interval_sec=60
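A few follow-up checks these findings suggest (sketches only; <uf_hostname> is a placeholder and paths assume a default install):
To see whether the skew towards idx1 comes from the forwarders' output settings, btool shows the effective tcpout configuration (server list, autoLBFrequency/autoLBVolume):
$SPLUNK_HOME/bin/splunk btool outputs list tcpout --debug
To tell whether the forwarder has stopped logging locally or just cannot send, check if metrics.log on the host itself is still being written:
tail -n 5 $SPLUNK_HOME/var/log/splunk/metrics.log
To see when the ingestion latency gap from 2.2 started growing (assuming the key=value pairs are extracted as usual):
index=_internal host=<uf_hostname> feature="Ingestion latency" | timechart span=10m max(measured_value)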
Here is an excellent .conf presentation on how to find the reason for this lag: https://conf.splunk.com/files/2019/slides/FN1570.pdf
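While working through it, a simple indexing-delay search along these lines helps quantify the lag per host (<your_index> and <lagging_host> are placeholders):

index=<your_index> host=<lagging_host> earliest=-4h
| eval lag_seconds = _indextime - _time
| timechart span=5m avg(lag_seconds) max(lag_seconds)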