Splunk Enterprise

How to see Root cause for Ingestion Latency

tdavison76
Path Finder

Hello, we have a Red status for Ingestion Latency, and it says the following:

 

 Red: The feature has severe issues and is negatively impacting the functionality of your deployment. For details, see Root Cause.

 

However, I can't figure out how to see the "Root Cause". What report should I look at that would show me where this latency is occurring?

Thanks for all of the help,

Tom

 


inventsekar
SplunkTrust

Hi @tdavison76 

Could you share some more details, please?

Is it cloud or on-prem?

Where do you see that Red status for Ingestion Latency? Is it on a dashboard or in the DMC (Monitoring Console)?

 

thanks and best regards,
Sekar

PS - If this or any post helped you in any way, please consider upvoting. Thanks for reading!

tdavison76
Path Finder

Hello, thank you for your help. I am seeing the Red status in the Health Report. We are on-prem. Right now it is showing Yellow, but it frequently flips back to Red. The description says to see Root Cause for details, but I can't figure out how to view the "Root Cause".

 

[Screenshot attached: HealthReport.png]

Thanks again,

Tom

 


PickleRick
SplunkTrust

The Ingestion Latency indicator is based on "checkpoint" files generated by the forwarders. The file (var/spool/tracker/tracker.log) is periodically generated on a UF and contains a timestamp, which Splunk compares after ingestion to see how long it took for that file to reach the indexer.
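As for actually seeing the "Root Cause" text: clicking the health status badge in the Splunk Web top bar should open the per-feature tree with the root cause message, and the same details are exposed over REST, so you can also pull them with a search. A rough sketch only (the endpoint is in the REST reference; the exact flattened field names in the output vary by version):

| rest /services/server/health/splunkd/details
| transpose

The transpose just turns the single wide result row into something readable. Run it on the instance that is reporting the problem; the ingestion_latency entries should contain the same root cause message the health report refers to.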

There is one case where the latency alert is a false positive - sometimes the input doesn't properly delete the file after ingesting its contents, so new timestamps keep getting appended to the end of the file. It has happened to me once or twice.

But other than that, a latency warning simply means that it takes "too long" for the data to get from being read by the UF to being indexed by the indexers. The possible reasons include (see the search sketch after this list):

1. Load on the forwarder (this is usually not an issue if you're ingesting logs from a server which normally does some other production work and you only ingest its own logs, but it might be an issue if you have a "log gatherer" setup receiving logs from a wide environment).

2. Throttling on output due to bandwidth limits.

3. A big backlog of events to ingest (can happen if the UF wasn't running for some time, or if you're installing a fresh UF on a host which was already running and had produced logs you want ingested).

4. Connectivity/configuration problems preventing the UF from sending the buffered data to indexers.

5. Blocked receivers due to performance problems.
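If you want to see where that latency is actually showing up, one rough way is to compare index time with event time per host. A sketch only - adjust the index scope and time range to your environment, and keep in mind that _indextime - _time also flags sources that simply log with old timestamps:

index=* earliest=-60m
| eval latency_sec = _indextime - _time
| stats avg(latency_sec) AS avg_latency_sec max(latency_sec) AS max_latency_sec count BY host
| sort - max_latency_sec

Hosts that consistently show a high max_latency_sec are the ones whose forwarders (or intermediate tier) are falling behind.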


fatsug
Contributor

I have similar issues popping up as of late. But how does one isolate the affected forwarder?

The error message reads:

Forwarder Ingestion Latency

 

  • Root Cause(s):
    • Indicator 'ingestion_latency_gap_multiplier' exceeded configured value. The observed value is 89. Message from <UUID>:<ip-addrs>:54246
  • Unhealthy Instances:
    • indexer1
    • indexer2

     

    The "message from" section just lists the UUID, an IP adress and a port. Which part here would help me find the actual forwarder? The UUID does not match any "Client name" under forwarder management on the deployment server. The IP adress does not match a server on which I have a forwarder installed.

    One or a few of the indexers are listed as "unhealthy instances" each time. But the actual error sounds like it lives in the forwarder end and not on the indexer.

    With the available information in this warning/error. How can I figure out which forwarder is either experiencing latency issues OR need to have that log file mentioned flushed.
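Would something like the following help map that UUID/IP back to a forwarder? Untested sketch - it assumes the tcpin_connections metrics on the indexers carry guid, sourceIp and hostname fields for each connected forwarder:

index=_internal source=*metrics.log* group=tcpin_connections
| stats latest(hostname) AS forwarder latest(fwdType) AS fwd_type latest(version) AS version BY guid, sourceIp

Filtering that on the UUID or IP from the health message should, in theory, give the forwarder's hostname even when it doesn't show up as a deployment client.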
