So we just updated to 8.2.1 and we are now getting an Ingestion Latency error…
How do we correct it? Here is what the link says, and then we have an option to view the last 50 messages...
Here are some examples of what is shown as the messages:
07-01-2021 09:28:52.269 -0500 INFO TailingProcessor [66180 MainTailingThread] - Adding watch on path: C:\Program Files\CrushFTP9\CrushFTP.log.
We had this problem after upgrading to v8.2.3 and have found a solution.
After disabling the SplunkUniversalForwarder, SplunkLightForwarder, and SplunkForwarder apps on splunkdev01, the system returned to normal operation. These apps were enabled on the indexer but should have been disabled by default. Also, loading a universal forwarder that is not compatible with v8.2.3 will cause ingestion latency and TailReader errors. We had some Solaris 5.1 servers (forwarders) that are no longer compatible with upgrades, so we just kept them on 8.0.5; the upgrade requires Solaris 11 or later.
The first thing I did was go to the web interface, open Manage Apps, and search for *forward*.
This showed the three forwarder apps I needed to disable, and I disabled them from the interface.
I also ran these commands from the CLI on the indexer:
splunk disable app SplunkForwarder -auth <username>:<password>
splunk disable app SplunkLightForwarder -auth <username>:<password>
splunk disable app SplunkUniversalForwarder -auth <username>:<password>
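One thing I'd add: I believe the disable only fully takes effect after a restart, and (at least on our version) you can confirm the status afterwards with the display command, something like:
splunk restart
splunk display app SplunkForwarder -auth <username>:<password>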
After doing these things, the ingestion latency and TailReader errors stopped.
FWIW, my support case is still open and I still have no answers. I do have many support people telling me the problem doesn't exist, so I reply with screenshots showing that it still does.
The original resolution suggested was to disable the monitoring/alerting for this service. If anyone is interested in this solution, I'm happy to post it - but as it doesn't solve the underlying issue, and all it does is stop the alert telling you the issue exists, I haven't bothered testing/implementing it myself.
Splunk support has replied and confirmed (finally) that the ingestion latency issue is a known bug for both on-prem and cloud customers.
Their suggested solution is to disable the monitoring/alerts.
Note - this doesn't fix the ingestion issues (which are causing indexing to be skipped and therefore data loss); it only stops warning you about them.
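For anyone who does want to silence the alert anyway, my understanding is that it comes down to a health.conf change in $SPLUNK_HOME/etc/system/local/health.conf, roughly like the below (the stanza and indicator names are as I read them in the health.conf spec; I haven't tested this myself, so check the docs for your version):
[feature:ingestion_latency]
alert:ingestion_latency_gap_multiplier.disabled = 1
alert:ingestion_latency_lag_multiplier.disabled = 1
I'd restart afterwards to be safe.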
Hi, in our case the same thing happens. It seems to be monitoring a tracker.log file that does not exist on any of our deployed hosts, yet we have this problem on both the search head and the indexer.
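If it helps anyone compare, my understanding is that the health check writes that tracker.log under the spool directory, so that's where I'd look first (path assumed from a default install):
ls -l $SPLUNK_HOME/var/spool/splunk/tracker.log*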
In our situation, the problem was actually the permissions on one particular log file. It appears that when Splunk was upgraded, the permissions on the log file were set to root only, and Splunk was not able to read it. We don't run Splunk as the root user, so we had no choice but to change ownership so the splunk user could read it. We are running RHEL 8.x, so "chown -R splunk:splunk /opt/splunk" did the trick. Once we restarted Splunk, the issue went away immediately.
Just like several others mentioned previously, we were only seeing the issue on our cluster master and on no other Splunk application server. Hope this helps!
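For anyone checking for the same thing, the sequence on our side was roughly this (assuming a default /opt/splunk install with a splunk user and group):
ls -lR /opt/splunk/var | grep root
chown -R splunk:splunk /opt/splunk
/opt/splunk/bin/splunk restart
The first command is just a rough way to spot anything under the var directory still owned by root, so you can see whether permissions are the culprit before changing ownership.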
FOUND A FIX!... or rather a workaround, and hopefully it works for you all.
I've been working tirelessly with a Splunk senior technical support engineer until midnight for the past two days in an effort to fault-find and fix this problem. Support seems to think it is a scaling issue, as they suspect network latency and our two indexers being overwhelmed.
This makes no sense to me, as our environment is sufficiently scaled per the Splunk Validated Architectures for our number of users and data ingestion volume.
Anyway, I've spotted the issue, classed as uncategorised under the 9.0 known issues. It was only logged two weeks ago, and I'm surprised (or not really) that support failed to pick this up and instead took me on a wild goose chase of fault finding.
Turn off the useACK setting:
useACK = false
in any outputs.conf file you can locate on the affected instances, then restart for the changes to take effect. This should stop the tracker.log errors, and data should flow through continuously again.
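To show where that goes, the stanza in outputs.conf ends up looking something like the below (the group name and server list here are just placeholders for illustration):
[tcpout:my_indexers]
server = indexer01:9997,indexer02:9997
useACK = false
If you're not sure which file is actually supplying the setting, running splunk btool outputs list --debug on the affected instance will show which outputs.conf each line comes from.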
I'm raising an eyebrow at this being a true workaround (but certainly hoping that it is). While I do agree that queues get blocked on the forwarding nodes, the verbiage is a bit vague.
I've been fighting this issue for weeks and had been on that exact page looking for it without finding it, because I was searching for "ingestion latency", not blocked queues. Blocked queues are a broad category, and there is usually a performance reason why a queue is blocked.
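For anyone else trying to confirm whether queues really are blocking, a search against the internal metrics along these lines should show it (field names are as I understand metrics.log, so adjust for your environment):
index=_internal source=*metrics.log* group=queue blocked=true
| stats count by host, name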
If you make it 3 or 4 days without the issue popping back up I'd say this workaround is solid.
Anyway fingers crossed