Hi everyone, I have implemented Splunk for my architecture at work. To give you some insight, the architecture is pretty big and spans 3 sites. I have 3 heavy forwarders in total (1 per site) and 2 indexers. The forwarder ingestion latency alert is showing on all the HFs and both indexers. I have gone through the recommendations from the community and the Splunk documentation, with no luck. The problem is that some logs are now being lost. I have increased the queue size on all servers to 100MB and turned ACK off, but I am still facing issues. Does anyone have any recommendations that might be useful?
Latency on its own will not cause event loss; it only shows the delay in processing. It can, however, be a symptom of a bigger underlying problem.
Inflating queues will only help you if you have spikes in data flow. It won't help if the indexers can't keep up with the ingested data.
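To put rough numbers on that, here is a minimal Python sketch. The rates are made up for illustration, not measured from your environment, and `seconds_until_full` is just a hypothetical helper name. It estimates how long an empty queue can absorb traffic when data arrives faster than it drains.

```python
def seconds_until_full(queue_mb, in_mb_per_s, out_mb_per_s):
    """Return how long an empty queue lasts before it fills, or None if it never fills."""
    surplus = in_mb_per_s - out_mb_per_s
    if surplus <= 0:
        return None  # the drain rate keeps up, so the queue never fills
    return queue_mb / surplus


# Hypothetical numbers: a 100 MB queue, 5 MB/s arriving, 4 MB/s being indexed.
print(seconds_until_full(100, 5.0, 4.0))  # prints 100.0

# Same queue, but the indexers drain faster than data arrives.
print(seconds_until_full(100, 3.0, 4.0))  # prints None
```

With a sustained 1 MB/s surplus, even a 100 MB queue only buys about 100 seconds before it blocks. A bigger queue moves that point further out but does not remove it.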
What is your ingest rate? What are your indexers' specs? Do these HFs work as intermediate forwarders? Do you experience data loss from all sources or only some of them? How is limits.conf configured on each component? There are so many questions to answer here.
And 2 indexers is not that "big", so if your environment is indeed big, you might, as @richgalloway suggested, need to scale up.
Are you seeing queueing in the HFs or indexers? Which queues in the indexer are empty? That will tell us where the delay is coming from.
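One way to answer the queue question is to look at the `group=queue` lines in metrics.log on each HF and indexer. Below is a rough Python sketch that reads such lines and reports the peak fill ratio per queue. It assumes the usual `name=`, `max_size_kb=` and `current_size_kb=` fields are present; the two sample lines are made up for illustration, and `queue_fill` is a hypothetical helper, not a Splunk tool.

```python
import re
from collections import defaultdict

# Matches key=value pairs such as name=indexqueue or current_size_kb=512
PAIR = re.compile(r"(\w+)=([^,\s]+)")


def queue_fill(lines):
    """Return {queue_name: peak fill ratio} from metrics.log group=queue lines."""
    peak = defaultdict(float)
    for line in lines:
        fields = dict(PAIR.findall(line))
        if fields.get("group") != "queue":
            continue
        try:
            name = fields["name"]
            ratio = float(fields["current_size_kb"]) / float(fields["max_size_kb"])
        except (KeyError, ValueError, ZeroDivisionError):
            continue  # skip lines without usable name or size fields
        peak[name] = max(peak[name], ratio)
    return dict(peak)


# Made-up sample lines in the general shape of metrics.log queue entries.
sample = [
    "05-01-2024 10:00:00.000 +0000 INFO  Metrics - group=queue, name=parsingqueue, "
    "max_size_kb=102400, current_size_kb=102400, current_size=1500, largest_size=1500, smallest_size=0",
    "05-01-2024 10:00:00.000 +0000 INFO  Metrics - group=queue, name=indexqueue, "
    "max_size_kb=102400, current_size_kb=0, current_size=0, largest_size=3, smallest_size=0",
]

result = queue_fill(sample)
for name, ratio in sorted(result.items()):
    print(f"{name}: {ratio:.0%}")
```

Read the result in pipeline order (parsing, aggregation, typing, index). If an upstream queue sits near 100% while the queues after it stay empty, the slow stage is most likely the one right after the full queue.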
Can you add indexers? Two may not be enough to support 3 sites, depending on the volume (which is what, BTW?).
Consider adding another HF to each site for redundancy. Why HFs instead of UFs? You'll get better performance with UFs.