We have a Linux server running Splunk forwarder which forwards to one of two heavy forwarders in an autolb configuration.
The Splunk forwarder reports that it connects to the heavy forwarder, but I get a message in splunkd.log that says
forwarding to indexer group default-autolb group blocked for <nnnnn> seconds.
From the point of view of the deployment monitor running on the indexer, the Splunk forwarder in question is "missing".
Please help us diagnose our problem as we have a demo to a customer tomorrow.
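One way to quantify the problem above is to tally the "blocked for &lt;n&gt; seconds" warnings in splunkd.log, so you can see how often forwarding stalls and for how long. This is a generic sketch, not something from the thread; the log path shown in the comment is the default for a Universal Forwarder install and is an assumption — adjust it for your environment.

```shell
# Summarize "blocked for <n> seconds" warnings, most frequent first.
# Reads log lines on stdin; on a forwarder, feed it splunkd.log, e.g.
# (default path, an assumption -- adjust for your install):
#   blocked_summary < /opt/splunkforwarder/var/log/splunk/splunkd.log
blocked_summary() {
  grep -o 'blocked for [0-9]* seconds' | sort | uniq -c | sort -rn
}
```

A cluster of long or rapidly repeating stalls points at the output queue filling up (e.g. the downstream heavy forwarder not draining, or ACKs not coming back), rather than a one-off network blip.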
Plenty of suggestions here:
Jkat54 - thanks for your response. Here is some more data:
I'm seeing the light forwarders connecting to the heavy forwarders intermittently, but the connections keep dropping.
On the light forwarders, I'm getting errors like:

Read operation timed out expecting ack from ...
Possible duplication of events with channel=source ... offset=... on host ...
Raw connection to ... timed out
Forwarding blocked ...
Applying quarantine to ...
Removing quarantine from ...
On the heavy forwarders, I get errors like:
Forwarding to ... blocked
From the point of view of the deployment monitor, all the light forwarders in the system keep toggling between active and missing...
If I run ./splunk list forward-server on the light forwarders, I do not get consistent results...
We're using SSL. netstat reports connections on port 8081 (light forwarders to heavy forwarders) and 8082 (heavy forwarders to indexers).
We can close this. Of the many servers (Splunk light forwarders) that were failing to report, I rebooted one of the ones that was reporting all the forwarding-blocked error messages. Within 2 minutes the other servers began reporting in, and within 15 minutes all 34 servers in the domain had successfully reported and forwarded a day's worth of data to the heavy forwarders.
Though the issue is fixed, I'd like to know whether something we did, or something in our config, caused this to happen. Is there a tuning parameter set too tight, for example?
Thanks again to Jkat54.
Thanks for any feedback you can give here.
Nothing strikes me as being 'the problem'. Believe it or not, restarting to fix the problem works fairly often.
In your case I would set up an alert that monitors your _internal index and fires if the condition occurs again; at least then you'll know the fix the next time it happens. If it keeps happening, I would continue digging with a support ticket, etc.
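As a concrete starting point for such an alert, a scheduled search over _internal could watch for the blocked-forwarding message. The stanza below is an illustrative savedsearches.conf sketch: the stanza name, schedule, thresholds, and email address are all assumptions, and you should adjust the search string to match the exact wording in your own splunkd.log.

```
# savedsearches.conf -- stanza name, schedule, and recipient are
# illustrative assumptions, not settings from this thread.
[Forwarding blocked on any forwarder]
search = index=_internal sourcetype=splunkd "blocked for" | stats count by host
enableSched = 1
cron_schedule = */15 * * * *
counttype = number of events
relation = greater than
quantity = 0
action.email = 1
action.email.to = ops@example.com
```

With something like this in place, you would get notified as soon as any forwarder starts logging blocked-queue warnings, instead of discovering it when hosts go "missing" in the deployment monitor.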