Currently, I have 2 seperate clusters. One 'old' 6.0 cluster, and a new cluster for 6.2.
The idea is to have our forwarders forwarding to both clusters at the same time. I modified the outputs.conf on the forwarders, and can see events coming in on both clusters. So far, so good.
When I take a closer look, I can see events dropping on most forwarders:
index=_internal sourcetype=splunkd "has begun dropping events"
I can't find the root cause of this. No queues are blocked, network seems to be ok, and the indexers (both clusters) are fine too. Also, when I look closer on the local queues, I cannot see any alarming levels as well. No throtteling either (no maxkbps messages)
index=_internal source="/opt/splunkforwarder/var/log/splunk/metrics.log" group=queue current_size_kb>0
Only message that occurs frequently is "File descriptor cache is full (100), trimming". For what I could find, it should be regarderd as an informational message, not really harming anything.
Who can help me out to find the actual bottleneck?
Might be useful, the actual error msg:
06-10-2015 14:00:25.833 +0200 INFO TcpOutputProc - Queue for group splunknw has begun dropping events
06-10-2015 14:00:25.833 +0200 INFO TcpOutputProc - Queue for group splunknw has stopped dropping events
06-10-2015 14:00:34.829 +0200 INFO TcpOutputProc - Queue for group splunknw has begun dropping events