We have a support ticket open, but I thought I'd also ask the community. Since upgrading our Splunk to 8.0.1, this one HF has been spewing "TcpOutputProc - Possible duplication of events" for most channels, as well as "TcpOutputProc - Applying quarantine to ip=xx.xx.xx.xx port=9998 _numberOfFailures=2".
We upgraded on the 15th near midnight. This is a count of those errors from that host.
Here is a count from the indexer cluster showing the number of blocked=true events. One would expect these to be similar in count if the indexers were telling the HF to go elsewhere because their queues were full.
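For anyone who wants to reproduce that count, a search along these lines should do it. This is only a sketch: the host name is the one from the search further down, and the two quoted strings are the messages above; adjust both for your environment.

```
index=_internal host=ghdsplfwd01lps sourcetype=splunkd component=TcpOutputProc ("Possible duplication of events" OR "Applying quarantine")
| timechart span=1d count
```

A jump in the daily count right after the upgrade date is what would point at the upgrade rather than at a gradual load increase.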
index=_internal host=INDEXERNAMES sourcetype=splunkd source=/opt/splunk/var/log/splunk/metrics.log blocked=true component=Metrics
| timechart span=1d count by source
Lastly, it's not just one source or channel, it's everything from the host.
index=_internal component=TcpOutputProc host=ghdsplfwd01lps log_level=WARN duplication
| rex field=event_message "channel=source::(?<channel>[^|]+)"
| stats count by channel
web_ping://EnterWorks PRD 27871
web_ping://RTP DEV 27870
If you have many separate transforms in props.conf for an individual source, source type, etc., try combining them into one line. For example, replace the first two lines below with the third:
TRANSFORMS-foo = foo1
TRANSFORMS-bar = bar1
TRANSFORMS-foobar = foo1, bar1
This helped in our case after updating from 6.6.5 to 7.3.3.
The HF is still "sick" but here are some things we did that seemed to help.
I'm a little concerned about #2 there. We could still be having issues with the outputs, only now the events are being dropped on the floor. In other words, the condition may still be present; we have simply turned off the logging by removing useACK.
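One way to check whether the underlying condition is still there, independent of the TcpOutputProc warnings, might be to look at the HF's own queue metrics. This is a rough sketch, assuming the HF writes the usual group=queue lines to metrics.log and that its _internal data still reaches the indexers:

```
index=_internal host=ghdsplfwd01lps sourcetype=splunkd source=*metrics.log* group=queue blocked=true
| timechart span=1d count by name
```

If the blocked counts per queue stay high after useACK was removed, the outputs are probably still backing up and only the warnings have gone away.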