We have a support ticket open, but I thought I'd also ask the community. Since upgrading Splunk to 8.0.1, this one HF has been spewing "TcpOutputProc - Possible duplication of events" for most channels, as well as "TcpOutputProc - Applying quarantine to ip=xx.xx.xx.xx port=9998 _numberOfFailures=2".
We upgraded on the 15th near midnight. This is a daily count of those errors from that host.
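(For reference, a search along these lines should reproduce this count; the exact filters are my best guess, with the host name taken from the channel search further down.)
index=_internal host=ghdsplfwd01lps sourcetype=splunkd component=TcpOutputProc log_level=WARN ("Possible duplication of events" OR "Applying quarantine")
| timechart span=1d count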
2020-02-14 0
2020-02-15 623
2020-02-16 923874
2020-02-17 396920
2020-02-18 678568
2020-02-19 602100
2020-02-20 459284
2020-02-21 1177642
Here is a count from the indexer cluster showing the number of blocked=true events. One would expect these to be similar in count if the indexers were telling the HF to go elsewhere because their queues were full.
index=_internal host=INDEXERNAMES sourcetype=splunkd source=/opt/splunk/var/log/splunk/metrics.log blocked=true component=Metrics
| timechart span=1d count by source
2020-02-14 7
2020-02-15 180
2020-02-16 260
2020-02-17 15
2020-02-18 18
2020-02-19 2415
2020-02-20 1
2020-02-21 2
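(Side note: metrics.log queue lines carry the queue name in the name field on group=queue events, so the same count split by queue instead of by source would show which queue, if any, is actually blocking. This variant is just a sketch, not the exact search we ran.)
index=_internal host=INDEXERNAMES sourcetype=splunkd source=/opt/splunk/var/log/splunk/metrics.log component=Metrics group=queue blocked=true
| timechart span=1d count by name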
Lastly, it's not just one source or channel; it's everything from the host.
index=_internal component=TcpOutputProc host=ghdsplfwd01lps log_level=WARN duplication
| rex field=event_message "channel=source::(?<channel>[^|]+)"
| stats count by channel
/opt/splunk/var/log/introspection/disk_objects.log 51395
/opt/splunk/var/log/introspection/resource_usage.log 45470
mule-prod-analytics 42192
/opt/splunk/var/log/splunk/metrics.log 28283
web_ping://PROD_CommerceHub 27881
web_ping://V8_PROD_CustomSolr5 27877
web_ping://V8_PROD_WebServer4 27873
web_ping://EnterWorks PRD 27871
web_ping://RTP DEV 27870
web_ping://Ensighten 27869
web_ping://RTP 27867
bandwidth 20570
cpu 19949
iostat 19946
ps 19821
Any ideas?
Hi
If you have many separate transforms in props.conf for individual sources/sourcetypes etc., try to combine them into one line, e.g.
TRANSFORMS-foo = foo1
TRANSFORMS-bar = bar1
To
TRANSFORMS-foobar = foo1, bar1
This helped in our case after updating from 6.6.5 to 7.3.3.
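(As a sketch of what that looks like in context, with a made-up sourcetype name; the transforms themselves are defined as usual in transforms.conf:)
# props.conf, before
[my_sourcetype]
TRANSFORMS-foo = foo1
TRANSFORMS-bar = bar1
# props.conf, after
[my_sourcetype]
TRANSFORMS-foobar = foo1, bar1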
Ismo
The HF is still "sick" but here are some things we did that seemed to help.
I'm a little concerned about #2 there. We could still be having issues with the outputs; only now the events are being dropped on the floor. In other words, the condition may still be present and we have simply turned off the logging by removing useAck.
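(For context, useACK is the indexer-acknowledgment setting on the forwarder's tcpout group in outputs.conf; the group and server names below are made up, but this is roughly the change in question:)
# outputs.conf on the HF
[tcpout:primary_indexers]
server = idx1.example.com:9998, idx2.example.com:9998
# With useACK = true the forwarder resends anything the indexers do not
# acknowledge, which is what produces the "Possible duplication of events"
# warnings when acks go missing. Setting it to false (or removing it; false
# is the default) silences the warnings, but unacknowledged events in flight
# during a failure can be lost silently.
useACK = false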