I am seeing duplicate events in a metrics index, help!
deployment flow:
hec client--->load balancer--->HFs (hec receivers)--->Indexers (metrics index)
note: HF is using useACK=true in outputs.conf which is causing the duplication of events.
root cause:
useACK is not available for events where the _raw field is missing. By design, metrics data does not contain an _raw field.
Do not use useACK=true if you are sending events to a metrics index where the event is missing _raw. useACK is implemented to track _raw. If _raw is not presented, there is no ACK sent back (from indexer to HF) which will cause duplicate events.
produce the problem:
send 1 json event to metrics index:
curl -k https://lb.sv.splunk.com:8088/services/collector \
-H "Authorization: Splunk <token>" \
-d '{"time": 1614193927.000,"source":"disk","host":"host_77","fields":{"region":"us-west-1","datacenter":"us-west-1a","rack":"63","os":"Ubuntu16.10","arch":"x64","team":"LON","service":"6","service_version":"0","service_environment":"test","path":"/dev/sda1","fstype":"ext3","_value":1099511627776,"metric_name":"total"}}'
wait 5m and search the metric index and see the duplicate event
| msearch index="mymetrics"
see the ACK timeout and duplication because indexers never send back ACK so the HF sends the event again and again every 300s.
ie:
HF splunkd.log:
02-24-2021 14:47:49.702 -0500 WARN TcpOutputProc - Read operation timed out expecting ACK from 10.10.10.1:9997 in 300 seconds.
02-24-2021 14:47:49.703 -0500 WARN TcpOutputProc - Possible duplication of events with channel=source::disk|host::host_xx-withACK|_json|, streamId=0, offset=0 on host=10.10.10.1:9997
02-24-2021 14:52:51.498 -0500 WARN TcpOutputProc - Read operation timed out expecting ACK from 10.10.10.1:9997 in 300 seconds.
02-24-2021 14:52:51.498 -0500 WARN TcpOutputProc - Possible duplication of events with channel=source::disk|host::host_xx-withACK|_json|, streamId=0, offset=0 on host=10.10.10.1:9997
solution:
1.) create two output groups on the HF, one for event data and one for metric data. For the outputgroup for metric data set useACK=false
ie: HF:
outputs.conf:
[tcpout]
defaultGroup = clustered_indexers_with_useACK
[tcpout:clustered_indexers_with_useACK]
server=idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = true
[tcpout:clustered_indexers_without_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = false
2.) on specific http input stanzas where metric data is received, use outputgroup attribute to send the the outputgroup where useACK=false
HF inputs.conf:
[http://<name>]
outputgroup = <string>
* The name of the forwarding output group to send data to.
* Default: empty string
example HF:
inputs.conf
[http]
busyKeepAliveIdleTimeout = 90
dedicatedIoThreads = 2
disabled = 0
enableSSL = 1
port = 8088
queueSize = 2MB
[http://metrics_data]
disabled = 0
host = hf1
index = mymetrics
indexes = mymetrics
sourcetype = _json
token = <token>
outputgroup = clustered_indexers_without_useACK
[http://event_data]
disabled = 0
host = hf1
index = main
indexes = main
sourcetype = _json
token = <token>
Since defaultGroup= clustered_indexers_with_useACK , if we don't specify any outputgroup the data gets sent to the default outputgroup so we dont need to declare it in the [http://event_data] input stanza.
note: HF is using useACK=true in outputs.conf which is causing the duplication of events.
root cause:
useACK is not available for events where the _raw field is missing. By design, metrics data does not contain an _raw field.
Do not use useACK=true if you are sending events to a metrics index where the event is missing _raw. useACK is implemented to track _raw. If _raw is not presented, there is no ACK sent back (from indexer to HF) which will cause duplicate events.
produce the problem:
send 1 json event to metrics index:
curl -k https://lb.sv.splunk.com:8088/services/collector \
-H "Authorization: Splunk <token>" \
-d '{"time": 1614193927.000,"source":"disk","host":"host_77","fields":{"region":"us-west-1","datacenter":"us-west-1a","rack":"63","os":"Ubuntu16.10","arch":"x64","team":"LON","service":"6","service_version":"0","service_environment":"test","path":"/dev/sda1","fstype":"ext3","_value":1099511627776,"metric_name":"total"}}'
wait 5m and search the metric index and see the duplicate event
| msearch index="mymetrics"
see the ACK timeout and duplication because indexers never send back ACK so the HF sends the event again and again every 300s.
ie:
HF splunkd.log:
02-24-2021 14:47:49.702 -0500 WARN TcpOutputProc - Read operation timed out expecting ACK from 10.10.10.1:9997 in 300 seconds.
02-24-2021 14:47:49.703 -0500 WARN TcpOutputProc - Possible duplication of events with channel=source::disk|host::host_xx-withACK|_json|, streamId=0, offset=0 on host=10.10.10.1:9997
02-24-2021 14:52:51.498 -0500 WARN TcpOutputProc - Read operation timed out expecting ACK from 10.10.10.1:9997 in 300 seconds.
02-24-2021 14:52:51.498 -0500 WARN TcpOutputProc - Possible duplication of events with channel=source::disk|host::host_xx-withACK|_json|, streamId=0, offset=0 on host=10.10.10.1:9997
solution:
1.) create two output groups on the HF, one for event data and one for metric data. For the outputgroup for metric data set useACK=false
ie: HF:
outputs.conf:
[tcpout]
defaultGroup = clustered_indexers_with_useACK
[tcpout:clustered_indexers_with_useACK]
server=idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = true
[tcpout:clustered_indexers_without_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = false
2.) on specific http input stanzas where metric data is received, use outputgroup attribute to send the the outputgroup where useACK=false
HF inputs.conf:
[http://<name>]
outputgroup = <string>
* The name of the forwarding output group to send data to.
* Default: empty string
example HF:
inputs.conf
[http]
busyKeepAliveIdleTimeout = 90
dedicatedIoThreads = 2
disabled = 0
enableSSL = 1
port = 8088
queueSize = 2MB
[http://metrics_data]
disabled = 0
host = hf1
index = mymetrics
indexes = mymetrics
sourcetype = _json
token = <token>
outputgroup = clustered_indexers_without_useACK
[http://event_data]
disabled = 0
host = hf1
index = main
indexes = main
sourcetype = _json
token = <token>
Since defaultGroup= clustered_indexers_with_useACK , if we don't specify any outputgroup the data gets sent to the default outputgroup so we dont need to declare it in the [http://event_data] input stanza.
deployment flow:
client--->load balancer--->HFs--->Indexers
note:
if you are trying to set the output group to your token via splunk web UI you may notice that no output groups show up.
Settings>Data>Inputs>HTTP Event Collect > New Token
You must configure the groups for disabled = false in outputs.conf for them to appear in the UI.
ie:
[tcpout]
defaultGroup = clustered_indexers_with_useACK
disabled = false
[tcpout:clustered_indexers_with_useACK]
server=idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = true
disabled = false
[tcpout:clustered_indexers_without_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = false
disabled = false