Getting Data In

Why am I seeing duplicate events in my metrics indexes?

rphillips_splk
Splunk Employee

I am seeing duplicate events in a metrics index, help!

 

deployment flow:
hec client--->load balancer--->HFs (hec receivers)--->Indexers (metrics index)

1 Solution

rphillips_splk
Splunk Employee

note: the HF is using useACK = true in outputs.conf, which is causing the duplication of events.

root cause:

useACK does not work for events where the _raw field is missing, and by design metrics data does not contain a _raw field.
Do not use useACK = true if you are sending data to a metrics index, where events have no _raw. useACK is implemented to track _raw; if _raw is not present, no ACK is sent back (from the indexer to the HF), so the HF resends and duplicates the events.
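The distinction is visible in the HEC payload shapes themselves. A minimal sketch (illustrative payloads only, nothing is sent anywhere): an event payload carries a body that becomes _raw at index time, while a metrics payload carries only dimensions and measurements under "fields", leaving nothing for _raw.

```python
# Illustrative HEC payloads: an event payload has an "event" body that
# becomes _raw; a metrics payload has only "fields" and no raw body.
event_payload = {
    "time": 1614193927.000,
    "host": "host_77",
    "sourcetype": "_json",
    "event": "disk total=1099511627776",  # becomes _raw at index time
}

metrics_payload = {
    "time": 1614193927.000,
    "source": "disk",
    "host": "host_77",
    "fields": {
        "region": "us-west-1",
        "metric_name": "total",
        "_value": 1099511627776,
    },
}

print("event" in event_payload)    # True  -- event data has a raw body
print("event" in metrics_payload)  # False -- metrics data does not
```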


to reproduce the problem:

send one JSON event to the metrics index:

curl -k https://lb.sv.splunk.com:8088/services/collector \
-H "Authorization: Splunk <token>" \
-d '{"time": 1614193927.000,"source":"disk","host":"host_77","fields":{"region":"us-west-1","datacenter":"us-west-1a","rack":"63","os":"Ubuntu16.10","arch":"x64","team":"LON","service":"6","service_version":"0","service_environment":"test","path":"/dev/sda1","fstype":"ext3","_value":1099511627776,"metric_name":"total"}}'

wait 5 minutes, then search the metrics index and see the duplicate event:

| msearch index="mymetrics"


You will see the ACK timeout and the duplication: the indexers never send back an ACK, so the HF resends the event again and again every 300 seconds.

For example, in the HF's splunkd.log:

02-24-2021 14:47:49.702 -0500 WARN TcpOutputProc - Read operation timed out expecting ACK from 10.10.10.1:9997 in 300 seconds.
02-24-2021 14:47:49.703 -0500 WARN TcpOutputProc - Possible duplication of events with channel=source::disk|host::host_xx-withACK|_json|, streamId=0, offset=0 on host=10.10.10.1:9997

02-24-2021 14:52:51.498 -0500 WARN TcpOutputProc - Read operation timed out expecting ACK from 10.10.10.1:9997 in 300 seconds.
02-24-2021 14:52:51.498 -0500 WARN TcpOutputProc - Possible duplication of events with channel=source::disk|host::host_xx-withACK|_json|, streamId=0, offset=0 on host=10.10.10.1:9997
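Conceptually, the resend loop behind those WARN messages looks like this (a simplified sketch of the behavior, not Splunk's actual implementation; the function names and the resend cap are hypothetical):

```python
ACK_TIMEOUT_SECS = 300  # matches the 300-second timeout in the WARN messages

def forward_with_ack(event, send, wait_for_ack, max_sends=3):
    """Send an event and resend it until an ACK arrives (simplified sketch)."""
    sends = 0
    while sends < max_sends:
        send(event)
        sends += 1
        if wait_for_ack(timeout=ACK_TIMEOUT_SECS):
            break  # event data: _raw was tracked, ACK received, stop
    return sends

# Event data: the indexer ACKs, so the event is sent exactly once.
print(forward_with_ack({}, send=lambda e: None,
                       wait_for_ack=lambda timeout: True))   # 1

# Metrics data: no _raw means no ACK, so every wait times out and the
# same event is resent until the cap -- each extra send is a duplicate.
print(forward_with_ack({}, send=lambda e: None,
                       wait_for_ack=lambda timeout: False))  # 3
```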

 


solution:
1.) Create two output groups on the HF, one for event data and one for metric data. For the metric-data output group, set useACK = false.

For example, HF outputs.conf:

[tcpout]
defaultGroup = clustered_indexers_with_useACK

[tcpout:clustered_indexers_with_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = true

[tcpout:clustered_indexers_without_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = false


2.) On the specific HTTP input stanzas where metric data is received, use the outputgroup attribute to route the data to the output group where useACK = false.

HF inputs.conf:

[http://<name>]
outputgroup = <string>
* The name of the forwarding output group to send data to.
* Default: empty string

example HF inputs.conf:

[http]
busyKeepAliveIdleTimeout = 90
dedicatedIoThreads = 2
disabled = 0
enableSSL = 1
port = 8088
queueSize = 2MB


[http://metrics_data]
disabled = 0
host = hf1
index = mymetrics
indexes = mymetrics
sourcetype = _json
token = <token>
outputgroup = clustered_indexers_without_useACK

[http://event_data]
disabled = 0
host = hf1
index = main
indexes = main
sourcetype = _json
token = <token>


Since defaultGroup = clustered_indexers_with_useACK, any input that does not specify an outputgroup sends its data to the default output group, so we don't need to declare it in the [http://event_data] input stanza.
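A quick way to sanity-check this routing is to mimic the outputgroup fallback on a trimmed copy of the stanzas. A standalone sketch using Python's configparser (it only models the fallback to defaultGroup described above; it does not read Splunk's real layered configuration):

```python
import configparser

# Trimmed copies of the stanzas from the example above.
inputs_conf = """
[http://metrics_data]
index = mymetrics
outputgroup = clustered_indexers_without_useACK

[http://event_data]
index = main
"""

DEFAULT_GROUP = "clustered_indexers_with_useACK"  # [tcpout] defaultGroup

cp = configparser.ConfigParser()
cp.read_string(inputs_conf)

# An input without an explicit outputgroup falls back to the default group.
for stanza in cp.sections():
    group = cp[stanza].get("outputgroup", DEFAULT_GROUP)
    print(f"{stanza} -> {group}")
```

This prints metrics_data routing to clustered_indexers_without_useACK and event_data falling back to clustered_indexers_with_useACK.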

 

 





rphillips_splk
Splunk Employee

note:
If you try to set the output group for your token via the Splunk Web UI (Settings > Data inputs > HTTP Event Collector > New Token), you may notice that no output groups show up.

You must set disabled = false on the groups in outputs.conf for them to appear in the UI.

 

For example:

[tcpout]
defaultGroup = clustered_indexers_with_useACK
disabled = false

[tcpout:clustered_indexers_with_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = true
disabled = false

[tcpout:clustered_indexers_without_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = false
disabled = false

 
