Why am I seeing duplicate events in my metrics indexes?

rphillips_splk
Splunk Employee

I am seeing duplicate events in a metrics index. Help!

 

deployment flow:
HEC client ---> load balancer ---> HFs (HEC receivers) ---> indexers (metrics index)


rphillips_splk
Splunk Employee

note: The HF is using useACK=true in outputs.conf, which is what causes the duplicate events.

root cause:

useACK does not work for events that lack a _raw field, and by design, metrics data does not contain _raw.
useACK is implemented to track _raw: if _raw is not present, the indexer never sends an ACK back to the HF, so the HF eventually times out and resends the data, producing duplicate events. Do not use useACK=true when you send events without _raw (such as metrics) to a metrics index.
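To make the mechanism concrete, here is a toy Python model of the behaviour described above. It is not Splunk's implementation: ToyIndexer, forward, and the fixed number of simulated timeouts are all hypothetical, and the model simply encodes the claim that an ACK comes back only for events that carry _raw.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndexer:
    """Hypothetical indexer: stores every event, but ACKs only events with _raw."""
    stored: list = field(default_factory=list)

    def receive(self, event: dict) -> bool:
        self.stored.append(event)   # the event is written either way
        return "_raw" in event      # True means an ACK goes back to the forwarder


def forward(event: dict, indexer: ToyIndexer, use_ack: bool, timeouts_to_simulate: int = 2) -> int:
    """Hypothetical forwarder; returns how many times the event was sent."""
    if not use_ack:
        indexer.receive(event)      # useACK=false: send once, never resend
        return 1
    sends = 0
    # useACK=true: initial send, plus one resend after each ACK timeout (300s on the HF)
    for _ in range(timeouts_to_simulate + 1):
        sends += 1
        if indexer.receive(event):
            break                   # ACK received: the event leaves the wait queue
    return sends


metric = {"metric_name": "total", "_value": 1099511627776, "host": "host_77"}  # no _raw
log_event = {"_raw": "disk check ok", "host": "host_77"}

idx = ToyIndexer()
print(forward(metric, idx, use_ack=True))      # 3 copies after two simulated timeouts
print(forward(log_event, idx, use_ack=True))   # 1 copy: ACKed on the first send
print(forward(metric, idx, use_ack=False))     # 1 copy: no ACK expected
print(len(idx.stored))                         # 5
```

The model keeps time out of the picture on purpose: in the real scenario each resend happens after the 300-second ACK timeout and there is no cap on the number of resends.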


reproduce the problem:

Send one JSON event to the metrics index:

curl -k https://lb.sv.splunk.com:8088/services/collector \
-H "Authorization: Splunk <token>" \
-d '{"time": 1614193927.000,"source":"disk","host":"host_77","fields":{"region":"us-west-1","datacenter":"us-west-1a","rack":"63","os":"Ubuntu16.10","arch":"x64","team":"LON","service":"6","service_version":"0","service_environment":"test","path":"/dev/sda1","fstype":"ext3","_value":1099511627776,"metric_name":"total"}}'
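For reference, the same request can be built with Python's standard library. Spelled out this way, it is easy to see that the payload carries only a timestamp, metadata, and fields (including metric_name and _value), with no event text that could become _raw. The URL and the <token> placeholder are copied from the curl command; send() is a hypothetical helper and, like curl -k, it skips TLS certificate verification, so treat it as test-environment only.

```python
import json
import ssl
import urllib.request

HEC_URL = "https://lb.sv.splunk.com:8088/services/collector"
TOKEN = "<token>"  # placeholder: replace with your HEC token

PAYLOAD = {
    "time": 1614193927.000,
    "source": "disk",
    "host": "host_77",
    "fields": {
        "region": "us-west-1",
        "datacenter": "us-west-1a",
        "rack": "63",
        "os": "Ubuntu16.10",
        "arch": "x64",
        "team": "LON",
        "service": "6",
        "service_version": "0",
        "service_environment": "test",
        "path": "/dev/sda1",
        "fstype": "ext3",
        "_value": 1099511627776,
        "metric_name": "total",
    },
}


def build_hec_request(url: str, token: str, body: dict) -> urllib.request.Request:
    """Build the POST that `curl -d ...` sends (curl -d implies POST)."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Splunk {token}"},
        method="POST",
    )


def send(req: urllib.request.Request) -> str:
    """Hypothetical helper: like curl -k, this disables certificate checks."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False      # must be turned off before setting CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx) as resp:
        return resp.read().decode("utf-8")


req = build_hec_request(HEC_URL, TOKEN, PAYLOAD)
# send(req)  # uncomment to actually post the event
```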

Wait 5 minutes, then search the metrics index; you will see the duplicate event:

| msearch index="mymetrics"
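To confirm that the extra rows are exact copies rather than distinct measurements, you can also compare the data points outside Splunk. A minimal Python sketch, assuming the search results have been exported as a list of flat records (the export step and the field names below are assumptions, modelled on the curl payload):

```python
from collections import Counter


def find_duplicates(points: list[dict]) -> dict[tuple, int]:
    """Map each data point that occurs more than once to its number of copies.

    Two points count as duplicates when every field matches (timestamp,
    metric name, value and dimensions); this is the sketch's own rule.
    """
    counts = Counter(tuple(sorted(p.items())) for p in points)
    return {key: n for key, n in counts.items() if n > 1}


point = {"_time": 1614193927, "metric_name": "total", "_value": 1099511627776,
         "host": "host_77", "path": "/dev/sda1"}
results = [point, dict(point), {**point, "path": "/dev/sda2"}]

for key, copies in find_duplicates(results).items():
    print(copies, dict(key))
```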


You will also see ACK timeouts and possible-duplication warnings: because the indexers never send back an ACK, the HF resends the event again and again, every 300 seconds.

For example, in the HF's splunkd.log:

02-24-2021 14:47:49.702 -0500 WARN TcpOutputProc - Read operation timed out expecting ACK from 10.10.10.1:9997 in 300 seconds.
02-24-2021 14:47:49.703 -0500 WARN TcpOutputProc - Possible duplication of events with channel=source::disk|host::host_xx-withACK|_json|, streamId=0, offset=0 on host=10.10.10.1:9997

02-24-2021 14:52:51.498 -0500 WARN TcpOutputProc - Read operation timed out expecting ACK from 10.10.10.1:9997 in 300 seconds.
02-24-2021 14:52:51.498 -0500 WARN TcpOutputProc - Possible duplication of events with channel=source::disk|host::host_xx-withACK|_json|, streamId=0, offset=0 on host=10.10.10.1:9997
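The retry pattern in these warnings can also be pulled out of splunkd.log programmatically. A Python sketch whose regular expressions are written against the sample lines above; real splunkd.log lines may contain extra fields (for example a thread name), so treat the patterns as a starting point rather than a complete parser.

```python
import re
from datetime import datetime

TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) \S+ WARN\s+TcpOutputProc\b.*?"
    r"Read operation timed out expecting ACK from (?P<peer>\S+) in (?P<secs>\d+) seconds"
)
DUP_RE = re.compile(r"Possible duplication of events with channel=(?P<channel>\S+), ")

SAMPLE = """\
02-24-2021 14:47:49.702 -0500 WARN TcpOutputProc - Read operation timed out expecting ACK from 10.10.10.1:9997 in 300 seconds.
02-24-2021 14:47:49.703 -0500 WARN TcpOutputProc - Possible duplication of events with channel=source::disk|host::host_xx-withACK|_json|, streamId=0, offset=0 on host=10.10.10.1:9997
02-24-2021 14:52:51.498 -0500 WARN TcpOutputProc - Read operation timed out expecting ACK from 10.10.10.1:9997 in 300 seconds.
02-24-2021 14:52:51.498 -0500 WARN TcpOutputProc - Possible duplication of events with channel=source::disk|host::host_xx-withACK|_json|, streamId=0, offset=0 on host=10.10.10.1:9997
"""


def scan(lines):
    """Collect ACK timeouts as (time, peer, timeout) and channels flagged for duplication."""
    timeouts, channels = [], []
    for line in lines:
        if m := TIMEOUT_RE.match(line):
            ts = datetime.strptime(m["ts"], "%m-%d-%Y %H:%M:%S.%f")  # UTC offset ignored
            timeouts.append((ts, m["peer"], int(m["secs"])))
        if d := DUP_RE.search(line):
            channels.append(d["channel"])
    return timeouts, channels


timeouts, channels = scan(SAMPLE.splitlines())
gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(timeouts, timeouts[1:])]
print(len(timeouts), "ACK timeouts; seconds between them:", gaps)
print("channels flagged:", set(channels))
```

On the sample, the two timeouts are about 301.8 seconds apart: the 300-second ACK timeout plus the time to resend.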

 


solution:
1.) Create two output groups on the HF, one for event data and one for metrics data. In the output group for metrics data, set useACK=false.
For example, HF outputs.conf:

[tcpout]
defaultGroup = clustered_indexers_with_useACK

[tcpout:clustered_indexers_with_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = true

[tcpout:clustered_indexers_without_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = false
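To double-check which output groups end up with useACK enabled, outputs.conf is INI-like enough for a quick look with Python's configparser. This is a sketch, not a Splunk tool: configparser does not implement every .conf rule, and the fallback order used here (group setting, then [tcpout], then false) is an assumption based on useACK defaulting to false and [tcpout] settings acting as global defaults.

```python
import configparser

OUTPUTS_CONF = """\
[tcpout]
defaultGroup = clustered_indexers_with_useACK

[tcpout:clustered_indexers_with_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = true

[tcpout:clustered_indexers_without_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = false
"""


def ack_by_group(outputs_text: str) -> dict[str, bool]:
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep Splunk's camelCase keys (useACK, defaultGroup)
    parser.read_string(outputs_text)
    # Assumed fallback order: group setting, then [tcpout], then false.
    global_ack = parser.getboolean("tcpout", "useACK", fallback=False)
    return {
        section.removeprefix("tcpout:"): parser.getboolean(section, "useACK", fallback=global_ack)
        for section in parser.sections()
        if section.startswith("tcpout:")
    }


for group, ack in ack_by_group(OUTPUTS_CONF).items():
    print(f"{group}: useACK={ack}")
```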


2.) In the HTTP input stanzas that receive metrics data, use the outputgroup setting to route the data to the output group where useACK=false.

The outputgroup setting for HTTP inputs, from inputs.conf.spec:

[http://<name>]
outputgroup = <string>
* The name of the forwarding output group to send data to.
* Default: empty string

Example HF inputs.conf:

[http]
busyKeepAliveIdleTimeout = 90
dedicatedIoThreads = 2
disabled = 0
enableSSL = 1
port = 8088
queueSize = 2MB


[http://metrics_data]
disabled = 0
host = hf1
index = mymetrics
indexes = mymetrics
sourcetype = _json
token = <token>
outputgroup = clustered_indexers_without_useACK

[http://event_data]
disabled = 0
host = hf1
index = main
indexes = main
sourcetype = _json
token = <token>


Because defaultGroup = clustered_indexers_with_useACK, any input that does not specify an outputgroup sends its data to the default output group, so we don't need to declare an outputgroup in the [http://event_data] input stanza.
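The routing rule above (an input's own outputgroup wins, otherwise its data goes to defaultGroup) can be checked mechanically. A Python sketch with configparser, assuming you know which indexes are metrics indexes (METRICS_INDEXES is hypothetical) and that defaultGroup names a single group; it flags any HTTP input that writes to a metrics index through a group with useACK=true. The .conf text is trimmed to the settings that matter here.

```python
import configparser

OUTPUTS_CONF = """\
[tcpout]
defaultGroup = clustered_indexers_with_useACK

[tcpout:clustered_indexers_with_useACK]
useACK = true

[tcpout:clustered_indexers_without_useACK]
useACK = false
"""

INPUTS_CONF = """\
[http://metrics_data]
index = mymetrics
outputgroup = clustered_indexers_without_useACK

[http://event_data]
index = main
"""

METRICS_INDEXES = {"mymetrics"}  # hypothetical: list your metrics indexes here


def load_conf(text: str) -> configparser.ConfigParser:
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep camelCase keys such as defaultGroup
    parser.read_string(text)
    return parser


def resolve_output_group(inputs, outputs, stanza: str) -> str:
    """An input's own outputgroup wins; otherwise fall back to [tcpout] defaultGroup."""
    return inputs.get(stanza, "outputgroup", fallback="") or outputs.get("tcpout", "defaultGroup")


def metrics_inputs_using_ack(inputs, outputs, metrics_indexes) -> list[tuple[str, str]]:
    flagged = []
    for stanza in inputs.sections():
        if not stanza.startswith("http://"):
            continue  # skip the global [http] stanza and non-HEC inputs
        if inputs.get(stanza, "index", fallback="") not in metrics_indexes:
            continue
        group = resolve_output_group(inputs, outputs, stanza)
        # Group-level useACK only, for brevity (no [tcpout] fallback here).
        if outputs.getboolean(f"tcpout:{group}", "useACK", fallback=False):
            flagged.append((stanza, group))
    return flagged


outputs = load_conf(OUTPUTS_CONF)
print(metrics_inputs_using_ack(load_conf(INPUTS_CONF), outputs, METRICS_INDEXES))  # []
```

Dropping the outputgroup line from [http://metrics_data] would make the check report that stanza with clustered_indexers_with_useACK, which is the configuration that produced the duplicates.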

 

 

rphillips_splk
Splunk Employee

deployment flow:
client ---> load balancer ---> HFs ---> indexers


rphillips_splk
Splunk Employee

note:
If you try to assign an output group to your token in Splunk Web (Settings > Data inputs > HTTP Event Collector > New Token), you may notice that no output groups appear.

For the groups to appear in the UI, you must set disabled = false on them in outputs.conf.

 

For example:

[tcpout]
defaultGroup = clustered_indexers_with_useACK
disabled = false

[tcpout:clustered_indexers_with_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = true
disabled = false

[tcpout:clustered_indexers_without_useACK]
server = idx1.splunk.com:9997,idx2.splunk.com:9997,idx3.splunk.com:9997
useACK = false
disabled = false
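Following this note, a quick configparser sketch can list the output groups in an outputs.conf that do not explicitly set disabled = false. Treating a missing setting as "will not show up in the token's output-group list" is this sketch's reading of the note, not documented Splunk behaviour.

```python
import configparser

NO_DISABLED = """\
[tcpout]
defaultGroup = clustered_indexers_with_useACK

[tcpout:clustered_indexers_with_useACK]
useACK = true

[tcpout:clustered_indexers_without_useACK]
useACK = false
"""

WITH_DISABLED = """\
[tcpout]
defaultGroup = clustered_indexers_with_useACK
disabled = false

[tcpout:clustered_indexers_with_useACK]
useACK = true
disabled = false

[tcpout:clustered_indexers_without_useACK]
useACK = false
disabled = false
"""


def groups_missing_disabled_false(outputs_text: str) -> list[str]:
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str
    parser.read_string(outputs_text)
    missing = []
    for section in parser.sections():
        if not section.startswith("tcpout:"):
            continue  # only [tcpout:<group>] stanzas define output groups
        # None when the key is absent; False only for an explicit false/0/no/off
        if parser.getboolean(section, "disabled", fallback=None) is not False:
            missing.append(section.removeprefix("tcpout:"))
    return missing


print(groups_missing_disabled_false(NO_DISABLED))
print(groups_missing_disabled_false(WITH_DISABLED))  # []
```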

 
