Getting Data In

Why is data getting duplicated?

Communicator

Hi,

I have noticed an issue in my Splunk environment:

Issue:

Data is being indexed twice on the indexers. When I run a search on the search head, every event appears twice. This issue started today; there was no problem with the data before.

My Investigations:

1) Checked the application logs to see whether the same log line exists twice. Answer: No.
2) Checked whether the issue is limited to one sourcetype or one index. Answer: No, it affects data in all indexes.

My questions:

Is there any other reason why this could be happening, and what steps are needed to prevent it?

Thanks in advance.

Regards,
Puneeth


Re: Why is data getting duplicated?

Contributor

Did you check your inputs.conf to see whether two stanzas point to the same source?


Re: Why is data getting duplicated?

Communicator

No, there are no two stanzas pointing to the same source.


Re: Why is data getting duplicated?

SplunkTrust

For your security, I removed your phone number from the question.

---
If this reply helps you, an upvote would be appreciated.

Re: Why is data getting duplicated?

Communicator

Thank you very much.


Re: Why is data getting duplicated?

Ultra Champion

I'm looking for a good best practices document about duplicate data... found this so far - What are best practices for handling data in a Splunk staging environment that needs to go to produc...


Re: Why is data getting duplicated?

SplunkTrust

You mentioned that all of your data is getting duplicated; that sounds like a misconfigured outputs.conf.
Can you confirm how your outputs.conf is configured?

Here's an example with two indexers, indexer1 and indexer2, which are in an indexer cluster; indexer acknowledgement is turned on, and SSL is not in use:
[tcpout]
defaultGroup = allIndexers
disabled = false

[tcpout:allIndexers]
server=indexer1:9997,indexer2:9997
autoLB = true
useACK = true
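
For contrast, one misconfiguration that duplicates every event is listing two target groups in defaultGroup: Splunk clones the data to each group, so if both groups resolve to the same indexers, everything is indexed twice. A hypothetical sketch (group names are illustrative, not from your environment):

[tcpout]
# Two groups here means every event is CLONED to both groups.
# If both groups point at the same indexers, each event is indexed twice.
defaultGroup = groupA, groupB

[tcpout:groupA]
server = indexer1:9997,indexer2:9997

[tcpout:groupB]
server = indexer1:9997,indexer2:9997

Cloning to multiple groups is a legitimate feature when the groups are different destinations (e.g. production plus a test cluster); it only causes duplicates when the groups overlap.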


Re: Why is data getting duplicated?

Communicator

# Version 6.5.1
#
# DO NOT EDIT THIS FILE!
#
# Changes to default files will be lost on update and are difficult to
# manage and support.
#
# Please make any changes to system defaults by overriding them in
# apps or $SPLUNK_HOME/etc/system/local
# (See "Configuration file precedence" in the web documentation).
#
# To override a specific setting, copy the name of the stanza and
# setting to the file where you wish to override it.

[tcpout]
maxQueueSize = auto
forwardedindex.0.whitelist = .*
forwardedindex.1.blacklist = .*
forwardedindex.2.whitelist = (_audit|_internal|_introspection|_telemetry)
forwardedindex.filter.disable = false
indexAndForward = false
autoLBFrequency = 30
blockOnCloning = true
compressed = false
disabled = false
dropClonedEventsOnQueueFull = 5
dropEventsOnQueueFull = -1
heartbeatFrequency = 30
maxFailuresPerInterval = 2
secsInFailureInterval = 1
maxConnectionsPerIndexer = 2
forceTimebasedAutoLB = false
sendCookedData = true
connectionTimeout = 20
readTimeout = 300
writeTimeout = 300
tcpSendBufSz = 0
ackTimeoutOnShutdown = 30
useACK = false
blockWarnThreshold = 100
sslQuietShutdown = false

[syslog]
type = udp
priority = <13>
dropEventsOnQueueFull = -1
maxEventSize = 1024


Re: Why is data getting duplicated?

SplunkTrust

That is the outputs.conf from the default directory.
Perhaps try:
splunk btool outputs list --debug


Re: Why is data getting duplicated?

Splunk Employee

In case of duplicate events, we need to check the following:

  1. Whether the source file itself contains duplicate events
  2. Whether two inputs.conf stanzas, or two forwarders, were mistakenly configured to read the same data
  3. Whether the original application intentionally sends the same data to two different channels (e.g. two files)
  4. Behavior where the forwarder is convinced to read a file multiple times, such as an explicit fishbucket reset or incorrect use of crcSalt
  5. Monitoring a directory that contains symlink loops
  6. Use of the forwarder acknowledgement system (useACK), where network failures are intentionally allowed to produce small amounts of duplicated data rather than risk data loss
  7. Use of summary indexing, which intentionally duplicates events in Splunk
  8. A bug in the original application that produces duplicated log lines

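As an illustration of item 4: adding crcSalt to a monitor stanza that is already indexing data changes the checksum Splunk uses to track files, so every previously indexed file is treated as new and re-read in full, duplicating its events once. A hypothetical inputs.conf sketch (the path and sourcetype are illustrative):

[monitor:///var/log/myapp]
sourcetype = myapp
# Adding this to a stanza that was already indexing data changes the
# file checksum, so existing files are re-read from the beginning.
crcSalt = <SOURCE>
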
The following REST endpoint lists all files known to the tailing processor along with their status (read, ignored, blacklisted, etc.):
Link: https://[splunkdhostname]:[splunkdport]/services/admin/inputstatus/tailingprocessor:filestatus

If you cannot rectify the issue with the above checks, you can enable DEBUG logging for the following components:

  1. TailingProcessor
  2. BatchReader
  3. WatchedFile
  4. FileTracker
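
One way to raise those components to DEBUG is to override their log categories in $SPLUNK_HOME/etc/log-local.cfg and restart splunkd (a sketch, assuming the default category names from log.cfg; you can also change them temporarily under Settings > Server settings > Server logging without a restart):

[splunkd]
category.TailingProcessor=DEBUG
category.BatchReader=DEBUG
category.WatchedFile=DEBUG
category.FileTracker=DEBUG

Remember to revert these to INFO afterwards, as DEBUG logging on file inputs is very noisy.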

To check whether events are duplicated, you can use the following SPL:
| eval md=md5(_raw) | stats count by md | where count > 1
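
To see where the duplicates come from rather than just counting them, the same idea can be extended (source, host, and _time are standard default fields; md is the computed hash):

| eval md=md5(_raw)
| stats count values(source) as sources values(host) as hosts earliest(_time) as first latest(_time) as last by md
| where count > 1

If the duplicate pairs share a source but differ in host, suspect two forwarders reading the same file; if they share both, suspect the output path or an input re-read.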

For more information, see the community wiki page Troubleshooting Monitor Inputs:
Link: https://wiki.splunk.com/Community:Troubleshooting_Monitor_Inputs