
Duplicate data problem

Builder

Hi

I have the following configuration in inputs.conf:

[monitor:///<directory>]
index=results
crcSalt = <SOURCE>
sourcetype = results

My intent was to index data based on its location, but the following search shows duplicate events with the same source (location):

... | stats count by source

I want to know how to fix this problem.
Output:

source:                             count
 <directory>/filename1     2
 <directory>/filename2     2
 <directory>/filename3     2
 <directory>/filename4     2

Edit:
There is a workaround, but it is undesirable because the duplicate data is still in the index.

Workaround:

... | dedup source 

Esteemed Legend

Find any outputs.conf files on your server (which, BTW, is a forwarder) and show us what is inside them (and where they are). Let's say you have 2 indexers and you have configured the forwarder to send the same events to each indexer separately. That would cause this problem. You can get more insight by modifying your test search to this:

 ... | stats dc(splunk_server) count by source 

Builder

I have only four outputs.conf files:

find ./ -name "outputs.conf"
/etc/modules/distributedDeployment/classes/deployable/outputs.conf
/etc/system/default/outputs.conf
/etc/apps/SplunkLightForwarder/default/outputs.conf
/etc/apps/SplunkForwarder/default/outputs.conf

File at .../classes/deployable:

[tcpout]
disabled=false
# Replace 'YourDeploymentServerHostname' with the ip-address where your deployment server is running.
[tcpout:RouteMetricsToDeploymentServer]
disabled=false
server=YourDeploymentServerHostname:9997

File at /SplunkForwarder/default:

[tcpout]
maxQueueSize = 500kb
forwardedindex.0.whitelist = .*
forwardedindex.1.blacklist = _.*
forwardedindex.2.whitelist = (_audit|_introspection)
forwardedindex.filter.disable = false

File at /SplunkLightForwarder/default:

[tcpout]
forwardedindex.0.whitelist = .*
forwardedindex.1.blacklist = _.*
forwardedindex.2.whitelist = (_audit|_introspection)
forwardedindex.filter.disable = false

File at .../system/default:

[tcpout]
maxQueueSize = auto
forwardedindex.0.whitelist = .*
forwardedindex.1.blacklist = _.*
forwardedindex.2.whitelist = (_audit|_internal|_introspection)
forwardedindex.filter.disable = false
indexAndForward = false
autoLBFrequency = 30
blockOnCloning = true
compressed = false
disabled = false 
dropClonedEventsOnQueueFull = 5
dropEventsOnQueueFull = -1
heartbeatFrequency = 30
maxFailuresPerInterval = 2
secsInFailureInterval = 1
maxConnectionsPerIndexer = 2
forceTimebasedAutoLb = false
sendCookedData = true
connectionTimeout = 20
readTimeout = 300
writeTimeout = 300
tcpSendBufSz = 0
ackTimeoutOnShutdown = 30
useACK = false
blockWarnThreshold = 100
sslQuietShutdown = false

[syslog]
type = udp
priority = <13>
dropEventsOnQueueFull = -1
maxEventSize = 1024

Output of ... | stats dc(splunk_server) count by source:

 source:                                  dc(splunk_server)                count
  <directory>/filename1       1                                        2
  <directory>/filename2       1                                        2
  <directory>/filename3       1                                        2
  <directory>/filename4       1                                        2

All dc(splunk_server) values are 1, and I haven't changed any of those outputs.conf files.


Communicator

Do you need to include crcSalt = <SOURCE>? Best practice is to use it only as needed and not leave it set.
Was it always there, or did you add it?
That is likely causing the data to be re-indexed if the file name is the same.
Try:
your search | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | stats count by source, indextime

Builder

Hi

I included crcSalt because all the files are very similar, and if Splunk thinks they are the same file it will not index them. crcSalt makes sure that all files with a different source (location) are indexed into Splunk. Also, if I disable crcSalt, new files that are added to the directory will not be indexed.
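
For files that differ only after a shared header, one possible alternative to salting is to have Splunk read more of each file before computing its checksum. The sketch below assumes the files diverge somewhere within the first 1024 bytes; that value is illustrative and not taken from this thread. Note that changing this setting, like adding crcSalt, may cause files that were already indexed to be read again.

```
[monitor:///<directory>]
index = results
sourcetype = results
# Assumption: the files differ somewhere in the first 1024 bytes.
# initCrcLength defaults to 256.
initCrcLength = 1024
```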

Output of your search:
  source:                                  indextime                              count
   <directory>/filename1       2015-10-14 14:48:14           1
   <directory>/filename1       2015-10-16 10:27:25           1
   <directory>/filename2       2015-10-14 14:48:14           1
   <directory>/filename2       2015-10-16 10:27:25           1

The output shows that those files were re-indexed on a later day (2015-10-16), which caused the problem. I remember that on that day I added the crcSalt configuration because I wasn't able to index all the files due to their similarity. Once I added the configuration, all files were indexed. It looks like Splunk re-indexed all files even though there were files already indexed with the same SOURCE value.

This suggests that Splunk ignores whatever is already indexed when the inputs.conf file is changed this way. Thanks for your help. Now, how can I solve this issue?
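
One possible cleanup, sketched under assumptions rather than confirmed in this thread: isolate the second batch by its index time, verify it, and then hide it with delete. The index-time window below is an assumption based on the 2015-10-16 timestamps in the output above, and the search's normal time range still has to cover the events' own timestamps. First check what the window matches:

```
index=results sourcetype=results _index_earliest="10/16/2015:00:00:00" _index_latest="10/17/2015:00:00:00"
| stats count by source
```

If every source shows a count of 1, the same base search can be piped to delete:

```
index=results sourcetype=results _index_earliest="10/16/2015:00:00:00" _index_latest="10/17/2015:00:00:00"
| delete
```

The delete command requires a role with the can_delete capability, and it only makes the events unsearchable; it does not reclaim disk space.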


SplunkTrust

Hi edrivera3, some possible explanations:

  • Your files have two identical events
  • You have two forwarders indexing the same file that has one event
  • You have indexing acknowledgement turned on and Splunk re-forwarded the event after an ack from the indexer timed out.

Let me know if this helps!


Builder
  1. There is only one event per file.
  2. I'm not using forwarders (I'm just monitoring a directory on the server).
  3. I don't know what indexing acknowledgement is, but I'm not forwarding anything.

SplunkTrust

Are you saying that ... | stats count by source shows more than one row with the same value for source? That is more or less impossible, given how stats works. So if that is what you're seeing, I suspect there is some tiny difference, possibly as small as a trailing space character on one of them. Can you click each one to drill down and see what search terms it produces?


Builder

Well, it is possible. The command is showing events with the same source (location).

The results of the output:

source:                                   count
<directory>/filename1     2
<directory>/filename2     2
<directory>/filename3     2
<directory>/filename4     2

SplunkTrust

Ah, that makes more sense. Sorry, I didn't realize that this sourcetype is configured to index the entire file as one event. Muebel's answer has the way to proceed with troubleshooting.
