<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Duplicate data problem in Getting Data In</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204682#M40386</link>
    <description>&lt;P&gt;Hi edrivera3, some possible explanations:&lt;/P&gt;

&lt;UL&gt;
&lt;LI&gt;Your files have two identical events&lt;/LI&gt;
&lt;LI&gt;You have two forwarders indexing the same file that has one event&lt;/LI&gt;
&lt;LI&gt;You have indexing acknowledgement turned on and Splunk re-forwarded the event after a timeout waiting for the acknowledgement from the indexer.&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;Let me know if this helps!&lt;/P&gt;</description>
    <pubDate>Fri, 23 Oct 2015 19:03:18 GMT</pubDate>
    <dc:creator>muebel</dc:creator>
    <dc:date>2015-10-23T19:03:18Z</dc:date>
    <item>
      <title>Duplicate data problem</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204679#M40383</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;

&lt;P&gt;I have the following configuration in inputs.conf:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[monitor:///&amp;lt;directory&amp;gt;]
index=results
crcSalt = &amp;lt;SOURCE&amp;gt;
sourcetype = results
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;My intent was to index data based on its location, but the following command displays duplicates with the same source (location). &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;... | stats count by source
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;I want to know how to fix this problem.&lt;BR /&gt;
Output:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;source:                             count
 &amp;lt;directory&amp;gt;/filename1     2
 &amp;lt;directory&amp;gt;/filename2     2
 &amp;lt;directory&amp;gt;/filename3     2
 &amp;lt;directory&amp;gt;/filename4     2
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Edit:&lt;BR /&gt;
There is a workaround, but it is undesirable because the duplicate data is still in the index. &lt;/P&gt;

&lt;P&gt;Workaround:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;... | dedup source 
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 23 Oct 2015 15:34:52 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204679#M40383</guid>
      <dc:creator>edrivera3</dc:creator>
      <dc:date>2015-10-23T15:34:52Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data problem</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204680#M40384</link>
      <description>&lt;P&gt;Are you saying that &lt;CODE&gt;... | stats count by source&lt;/CODE&gt; shows more than one row with the same value for source?  That is kind of impossible, due to the nature of stats.  So if that is what you're seeing, I suspect there is some tiny, tiny difference, possibly as tiny as a trailing space character on one of them.  Can you click each of them to drill down and see what search terms are yielded? &lt;/P&gt;</description>
      <pubDate>Fri, 23 Oct 2015 15:56:56 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204680#M40384</guid>
      <dc:creator>sideview</dc:creator>
      <dc:date>2015-10-23T15:56:56Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data problem</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204681#M40385</link>
      <description>&lt;P&gt;Well, it is possible. The command is showing events with the same source (location).&lt;/P&gt;

&lt;P&gt;The results of the output:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;source:                                   count
&amp;lt;directory&amp;gt;/filename1     2
&amp;lt;directory&amp;gt;/filename2     2
&amp;lt;directory&amp;gt;/filename3     2
&amp;lt;directory&amp;gt;/filename4     2
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 23 Oct 2015 16:19:35 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204681#M40385</guid>
      <dc:creator>edrivera3</dc:creator>
      <dc:date>2015-10-23T16:19:35Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data problem</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204682#M40386</link>
      <description>&lt;P&gt;Hi edrivera3, some possible explanations:&lt;/P&gt;

&lt;UL&gt;
&lt;LI&gt;Your files have two identical events&lt;/LI&gt;
&lt;LI&gt;You have two forwarders indexing the same file that has one event&lt;/LI&gt;
&lt;LI&gt;You have indexing acknowledgement turned on and Splunk re-forwarded the event after a timeout waiting for the acknowledgement from the indexer.&lt;/LI&gt;
&lt;/UL&gt;
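
&lt;P&gt;A quick way to tell these cases apart (just a sketch, using the standard metadata fields &lt;CODE&gt;_indextime&lt;/CODE&gt; and &lt;CODE&gt;host&lt;/CODE&gt;) is to split the counts by host and index time; two forwarders or an ack-triggered resend will usually show distinct hosts or index times for the duplicates:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;... | eval indextime=strftime(_indextime, "%Y-%m-%d %H:%M:%S")
    | stats count by source, host, indextime
&lt;/CODE&gt;&lt;/PRE&gt;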

&lt;P&gt;Let me know if this helps!&lt;/P&gt;</description>
      <pubDate>Fri, 23 Oct 2015 19:03:18 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204682#M40386</guid>
      <dc:creator>muebel</dc:creator>
      <dc:date>2015-10-23T19:03:18Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data problem</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204683#M40387</link>
      <description>&lt;P&gt;Ah, that makes more sense.  Sorry, I didn't realize that this sourcetype is configured to index the entire file as one event.   Muebel's answer shows the way to proceed with troubleshooting. &lt;/P&gt;</description>
      <pubDate>Fri, 23 Oct 2015 19:07:30 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204683#M40387</guid>
      <dc:creator>sideview</dc:creator>
      <dc:date>2015-10-23T19:07:30Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data problem</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204684#M40388</link>
      <description>&lt;OL&gt;
&lt;LI&gt;There is only one event per file.&lt;/LI&gt;
&lt;LI&gt;I'm not using forwarders (I'm just monitoring a directory on the server)&lt;/LI&gt;
&lt;LI&gt;I don't know what indexing acknowledgement is, but I'm not forwarding anything.&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Fri, 23 Oct 2015 20:08:18 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204684#M40388</guid>
      <dc:creator>edrivera3</dc:creator>
      <dc:date>2015-10-23T20:08:18Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data problem</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204685#M40389</link>
      <description>&lt;P&gt;Do you need to include the &lt;CODE&gt;crcSalt&lt;/CODE&gt; setting? Best practice is to use it only as needed and not leave it set.&lt;BR /&gt;
Was it always there, or did you add it?&lt;BR /&gt;
That is likely causing the data to be re-indexed if the file name is the same.&lt;BR /&gt;
Try:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;your search | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | stats count by source, indextime
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 23 Oct 2015 23:54:17 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204685#M40389</guid>
      <dc:creator>mtranchita</dc:creator>
      <dc:date>2015-10-23T23:54:17Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data problem</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204686#M40390</link>
      <description>&lt;P&gt;Find any &lt;CODE&gt;outputs.conf&lt;/CODE&gt; files on your server (which, BTW, is a forwarder) and show us what is inside them (and where they are).  Let's say you have 2 indexers and you have configured your server to send the same events to each indexer separately.  That would cause this problem.  You can get more insight by modifying your test search to this:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt; ... | stats dc(splunk_server) count by source 
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sat, 24 Oct 2015 15:27:13 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204686#M40390</guid>
      <dc:creator>woodcock</dc:creator>
      <dc:date>2015-10-24T15:27:13Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data problem</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204687#M40391</link>
      <description>&lt;P&gt;I have only four outputs.conf files:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;find ./ -name "outputs.conf"
/etc/modules/distributedDeployment/classes/deployable/outputs.conf
/etc/system/default/outputs.conf
/etc/apps/SplunkLightForwarder/default/outputs.conf
/etc/apps/SplunkForwarder/default/outputs.conf
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;File at .../classes/deployable:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[tcpout]
disabled=false
# Replace 'YourDeploymentServerHostname' with the ip-address where your deployment server is running.
[tcpout:RouteMetricsToDeploymentServer]
disabled=false
server=YourDeploymentServerHostname:9997
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;File at /SplunkForwarder/default:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[tcpout]
maxQueueSize = 500kb
forwardedindex.0.whitelist = .*
forwardedindex.1.blacklist = _.*
forwardedindex.2.whitelist = (_audit|_introspection)
forwardedindex.filter.disable = false
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;File at /SplunkLightForwarder/default:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[tcpout]
forwardedindex.0.whitelist = .*
forwardedindex.1.blacklist = _.*
forwardedindex.2.whitelist = (_audit|_introspection)
forwardedindex.filter.disable = false
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;File at .../system/default:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[tcpout]
maxQueueSize = auto
forwardedindex.0.whitelist = .*
forwardedindex.1.blacklist = _.*
forwardedindex.2.whitelist = (_audit|_internal|_introspection)
forwardedindex.filter.disable = false
indexAndForward = false
autoLBFrequency = 30
blockOnCloning = true
compressed = false
disabled = false 
dropClonedEventsOnQueueFull = 5
dropEventsOnQueueFull = -1
heartbeatFrequency = 30
maxFailuresPerInterval = 2
secsInFailureInterval = 1
maxConnectionsPerIndexer = 2
forceTimebasedAutoLb = false
sendCookedData = true
connectionTimeout = 20
readTimeout = 300
writeTimeout = 300
tcpSendBufSz = 0
ackTimeoutOnShutdown = 30
useACK = false
blockWarnThreshold = 100
sslQuietShutdown = false

[syslog]
type = udp
priority = &amp;lt;13&amp;gt;
dropEventsOnQueueFull = -1
maxEventSize = 1024
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;&lt;CODE&gt;... | stats dc(splunk_server) count by source&lt;/CODE&gt; output:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt; source:                                  dc(splunk_server)                count
  &amp;lt;directory&amp;gt;/filename1       1                                        2
  &amp;lt;directory&amp;gt;/filename2       1                                        2
  &amp;lt;directory&amp;gt;/filename3       1                                        2
  &amp;lt;directory&amp;gt;/filename4       1                                        2
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;All dc(splunk_server) values are 1 and I haven't made any change in any of those outputs.conf files.&lt;/P&gt;</description>
      <pubDate>Mon, 26 Oct 2015 21:15:55 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204687#M40391</guid>
      <dc:creator>edrivera3</dc:creator>
      <dc:date>2015-10-26T21:15:55Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data problem</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204688#M40392</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;

&lt;P&gt;I included crcSalt because all the files are very similar, and if Splunk thinks they are the same they will not be indexed. crcSalt makes sure that all files with a different source (location) are indexed into Splunk. Also, if I disable crcSalt, then new files added to the directory will not be indexed.&lt;/P&gt;
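
&lt;P&gt;(For reference, a sketch of a stanza I could try instead, assuming the files only differ somewhere past their shared header: raising &lt;CODE&gt;initCrcLength&lt;/CODE&gt; makes Splunk read more bytes when computing the file CRC, so similar files are still distinguished without salting by source.)&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[monitor:///&amp;lt;directory&amp;gt;]
index = results
sourcetype = results
# hypothetical value; the default CRC length is 256 bytes
initCrcLength = 1024
&lt;/CODE&gt;&lt;/PRE&gt;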

&lt;P&gt;Output of your command:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;  source:                                  indextime                              count
   &amp;lt;directory&amp;gt;/filename1       2015-10-14 14:48:14           1
   &amp;lt;directory&amp;gt;/filename1       2015-10-16 10:27:25           1
   &amp;lt;directory&amp;gt;/filename2       2015-10-14 14:48:14           1
   &amp;lt;directory&amp;gt;/filename2       2015-10-16 10:27:25           1
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The output showed that those files were re-indexed the next day, causing the problem. I remembered that on that day I added the crcSalt configuration because, due to their similarity, I wasn't able to index all the files. Once I added the configuration, all files were indexed. It looks like Splunk re-indexed all the files even though files with the same SOURCE value were already indexed. &lt;/P&gt;

&lt;P&gt;This means that Splunk will ignore whatever is already indexed if the inputs.conf file is changed (adding crcSalt changes each file's CRC, so every file looks new). Thanks for your help. Now, how could I solve this issue? &lt;/P&gt;</description>
      <pubDate>Mon, 26 Oct 2015 21:44:42 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Duplicate-data-problem/m-p/204688#M40392</guid>
      <dc:creator>edrivera3</dc:creator>
      <dc:date>2015-10-26T21:44:42Z</dc:date>
    </item>
  </channel>
</rss>

