Getting Data In

Duplicate data because of file parts

strive
Influencer

Hi,

I took 6 log files. The sum of events from all the log files is 10666.

I added the log files to my forwarder node.

When I searched index=my_raw_index, it showed 21332 events, double the actual count.

When I checked the sources, there were 12 instead of 6. The extra sources are file parts of the actual log files, for example mylog.log-20130514.filepart.

If I run the query index=my_raw_index | where like(source, "%20130514"), it returns 10666.

Since the files are huge, it took some time for them to be copied completely.

I have the following settings in the inputs.conf file on the forwarder node.

[monitor:///home/data/aaa/bbb/*]
disabled = false
sourcetype = bbb_ccc
index = my_raw_index
crcSalt = SOURCE

Note: In the actual file, the crcSalt value is wrapped in angle brackets (<SOURCE>). I left them out here because the value was not displayed otherwise.

How can I avoid indexing the .filepart files? What settings should be enabled so that data is not indexed twice?

Thanks

Strive

1 Solution

the_wolverine
Champion
  1. You should most likely blacklist the *.filepart files, since they are partial files. You can do this by adding the following line to your monitor stanza:

blacklist = \.(filepart)$

  2. Remove "crcSalt = <SOURCE>".

  3. You'll need to re-index those log files. Splunk has already seen them and will not re-index them unless you do something like clean the index (if that's possible on this index).
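
To sanity-check the blacklist pattern before restarting the forwarder, it can be tried outside Splunk. The sketch below uses Python's re module as a stand-in for Splunk's regex engine (the two should agree for a pattern this simple) and assumes the blacklist is matched against the full file path. The helper name is_blacklisted and the example paths are hypothetical, modelled on the file name in the question.

```python
import re

# Same pattern as the blacklist line in the monitor stanza.
BLACKLIST = re.compile(r"\.(filepart)$")

def is_blacklisted(path: str) -> bool:
    """Return True if the pattern matches anywhere in the path.

    Assumption: the monitor input applies the blacklist to the full path
    as an unanchored search, which is why the pattern ends with `$`.
    """
    return BLACKLIST.search(path) is not None

# Hypothetical paths modelled on the question.
paths = [
    "/home/data/aaa/bbb/mylog.log-20130514",
    "/home/data/aaa/bbb/mylog.log-20130514.filepart",
]
for p in paths:
    print(p, "->", "skipped" if is_blacklisted(p) else "monitored")
```

Only the path ending in .filepart should be reported as skipped; the finished log file is still monitored.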


strive
Influencer

The combined size of 6 log files is 4.5 MB.

In production the combined size of log files would be around 8 MB

strive
Influencer

Thanks a lot, kristian. Your suggestion to copy the files to a temp folder first makes sense.

BansodeSantosh
Explorer

This solution worked. Thanks, @the_wolverine.

kristian_kolb
Ultra Champion

Did you clean out the fishbucket as well? Unless you do so, Splunk will not re-index the files.

That is an index (which can be cleaned) where Splunk stores what it has already seen (files, offset pointers). Beware, though, that if you clean it, Splunk will re-index every file it has been configured to monitor (if they are still there).
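
To make the interaction between this "already seen" record and crcSalt more concrete, here is a minimal sketch. It assumes Splunk identifies a file by a checksum of its first 256 bytes (the documented default for initCrcLength) and that crcSalt = <SOURCE> mixes the full path into that checksum. zlib.crc32 is only a stand-in for Splunk's real checksum, and file_key and the paths are hypothetical names used for illustration.

```python
import zlib

INIT_CRC_LENGTH = 256  # documented default of initCrcLength

def file_key(content: bytes, path: str, salt_with_source: bool) -> int:
    """Checksum of the head of a file, optionally salted with its path.

    zlib.crc32 is a stand-in; Splunk's actual checksum is different.
    """
    head = content[:INIT_CRC_LENGTH]
    if salt_with_source:
        head = path.encode() + head
    return zlib.crc32(head)

content = b"2013-05-14 00:00:01 first event\n" * 20
partial = "/home/data/aaa/bbb/mylog.log-20130514.filepart"
final = "/home/data/aaa/bbb/mylog.log-20130514"

# Without the salt, the renamed file keeps its key and is recognised.
same_unsalted = file_key(content, partial, False) == file_key(content, final, False)
# With the salt, the rename changes the key, so the file looks new.
same_salted = file_key(content, partial, True) == file_key(content, final, True)
print("recognised after rename without salt:", same_unsalted)
print("recognised after rename with salt:", same_salted)
```

Under these assumptions, a .filepart file that is renamed once the copy finishes looks like a brand-new file when the path is part of the checksum, which is consistent with every event being counted twice.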

Oh, and for reasons that you've just experienced, you should not copy huge files over the network directly into a monitored folder. It's better to copy them to a temp folder (on the same file system) and then move them into the monitored folder.
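
A minimal sketch of that copy-then-move approach, assuming the staging folder and the monitored folder are on the same file system so that the final step is a rename rather than a second copy. The function name deliver and the directory names are hypothetical.

```python
import os
import shutil
import tempfile

def deliver(src: str, monitored_dir: str, staging_dir: str) -> str:
    """Copy src into staging_dir, then rename it into monitored_dir.

    staging_dir must be on the same file system as monitored_dir,
    otherwise os.replace cannot perform a plain rename.
    """
    name = os.path.basename(src)
    staged = os.path.join(staging_dir, name)
    shutil.copyfile(src, staged)   # the slow part happens outside the monitored folder
    final = os.path.join(monitored_dir, name)
    os.replace(staged, final)      # rename; atomic on POSIX within one file system
    return final

if __name__ == "__main__":
    # Self-contained demo with throwaway directories.
    with tempfile.TemporaryDirectory() as root:
        incoming = os.path.join(root, "incoming")
        staging = os.path.join(root, "staging")
        monitored = os.path.join(root, "monitored")
        for d in (incoming, staging, monitored):
            os.mkdir(d)
        src = os.path.join(incoming, "mylog.log-20130514")
        with open(src, "w") as f:
            f.write("2013-05-14 00:00:01 first event\n")
        deliver(src, monitored, staging)
        print(os.listdir(monitored))
```

With this pattern the monitored folder only ever sees the complete file, so no partial file is picked up there.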

/k

strive
Influencer

It's working. Thank you.

strive
Influencer

I cleaned the index and added blacklist = \.(filepart)$
I did not remove crcSalt = <SOURCE>.

Data is not getting indexed.

the_wolverine
Champion

You should really only use batch for a one-time read of log files that are then deleted. Please refer to the documentation for batch input to confirm whether that's what you really want to do.

strive
Influencer

Thank you for your response. I will check your solution.
The combined size of the 6 log files is 4.5 MB.

Should I use [batch] rather than monitor in this scenario? In the production environment it will actually be around 7 MB.
