Getting Data In

Duplicate data because of file parts

strive
Influencer

Hi,

I took 6 log files. The sum of events from all the log files is 10666.

I added the log files into my forwarder node.

When I checked the index with index=my_raw_index, it showed 21332 events, double the actual count.

When I checked the sources, there were 12 instead of 6. Some of the sources are file parts of the actual log files, for example mylog.log-20130514.filepart

If i run a query: index=my_raw_index | where like(source, "%20130514"), it gives me 10666.

Since the file size is huge, it took some time for the log files to be copied completely.

I have the following settings in the inputs.conf file on the forwarder node.

[monitor:///home/data/aaa/bbb/*]
disabled = false
sourcetype = bbb_ccc
index = my_raw_index
crcSalt = SOURCE

Note: In the actual file, the crcSalt value is wrapped in angle brackets (<SOURCE>). I left them out above because the value was not being displayed with them.

How can I avoid indexing these .filepart files? What settings should be enabled so that data is not indexed twice?

Thanks

Strive

1 Solution

the_wolverine
Champion
  1. You most likely should be blacklisting the *.filepart files since they are partial files. You can do this by adding the following line to your monitor stanza:

blacklist = \.(filepart)$

  2. Remove "crcSalt = <SOURCE>".

  3. You'll need to re-index those log files, as Splunk has already seen them and will not re-index them unless you do something like clean the index (if that's possible on this index).
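
To sanity-check the blacklist pattern before restarting the forwarder, a quick sketch like the one below can help. It uses Python's re module as a stand-in for Splunk's regex matching (an assumption; the two should agree for a pattern this simple), and the paths are hypothetical, modelled on the file name in the question.

```python
import re

# Pattern taken from the blacklist line above. Python's re is only a
# stand-in for Splunk's own regex engine (assumption).
BLACKLIST = re.compile(r"\.(filepart)$")


def is_blacklisted(path: str) -> bool:
    """Return True if the pattern matches, i.e. the file would be skipped."""
    return BLACKLIST.search(path) is not None


# Hypothetical paths modelled on the file names in the question.
paths = [
    "/home/data/aaa/bbb/mylog.log-20130514",
    "/home/data/aaa/bbb/mylog.log-20130514.filepart",
]
skipped = [p for p in paths if is_blacklisted(p)]
kept = [p for p in paths if not is_blacklisted(p)]

print("skipped:", skipped)
print("kept:", kept)
```

Only the .filepart path should end up in the skipped list; the completed log file stays eligible for monitoring.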

strive
Influencer

The combined size of 6 log files is 4.5 MB.

In production the combined size of log files would be around 8 MB

strive
Influencer

Thanks a lot, kristian. Your suggestion to copy the files to a temp folder first makes sense.

BansodeSantosh
Explorer

This solution worked...thanks #the_wolverine

kristian_kolb
Ultra Champion

Did you clean out the fishbucket as well? Unless you do so, Splunk will not re-index the files.

The fishbucket is an index (which can be cleaned) where Splunk stores what it has already seen (files, offset pointers). Beware, though, that if you clean it, Splunk will re-index every file it has been configured to monitor (if the files are still there).

Oh, for reasons that you've just experienced, you should not copy huge files over the network directly into a monitored folder. It's better to copy it to a temp folder (on the same file system) and then move it into the monitored folder.
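
The copy-then-move approach could look roughly like this. It is a minimal sketch, not a definitive implementation: the function name stage_then_move and the directory names are made up for illustration, and it assumes the staging folder sits on the same file system as the monitored folder so that the final step is a rename rather than a second copy.

```python
import os
import shutil
import tempfile


def stage_then_move(src: str, staging_dir: str, monitored_dir: str) -> str:
    """Copy src into staging_dir, then rename it into monitored_dir.

    Assumes staging_dir and monitored_dir are on the same file system,
    so os.replace is a rename and the file shows up complete.
    """
    name = os.path.basename(src)
    staged = os.path.join(staging_dir, name)
    final = os.path.join(monitored_dir, name)
    shutil.copy2(src, staged)   # the slow copy happens outside the monitored folder
    os.replace(staged, final)   # the finished file is renamed into place
    return final


if __name__ == "__main__":
    # Demo with throwaway directories (hypothetical layout).
    with tempfile.TemporaryDirectory() as root:
        incoming = os.path.join(root, "incoming")
        staging = os.path.join(root, "staging")
        monitored = os.path.join(root, "monitored")
        for d in (incoming, staging, monitored):
            os.makedirs(d)

        src = os.path.join(incoming, "mylog.log-20130514")
        with open(src, "w") as fh:
            fh.write("event 1\nevent 2\n")

        print(stage_then_move(src, staging, monitored))
        print(os.listdir(monitored))
```

With this pattern the monitored folder never contains a partially written file, so there is nothing for the forwarder to pick up early.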

/k

strive
Influencer

It's working. Thank you.

strive
Influencer

I cleaned the index and added blacklist = .(filepart)$
I did not remove crcSalt = <SOURCE>.

Data is not getting indexed.

the_wolverine
Champion

You should really only use batch for a one-time read of log files that can be deleted afterwards. Please refer to the documentation for batch input to confirm that this is what you really want to do.

strive
Influencer

Thank you for your response. I will check your solution.
The combined size of 6 log files is 4.5 MB.

Should I use [batch] rather than [monitor] in this scenario? In the production environment, the combined size will actually be around 7 MB.
