I have a situation where I need to index batch job output into Splunk.
The output looks like:
/data/20160711/file.log <---a
/data/20160712/file.log <---b
/data/20160713/file.log <---c
Every day, the batch job copies file.log into a new dated subfolder under /data. However, file.log is rotated by size rather than by date, which means that if it hasn't rotated, files a, b, and c can contain duplicate data.
Right now I have a forwarder monitoring everything under /data, but it has caused quite a bit of duplication.
What's the best way to index only each day's new data from each file?
If the order of the files doesn't change, removing crcSalt = <SOURCE>
should fix your problem. If the top of the file is changing when it rolls to a new directory, there is nothing you can do with Splunk natively to fix this problem.
Is there any reason you cannot just monitor the file directly from its original location? That would be the simplest approach.
If we could, we would...
right now we can only rely on this batch process to pull the data from 1000+ workstations to a central location.
Install the universal forwarder on the workstations or change the batch job to rotate by date. The latter is probably simpler.
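If the universal forwarder route ever becomes possible, the workstation-side input could be a plain monitor of the log in its original location, roughly like the sketch below. The path here is hypothetical and the index/sourcetype are placeholders, since the original location hasn't been shared:
[monitor:///path/to/original/file.log]
index = your_index
sourcetype = your_sourcetype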
It involves four parties to get that done... believe me, I am trying...
You should share your inputs.conf stanza for these files
For now it's very simple:
[monitor:///data/.../file.log]
index = myIndex
sourcetype = myType
crcSalt = <SOURCE>
crcSalt is causing your problem. You do not want to use it in a case like this.
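In other words, with crcSalt removed the stanza would presumably just be (same path, index, and sourcetype as above, nothing else changed):
[monitor:///data/.../file.log]
index = myIndex
sourcetype = myType
With that, Splunk identifies each day's file.log by its initial content, recognizes it as a file it has already seen, and picks up only the newly appended data.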
How so, please?
/data/20160711/file.log <---a
with events:
day1.0
day1.1
/data/20160712/file.log <---b
with events:
day1.0
day1.1
day2.0
day2.1
day2.2
/data/20160713/file.log <---c
with events:
day1.0
day1.1
day2.0
day2.1
day2.2
day3.0
day3.1
My goal is to only index
day1.0
day1.1
day2.0
day2.1
day2.2
day3.0
day3.1
but what I am getting on the 3rd day is:
day1.0
day1.0
day1.0
day1.1
day1.1
day1.1
day2.0
day2.0
day2.1
day2.1
day2.2
day3.0
day3.1
That's (at least partly) because crcSalt = <SOURCE> uses the path of the file to determine the hash that represents the file. Every time you move it to a new directory, Splunk says "hmm, the hash has changed! New file! Eat it up!" Normally, Splunk only considers the first 256 bytes of a file to identify it. If that is the same, it will check the mod time, seek to the last-read spot, and read from there.
I understand what you mean, but each day the batch script dumps a new file into the /data folder. That file contains both the historical data and the new data.
I don't think you do. By using crcSalt = <SOURCE>
you are guaranteeing that Splunk will re-read the entire file. Period. You should not be using it.
What value would you suggest using?
Without crcSalt, Splunk will ignore file.log the second time it sees it, because the files have the same CRC value.
From http://docs.splunk.com/Documentation/Splunk/6.4.1/Admin/Inputsconf
(Splunk only performs CRC checks against, by default, the first 256 bytes
a file. This behavior prevents Splunk from indexing the same file twice,
even though you may have renamed it -- as, for example, with rolling log
files. However, because the CRC is based on only the first few lines of
the file, it is possible for legitimately different files to have matching
CRCs, particularly if they have identical headers.)
It won't ignore the file. It will treat it as an already-seen file. If the file's timestamp changes, it will seek to the latest point it read previously and read from there.
You are absolutely right!!
Just tested it out and it works!
It only keeps the first copy Splunk sees as the source, but that doesn't matter in my situation!
Thanks a lot TWINSPOP!!
Excellent! Glad I could help.