Solved: How to index only one day of data from a batch out...

antonyhan · ‎07-14-2016

I need to index batch output into Splunk.

The output looks like:
/data/20160711/file.log <---a
/data/20160712/file.log <---b
/data/20160713/file.log <---c

Every day, the batch job copies file.log into a new dated subfolder under /data. However, file.log is rotated by size, not by date, which means that until it rotates, files a, b and c can contain duplicate data.
Right now I have a forwarder monitoring everything under /data, but it has caused quite a lot of duplication.

What's the best way to just index the data for the day from each file?

twinspop · ‎07-14-2016

If the order of the files doesn't change, removing crcSalt = <SOURCE> should fix your problem. If the top of the file is changing when it rolls to a new directory, there is nothing you can do with Splunk natively to fix this problem.

lycollicott · ‎07-14-2016

Is there any reason you cannot just monitor the file directly from its original location? That would be the simplest approach.

antonyhan · ‎07-14-2016

If we could, we would...
Right now we can only rely on this batch process to pull the data from 1000+ workstations to a central location.

lycollicott · ‎07-14-2016

Install the universal forwarder on the workstations or change the batch job to rotate by date. The latter is probably simpler.

antonyhan · ‎07-14-2016

It involves 4 parties to do it... believe me, I am trying.

twinspop · ‎07-14-2016

You should share your inputs.conf stanza for these files.

antonyhan · ‎07-14-2016

For now it's very simple:

[monitor:///data/.../file.log]
index = myIndex
sourcetype = myType
crcSalt = <SOURCE>

twinspop · ‎07-14-2016

crcSalt is causing your problem. You do not want to use it in a case like this.

antonyhan · ‎07-14-2016

How so, please?

/data/20160711/file.log <---a
with events:
day1.0
day1.1
/data/20160712/file.log <---b
with events:
day1.0
day1.1
day2.0
day2.1
day2.2
/data/20160713/file.log <---c
with events:
day1.0
day1.1
day2.0
day2.1
day2.2
day3.0
day3.1

My goal is to only index
day1.0
day1.1
day2.0
day2.1
day2.2
day3.0
day3.1

but what I am getting on the 3rd day is:
day1.0
day1.0
day1.0
day1.1
day1.1
day1.1
day2.0
day2.0
day2.1
day2.1
day2.2
day3.0
day3.1

twinspop · ‎07-14-2016

That's (at least partly) because crcSalt uses the path of the file to determine the hash that represents the file. Every time you move it to a new directory, splunk says "hmm, the hash has changed! new file! eat it up!" Normally, splunk only considers the first 256 bytes of a file to ID it. If that is the same, it will check mod time, seek to the last read spot, and input from there.
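
A minimal sketch of the identification step described above, for illustration only. It assumes the check is a CRC over the first 256 bytes and that crcSalt = <SOURCE> simply mixes the file path into that CRC. The function name `fingerprint`, the use of `zlib.crc32`, the way the path is prepended, and the padded sample lines are all assumptions made for this example, not Splunk's actual implementation.

```python
import zlib

def fingerprint(content: bytes, source_path: str, salt_with_source: bool) -> int:
    """Toy file identity: CRC of the first 256 bytes, optionally salted with the path."""
    head = content[:256]
    if salt_with_source:
        # Assumption: the salt is just mixed into the data that gets hashed.
        head = source_path.encode() + head
    return zlib.crc32(head)

# Two daily copies that share the same history; lines are padded so that
# the shared history is longer than 256 bytes.
history = ("day1.0 " + "x" * 150 + "\n" + "day1.1 " + "x" * 150 + "\n").encode()
file_a = history
file_b = history + b"day2.0 new event\n"

path_a = "/data/20160711/file.log"
path_b = "/data/20160712/file.log"

same_without_salt = fingerprint(file_a, path_a, False) == fingerprint(file_b, path_b, False)
same_with_salt = fingerprint(file_a, path_a, True) == fingerprint(file_b, path_b, True)

print(same_without_salt, same_with_salt)  # prints: True False
```

Without the salt, both copies look like the same file because their first 256 bytes match. With the path mixed in, every new directory produces a new identity, so the copy is treated as a brand-new file.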

antonyhan · ‎07-14-2016

I understand what you mean, but each day the batch script will dump a new file into the /data folder. That file will contain both the historical data and the new data.

twinspop · ‎07-14-2016

I don't think you do. By using crcSalt = <SOURCE> you are guaranteeing that Splunk will re-read the entire file. Period. You should not be using it.

antonyhan · ‎07-14-2016

What value would you suggest using?
Without crcSalt, Splunk will ignore file.log the 2nd time it sees it, because both copies have the same CRC value.
From http://docs.splunk.com/Documentation/Splunk/6.4.1/Admin/Inputsconf
(Splunk only performs CRC checks against, by default, the first 256 bytes
a file. This behavior prevents Splunk from indexing the same file twice,
even though you may have renamed it -- as, for example, with rolling log
files. However, because the CRC is based on only the first few lines of
the file, it is possible for legitimately different files to have matching
CRCs, particularly if they have identical headers.)

twinspop · ‎07-14-2016

It won't ignore the file. It will treat it as an already seen file. If the file's timestamp changes, it will seek to the latest point it saw previously, and read from there.
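
The seek-and-resume behavior can be sketched with a toy model of the three daily copies from earlier in the thread. This is an illustration under stated assumptions, not Splunk's real logic: the `Tailer` class, `make_line`, `run`, and the CRC via `zlib.crc32` are hypothetical names and choices made for this example, and the model skips checks a real monitor would perform (for instance, verifying that the already-read portion of the file is unchanged).

```python
import zlib

def make_line(tag: str) -> bytes:
    # Pad each event so the shared history exceeds 256 bytes.
    return f"{tag} {'x' * 150}\n".encode()

class Tailer:
    """Toy monitor that remembers how many bytes of each known file it has read."""

    def __init__(self, salt_with_source: bool):
        self.salt_with_source = salt_with_source
        self.offsets = {}   # fingerprint -> bytes already read
        self.indexed = []   # event tags, in the order they were indexed

    def _fingerprint(self, path: str, content: bytes) -> int:
        head = content[:256]
        if self.salt_with_source:
            head = path.encode() + head  # assumption: path is mixed into the CRC
        return zlib.crc32(head)

    def consume(self, path: str, content: bytes) -> None:
        key = self._fingerprint(path, content)
        start = self.offsets.get(key, 0)  # known file: resume; new file: start at 0
        for line in content[start:].splitlines():
            self.indexed.append(line.split()[0].decode())
        self.offsets[key] = len(content)

day1 = make_line("day1.0") + make_line("day1.1")
day2 = day1 + make_line("day2.0") + make_line("day2.1") + make_line("day2.2")
day3 = day2 + make_line("day3.0") + make_line("day3.1")
snapshots = [
    ("/data/20160711/file.log", day1),
    ("/data/20160712/file.log", day2),
    ("/data/20160713/file.log", day3),
]

def run(salt_with_source: bool) -> list:
    tailer = Tailer(salt_with_source)
    for path, content in snapshots:
        tailer.consume(path, content)
    return tailer.indexed

unsalted = run(False)  # each event indexed once
salted = run(True)     # each event indexed once per copy that contains it
print(len(unsalted), len(salted))  # prints: 7 14
```

In the unsalted run, each new copy is recognized as a file already seen, so only the bytes past the stored offset are read. In the salted run, every copy is read from the start, which is where the duplicates come from.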

antonyhan · ‎07-14-2016

You are absolutely right!!
Just tested it out and it works!
It only keeps the first copy Splunk sees as the source, but that doesn't matter in my situation!

Thanks a lot TWINSPOP!!

twinspop · ‎07-14-2016

Excellent! Glad I could help.
