How to index only one day of data from a batch output?

antonyhan
Path Finder

I need to index batch output into Splunk.

The output looks like:
/data/20160711/file.log <---a
/data/20160712/file.log <---b
/data/20160713/file.log <---c

Every day, the batch job copies file.log into a new dated subfolder under /data. However, file.log is rotated by size, not by date, which means that until it rotates, files a, b, and c can contain duplicate data.
Right now I have a forwarder monitoring everything under /data, but it has caused quite a lot of duplication.

What's the best way to just index the data for the day from each file?

1 Solution

twinspop
Influencer

If the order of the files doesn't change, removing crcSalt = <SOURCE> should fix your problem. If the top of the file is changing when it rolls to a new directory, there is nothing you can do with Splunk natively to fix this problem.

lycollicott
Motivator

Is there any reason you cannot just monitor the file directly from its original location? That would be the simplest approach.

antonyhan
Path Finder

If we could, we would...
Right now we can only rely on this batch process to pull the data from 1000+ workstations to a central location.

lycollicott
Motivator

Install the universal forwarder on the workstations or change the batch job to rotate by date. The latter is probably simpler.

antonyhan
Path Finder

It involves 4 parties to do it... believe me, I am trying.

twinspop
Influencer

You should share your inputs.conf stanza for these files.

antonyhan
Path Finder

For now it's very simple:

[monitor:///data/.../file.log]
index = myIndex
sourcetype = myType
crcSalt = <SOURCE>

twinspop
Influencer

crcSalt is causing your problem. You do not want to use it in a case like this.

antonyhan
Path Finder

How so, please?

/data/20160711/file.log <---a
with events:
day1.0
day1.1
/data/20160712/file.log <---b
with events:
day1.0
day1.1
day2.0
day2.1
day2.2
/data/20160713/file.log <---c
with events:
day1.0
day1.1
day2.0
day2.1
day2.2
day3.0
day3.1

My goal is to index only:
day1.0
day1.1
day2.0
day2.1
day2.2
day3.0
day3.1

but what I am getting on the 3rd day is:
day1.0
day1.0
day1.0
day1.1
day1.1
day1.1
day2.0
day2.0
day2.1
day2.1
day2.2
day3.0
day3.1

twinspop
Influencer

That's (at least partly) because crcSalt = <SOURCE> adds the path of the file to the hash that identifies the file. Every time the file lands in a new directory, Splunk says "hmm, the hash has changed! new file! eat it up!" Normally, Splunk only considers the first 256 bytes of a file to ID it. If those are the same, it will check the mod time, seek to the last read spot, and read from there.
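
To make the point above concrete, here is a minimal Python sketch of the identification step. It is a toy model, not Splunk's actual code: `file_key`, the use of `zlib.crc32`, and the sample file contents are all assumptions made for illustration. Only the 256-byte default and the idea that the source path is mixed into the checksum come from this thread.

```python
import zlib

INIT_CRC_LENGTH = 256  # bytes of the file head that are checksummed by default

def file_key(path: str, content: bytes, salt_with_source: bool = False) -> int:
    """Toy stand-in for how a monitored file is identified (not Splunk's real code)."""
    head = content[:INIT_CRC_LENGTH]
    if salt_with_source:
        # Models crcSalt = <SOURCE>: the full path is mixed into the checksum.
        head = path.encode() + head
    return zlib.crc32(head)

# Hypothetical file contents: day 2's copy starts with all of day 1's data.
day1 = b"".join(b"day1.%d some event text padding\n" % i for i in range(20))
day2 = day1 + b"day2.0 some event text padding\n"

path_a = "/data/20160711/file.log"
path_b = "/data/20160712/file.log"

same_without_salt = file_key(path_a, day1) == file_key(path_b, day2)
same_with_salt = file_key(path_a, day1, True) == file_key(path_b, day2, True)

print("same file without crcSalt:", same_without_salt)
print("same file with crcSalt = <SOURCE>:", same_with_salt)
```

In this model the two copies share a key when no salt is used, because their first 256 bytes match, and get different keys once the path is salted in, which is why every dated copy looks brand new.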

antonyhan
Path Finder

I understand what you mean, but each day the batch script will dump a new file into the /data folder. That file will have the historical data and the new data.

twinspop
Influencer

I don't think you do. By using crcSalt = <SOURCE> you are guaranteeing that Splunk will re-read the entire file. Period. You should not be using it.

antonyhan
Path Finder

What value would you suggest using?
Without crcSalt, Splunk will ignore file.log the 2nd time it sees it, because the copies have the same CRC value.
From http://docs.splunk.com/Documentation/Splunk/6.4.1/Admin/Inputsconf
(Splunk only performs CRC checks against, by default, the first 256 bytes
a file. This behavior prevents Splunk from indexing the same file twice,
even though you may have renamed it -- as, for example, with rolling log
files. However, because the CRC is based on only the first few lines of
the file, it is possible for legitimately different files to have matching
CRCs, particularly if they have identical headers.)

twinspop
Influencer

It won't ignore the file. It will treat it as an already-seen file. If the file's timestamp changes, it will seek to the latest point it saw previously and read from there.
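
The resume-from-last-position behaviour can be sketched with the three example files from earlier in the thread. This is a simplified model under stated assumptions: `index_files` and `event` are hypothetical helpers, `zlib.crc32` stands in for the real checksum, events are padded so each file is longer than 256 bytes, and the modification-time check is left out.

```python
import zlib

def event(name: str) -> bytes:
    # Pad each event so that even the first day's file exceeds 256 bytes.
    return (name + " " + "x" * 200 + "\n").encode()

def index_files(files, salt_with_source):
    """Toy model of a monitor input; returns the names of the events it would index."""
    seen = {}      # file key -> number of bytes already read
    indexed = []
    for path, content in files:
        head = content[:256]
        if salt_with_source:
            head = path.encode() + head   # models crcSalt = <SOURCE>
        key = zlib.crc32(head)
        offset = seen.get(key, 0)         # 0 means "never seen, read from the top"
        for line in content[offset:].decode().splitlines():
            indexed.append(line.split()[0])
        seen[key] = len(content)
    return indexed

day1 = event("day1.0") + event("day1.1")
day2 = day1 + event("day2.0") + event("day2.1") + event("day2.2")
day3 = day2 + event("day3.0") + event("day3.1")
files = [
    ("/data/20160711/file.log", day1),
    ("/data/20160712/file.log", day2),
    ("/data/20160713/file.log", day3),
]

without_salt = index_files(files, salt_with_source=False)
with_salt = index_files(files, salt_with_source=True)
print(without_salt)
print(len(with_salt))
```

Without the salt, the model indexes each event once, because every later copy is recognized and read only from the previous end position. With the salt, each copy is read from the top, so the older events are indexed again.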

antonyhan
Path Finder

You are absolutely right!!
Just tested it out and it works!
It only keeps the first copy Splunk sees as the source, but that doesn't matter in my situation!

Thanks a lot TWINSPOP!!

twinspop
Influencer

Excellent! Glad I could help.
