
Avoid duplicate data and ignore # fields

kmattern — Tue, 01 Oct 2013 20:00:08 GMT

I have customer systems that log data to IIS on file transfers. IIS has a timeout of 20 minutes. When it times out, it immediately restarts but writes a new set of headers into the log. The date/time stamp on the log also changes, and Splunk assumes that it is a new file.

How can I avoid duplicating data when Splunk attempts to re-index the log, or how do I get Splunk to consume only the new data? And how do I ignore the headers scattered throughout the log file?

Re: Avoid duplicate data and ignore # fields

lukejadamec — Tue, 01 Oct 2013 21:50:18 GMT

There are two problems here. First, you can remove the extra header lines with additions to inputs.conf, props.conf, and transforms.conf.

Note: I’m using a new sourcetype, so I need a stanza in inputs.conf. If you want to use the existing sourcetype in inputs.conf, then you will need to specify that sourcetype in props.conf (i.e. substitute my winIIS with the sourcetype found in your inputs.conf).

inputs.conf

[monitor://c:\inetpub\logs\Logfiles\W3SVC1\*.log]
sourcetype = winIIS
queue = parsingQueue
index = default
disabled = false

props.conf

[winIIS]
SHOULD_LINEMERGE = false
CHECK_FOR_HEADER = false
REPORT-fields = windows_iis_header
TRANSFORMS-headers = remove_headers

transforms.conf

[remove_headers]
REGEX = ^#.*
DEST_KEY = queue
FORMAT = nullQueue

[windows_iis_header]
# Complete this list so that it matches the header configuration of your logs.
FIELDS = "date","time","s_ip",…..
DELIMS = " "

Here is another example of the same:
http://answers.splunk.com/answers/24986/iis-log-fields-not-parsing
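To make the effect of those two transforms easier to picture, here is a rough Python sketch. It is not Splunk code, only an imitation of the idea: lines matching ^# are dropped (as the nullQueue routing does), and the remaining lines are split on spaces and paired with field names. The function name, the three-entry field list, and the sample lines are made up for illustration.

```python
import re

HEADER_RE = re.compile(r"^#.*")    # same pattern as the remove_headers transform
FIELDS = ["date", "time", "s_ip"]  # illustrative subset, not a complete IIS field list


def parse_lines(lines):
    """Drop '#' header lines, then split the remaining lines on spaces."""
    events = []
    for line in lines:
        if HEADER_RE.match(line):
            continue  # the transform would route this event to nullQueue
        events.append(dict(zip(FIELDS, line.split(" "))))
    return events


# Made-up sample with a header repeated in the middle of the file.
sample = [
    "#Version: 1.0",
    "#Fields: date time s-ip",
    "2013-10-01 20:00:08 10.0.0.5",
    "#Fields: date time s-ip",
    "2013-10-01 20:20:09 10.0.0.5",
]
events = parse_lines(sample)
```

Only the two data lines survive, each as a dictionary keyed by the names in FIELDS.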

As for the duplication problem, I’ve not seen that. It is normal for the file’s timestamp to update, and that alone should not cause a re-read of the file. Splunk checksums the beginning of the file (the first 256 bytes by default), so if that part does not change, the file should not be re-read. I’m guessing a setting in your inputs.conf is causing it. Can you post your inputs.conf?
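The idea of identifying a file by a checksum of its beginning can be sketched in Python as follows. This is a simplification and not Splunk's actual implementation. The 256-byte length is an assumption taken from the documented default of initCrcLength, and head_crc and the sample content are made up.

```python
import zlib

INIT_CRC_LENGTH = 256  # assumption: mirrors the documented initCrcLength default


def head_crc(data: bytes, length: int = INIT_CRC_LENGTH) -> int:
    """Checksum of only the first `length` bytes of a file's content."""
    return zlib.crc32(data[:length])


# Made-up log content, longer than 256 bytes.
original = b"#Fields: date time s-ip\n" + b"2013-10-01 20:00:08 10.0.0.5\n" * 20
appended = original + b"2013-10-01 20:20:09 10.0.0.5\n"  # new data at the end
changed_head = b"!" + original[1:]                        # first byte altered

looks_like_same_file = head_crc(original) == head_crc(appended)
looks_like_new_file = head_crc(original) != head_crc(changed_head)
```

Appending to the file leaves the checksum of the head unchanged, so the file still looks like the same file. Any change within the first 256 bytes makes it look like a new one, regardless of the modification time.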

Re: Avoid duplicate data and ignore # fields

ogdin — Thu, 13 Feb 2014 16:31:31 GMT

In Splunk 6, set INDEXED_EXTRACTIONS = W3C for the sourcetype in props.conf. We will honor the header found at the top of the file and ignore any line beginning with a # after that. Plus, we do the field extraction automatically from the header, so you don't have to maintain field lists in props and transforms.

http://docs.splunk.com/Documentation/Splunk/latest/Data/Extractfieldsfromfileheadersatindextime
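For anyone wondering what header-driven extraction amounts to, here is a small Python imitation of the behavior described above. It is a sketch, not how Splunk implements it: field names come from the first #Fields: line, every later line starting with # is skipped, and each data line is paired with those names. The function name and sample lines are made up.

```python
def parse_w3c(lines):
    """Use the first '#Fields:' line for names and skip every other '#' line."""
    fields, events = [], []
    for line in lines:
        if line.startswith("#Fields:") and not fields:
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#") or not line.strip():
            continue  # repeated headers and blank lines are ignored
        else:
            events.append(dict(zip(fields, line.split())))
    return events


# Made-up sample: IIS restarts and writes the header block a second time.
sample = [
    "#Version: 1.0",
    "#Fields: date time s-ip",
    "2013-10-01 20:00:08 10.0.0.5",
    "#Version: 1.0",
    "#Fields: date time s-ip",
    "2013-10-01 20:20:09 10.0.0.5",
]
events = parse_w3c(sample)
```

The second header block produces no events, and both data lines get their field names from the first header.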

Re: Avoid duplicate data and ignore # fields

wsnyder2 — Wed, 08 Jun 2016 13:49:49 GMT

We use the following line in the sourcetype stanza for IIS in props.conf.

SEDCMD-THROWAWAY-COMMENTS=s/^#.+[\r\n]+#.+[\r\n]+#.+[\r\n]+#.*[\r\n]//g
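To check what that expression removes, the same regex can be tried in Python. This is a sketch with made-up sample text, and Python's re module is only an approximation of the regex engine Splunk uses. The pattern expects a block of four consecutive # lines at the start of the text.

```python
import re

# Same pattern as in the SEDCMD line. The sed "g" flag is omitted because,
# in this Python version, "^" can only match at the start of the string.
HEADER_BLOCK = re.compile(r"^#.+[\r\n]+#.+[\r\n]+#.+[\r\n]+#.*[\r\n]")


def strip_header_block(raw: str) -> str:
    """Remove a leading block of four '#' lines, if present."""
    return HEADER_BLOCK.sub("", raw)


# Made-up event text that starts with a four-line header block.
with_header = (
    "#Software: IIS\r\n"
    "#Version: 1.0\r\n"
    "#Date: 2013-10-01 20:20:09\r\n"
    "#Fields: date time s-ip\r\n"
    "2013-10-01 20:20:09 10.0.0.5"
)
cleaned = strip_header_block(with_header)

# Only three '#' lines: the pattern does not match, so nothing is removed.
three_lines = "#A\n#B\n#C\n2013-10-01 20:20:09 10.0.0.5"
untouched = strip_header_block(three_lines)
```

Note that a header block with fewer than four # lines is left in place, so this approach depends on the header always having the same shape.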