Hi,
It seems that a log file containing the CTRL-M character causes duplicate parsing on the Splunk indexer, so I would like to filter out all events that contain this character. Please advise on how to set this up in props.conf.
Thanks in advance!
e.g.
<?xml version="1.0" encoding="ISO-8859-1"?>^M
<DATA>^M
The Carriage Return character (^M) does not cause duplicate indexing with Splunk, so you likely have some other problem.
The way Splunk handles lines is to split them on any sequence of carriage returns or linefeeds (also called newlines). These characters are sometimes written ^M and ^J, or CR and LF, or NL. Splunk doesn't care which you have in the file; it will linebreak on any quantity of sequential characters of either type.
Therefore, unless you have a custom LINEBREAKER setting, these characters are gone by the time we get to event merging and so on.
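For reference, this is Splunk's shipped default in props.conf; you would only need to set it yourself if you had overridden it. It consumes any run of CR/LF bytes at the event boundary, which is why the ^M characters never make it into the indexed event:

[app_log]
LINEBREAKER = ([\r\n]+)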
Meanwhile, the strategy used by the tailing processor, which reads logfiles, doesn't care about the particular bytes in your files. It just reads chunks of bytes and hashes the start and end of the file. Most likely, since your file is XML, the end of the file is being rewritten: the close tag is replaced with new event text.
In other words your file is probably going from:
...
<event10>event text</event10>
</data>
to
...
<event10>event text</event10>
<event11>event text</event11>
</data>
Since the bytes for the </data> close tag get replaced with the bytes for event eleven, the hash of the end of the file changes, and the file is considered to contain new content.
Workarounds involve monitoring the file after it is complete, or modifying the application to not write out a close tag until the logfile is complete and will no longer be written to.
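As a sketch of the first workaround (the done/ directory here is hypothetical; it assumes the application can move finished logs aside), a batch input reads each completed file exactly once and then deletes it:

[batch:///local/0/lns/home/prod/log/done/*.log]
move_policy = sinkhole
sourcetype = app_log
index = rsch_app

Note that move_policy = sinkhole removes files after indexing, so only point it at copies you can afford to lose.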
Thank you very much for the detailed explanation.
I agree with your point: the Carriage Return character (^M) does not cause duplicate indexing.
The duplicate events seem to be related to something else.
I found this error in the forwarder's splunkd.log:
11-13-2014 10:51:01.063 -0500 INFO WatchedFile - Checksum for seekptr didn't match, will re-read entire file='/local/0/lns/home/prod/log/jobs.log'.
11-13-2014 10:51:01.063 -0500 INFO WatchedFile - Will begin reading at offset=0 for file='/local/0/lns/home/prod/log/jobs.log'.
So I tried to modify inputs.conf as below, but the issue of duplicate events still persists. Do you have any insight?
[monitor:///local/0/lns/home/prod/log/jobs.log]
host = nylnslxprd01
sourcetype = app_log
index = rsch_app
crcSalt = <SOURCE>
followTail = 1
It sure looks like the problem I expected is happening. "seekptr didn't match" means one of two things:
1. The file was truncated or rotated, so the bytes at the remembered offset are no longer the same.
2. The file's bytes were changed after they were written.
Given that it's an XML file, 2 is by far the most probable, as described above.
The log file is rotated; that's why the checksum is not the same.
Any reason why using the crcSalt attribute is not working?
[monitor:///local/0/lns/home/prod/log/jobs.log]
sourcetype = app_log
index = rsch_app
crcSalt=<SOURCE>
followTail = 1
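As I understand it (this is my reading, not something confirmed above), crcSalt = <SOURCE> only mixes the source pathname into the initial CRC, so it distinguishes different files that happen to share an identical first block. It does nothing when the bytes of one and the same file change after being written, which is what the seekptr mismatch indicates here.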
Thanks again for the information. I think it's number 2 -- the file bytes are being changed after they are written.
The log file is updated frequently, roughly 20 events per second. Is that the root cause? If the answer is yes, is there any remedy? Here are the timestamps of consecutive events within a single second:
2014-11-20 11:44:59,029
2014-11-20 11:44:59,065
2014-11-20 11:44:59,070
2014-11-20 11:44:59,071
2014-11-20 11:44:59,377
2014-11-20 11:44:59,396
2014-11-20 11:44:59,396
2014-11-20 11:44:59,543
2014-11-20 11:44:59,573
2014-11-20 11:44:59,578
2014-11-20 11:44:59,578
2014-11-20 11:44:59,886
Here is the stanza defined in props.conf on the indexer:
[app_log]
TZ = America/New_York
NO_BINARY_CHECK = 1
pulldown_type = 1
BREAK_ONLY_BEFORE = \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2},\d{3}
TIME_FORMAT = %Y-%m-%d %H:%M:%S
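One hedged refinement, a sketch assuming Splunk's %3N subsecond directive (it will not by itself stop the re-reading described above): the sample events carry milliseconds after a comma, so the timestamp extraction can be told about them explicitly:

[app_log]
TIME_FORMAT = %Y-%m-%d %H:%M:%S,%3N
MAX_TIMESTAMP_LOOKAHEAD = 23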
As an aside, I usually see the ^M character show up in files written by Windows systems and transferred to Unix. I get rid of those characters with the dos2unix command (e.g., dos2unix ctrlmfile newfile). Is it possible to do that before you index the file?
If not, and you are planning to discard the events with ^M characters entirely, then you'll need a props.conf/transforms.conf change to route those events to the nullQueue. See the following answer, which may help guide you:
http://answers.splunk.com/answers/108326/regex-and-nullqueue-problem.html
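For illustration, a minimal sketch of that nullQueue routing, using the app_log sourcetype from this thread (the stanza name drop_ctrl_m is made up for the example):

props.conf:
[app_log]
TRANSFORMS-drop_crlf = drop_ctrl_m

transforms.conf:
[drop_ctrl_m]
REGEX = \r
DEST_KEY = queue
FORMAT = nullQueue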
Indeed, tossing the "header" lines is a reasonable thing to do.
Heh, I wasn't even paying attention to what the example data was. Yeah, you probably don't want to remove that line entirely. A SEDCMD regex could be used to zap the ^M characters without burning the whole line from the event. But your other answer below looks like a good route to explore first.
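A sketch of that SEDCMD approach in props.conf, again using the app_log sourcetype from this thread (the class name strip_cr is made up):

[app_log]
SEDCMD-strip_cr = s/\r//g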
The main thrust is that the ^M characters are probably gone before you can try to zap them.