Getting Data In

How to configure props.conf to filter out events with CTRL-M (^M)?

shangshin
Contributor

Hi,
It seems that log files containing the CTRL-M character cause duplicate parsing on the Splunk indexer, so I would like to filter out all events that contain this character. Please advise on how to set this up in props.conf.

Thanks in advance!

e.g.

<?xml version="1.0" encoding="ISO-8859-1"?>^M
<DATA>^M

jrodman
Splunk Employee

The Carriage Return character (^M) does not cause duplicate indexing with Splunk, so you likely have some other problem.

The way Splunk handles lines is to split the stream on any sequence of carriage returns or linefeeds (collectively, newlines). These characters are sometimes written ^M and ^J, or CR and LF (or NL). Splunk doesn't care which you have in the file; it will break lines on any quantity of sequential characters of either type.

Therefore, unless you have a custom LINEBREAKER setting, these characters are gone by the time we get to event merging and so on.
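
For reference, this stock default can be written out explicitly in props.conf like so (the sourcetype name is just an illustration, not from this thread):

```ini
# Splunk's default line breaking, made explicit: break the raw stream on
# any run of CR and/or LF bytes, so ^M characters never survive into events.
[your_sourcetype]
LINE_BREAKER = ([\r\n]+)
```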

Meanwhile, the tailing processor that reads log files doesn't care about the particular bytes in your files. It just reads chunks of bytes and hashes the start and end of the file. Most likely, since your file is XML, the end of the file is being rewritten: the close tag is replaced with new event text.

In other words your file is probably going from:

...
<event10>event text</event10>
</data>

to

...
<event10>event text</event10>
<event11>event text</event11>
</data>

Since the bytes for the </data> close tag get replaced with the bytes for event eleven, the hash for the end of the file changes, and the file is considered to contain new content.

Workarounds involve monitoring the file after it is complete, or modifying the application to not write out a close tag until the logfile is complete and will no longer be written to.
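
As a sketch of the first workaround, you could have the application rename a finished log and monitor only the renamed files; the path, suffix, and rename convention here are assumptions, not anything from this thread:

```ini
# inputs.conf sketch: the writer renames a finished log from foo.xml to
# foo.done.xml; Splunk monitors only the renamed, no-longer-rewritten files,
# so the trailing </data> tag never changes underneath the tailing processor.
[monitor:///var/log/app/*.done.xml]
sourcetype = app_log
```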


shangshin
Contributor

Thank you very much for the detailed explanation.

I agree with your point -- the carriage return character (^M) does not cause duplicate indexing.

The duplicate indexing seems to be related to something else.

I found the error on forwarder's splunkd.log

11-13-2014 10:51:01.063 -0500 INFO  WatchedFile - Checksum for seekptr didn't match, will re-read entire file='/local/0/lns/home/prod/log/jobs.log'.
11-13-2014 10:51:01.063 -0500 INFO  WatchedFile - Will begin reading at offset=0 for file='/local/0/lns/home/prod/log/jobs.log'.

So I tried to modify inputs.conf as below, but the issue of duplicate events still persists. Do you have any insight?

[monitor:///local/0/lns/home/prod/log/jobs.log]
host = nylnslxprd01
sourcetype = app_log
index = rsch_app
crcSalt=
followTail = 1

jrodman
Splunk Employee

It sure looks like the problem I expected is happening. "seekptr didn't match" means one of two things:

  1. you have multiple files with the same exact initial 256 bytes.
  2. the file bytes are being changed after they are written.

Given that it's an XML file, 2 is by far the most probable as described above.


shangshin
Contributor

The log file is rotated, so that's why the checksum is not the same.
Any reason the crcSalt attribute is not working?

[monitor:///local/0/lns/home/prod/log/jobs.log]
sourcetype = app_log
index = rsch_app
crcSalt=<SOURCE>
followTail = 1
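
If the real issue were rotated files sharing an identical first-256-byte header (for example, a fixed XML prolog), one alternative to crcSalt would be to widen the initial checksum window with initCrcLength so the CRC covers enough bytes to differ between files. This is a sketch under that assumption; the value 1024 is arbitrary:

```ini
[monitor:///local/0/lns/home/prod/log/jobs.log]
sourcetype = app_log
index = rsch_app
# Hash the first 1024 bytes instead of the default 256, so files that
# share an identical header still produce distinct CRCs.
initCrcLength = 1024
```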

shangshin
Contributor

Thanks again for the information. I think it's number 2 -- the file bytes are being changed after they are written.

The log file is updated frequently, roughly 20 events per second.
Is that the root cause?

If the answer is yes, any remedy?

2014-11-20 11:44:59,029
2014-11-20 11:44:59,065
2014-11-20 11:44:59,070
2014-11-20 11:44:59,071
2014-11-20 11:44:59,377
2014-11-20 11:44:59,396
2014-11-20 11:44:59,396
2014-11-20 11:44:59,543
2014-11-20 11:44:59,573
2014-11-20 11:44:59,578
2014-11-20 11:44:59,578
2014-11-20 11:44:59,886


shangshin
Contributor

Here is the stanza defined in props.conf on the indexer:

[app_log]
TZ = 'America/New_York'
NO_BINARY_CHECK = 1
pulldown_type = 1
BREAK_ONLY_BEFORE = \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2},\d{3}
TIME_FORMAT = %Y-%m-%d %H:%M:%S

jeremiahc4
Builder

As an aside, I usually see the ^M character show up in files written by Windows systems and transferred to Unix. I get rid of those characters using a dos2unix command (i.e. dos2unix ctrlmfile newfile). Is it possible to do that before you index the file?

If not, and you are planning to discard the events with ^M characters entirely, then you'll need a props.conf/transforms.conf configuration change to route those events to nullQueue. See the following answer, which may help guide you.

http://answers.splunk.com/answers/108326/regex-and-nullqueue-problem.html
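
For illustration, a sketch of that nullQueue routing; the transform name is made up, and (per the discussion above) this only has an effect if ^M actually survives into your events, e.g. with a custom LINE_BREAKER:

```ini
# props.conf
[app_log]
TRANSFORMS-drop_cr = drop_ctrl_m_events

# transforms.conf
[drop_ctrl_m_events]
# Discard any event whose raw text still contains a carriage return.
REGEX = \r
DEST_KEY = queue
FORMAT = nullQueue
```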

jrodman
Splunk Employee

Indeed, tossing the "header" lines is a reasonable thing to do.


jeremiahc4
Builder

Heh, I wasn't even paying attention to what the example data was. Yeah, you probably don't want to remove that line entirely. A SEDCMD could be used to zap the ^M characters without burning the whole line from the event. But your other answer below looks like a good route to explore first.
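
A sketch of that SEDCMD approach (the class name after SEDCMD- is arbitrary, and again this only matters if the ^M bytes actually reach the parsing pipeline):

```ini
# props.conf: strip carriage returns from the event text instead of
# discarding the whole event.
[app_log]
SEDCMD-strip_cr = s/\r//g
```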


jrodman
Splunk Employee

The main thrust is that the ^M characters are probably gone before you can try to zap them.
