Solved: Why indexing removes carriage return characters (0...

hannus · ‎09-13-2016

Example data in a file which should become a multi-line event:
111111
222222

Both lines end with CR+LF (0x0d+0x0a). This is on Windows 7.

I created a new index for this and imported the data file using the Add Data wizard in Splunk Enterprise. I even let it use the defaults: I did not specify a source type and let the wizard create one. Then I opened the file "0" in the "rawdata" folder with a hex editor. There I can see that 0x0d has been removed, while 0x0a is still in the raw data file. The carriage return is removed; the newline is not.
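For anyone who wants to repeat the hex-editor check without opening an editor, here is a minimal Python sketch that tallies line endings in a byte string. The function name `count_line_endings` and the sample bytes are made up for illustration; it assumes the bytes you feed it are uncompressed.

```python
def count_line_endings(data: bytes) -> dict:
    """Tally CR+LF pairs, lone CRs and lone LFs in a byte string."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        # Every CR+LF pair contains one CR and one LF, so subtract the pairs.
        "lone_cr": data.count(b"\r") - crlf,
        "lone_lf": data.count(b"\n") - crlf,
    }


# Hypothetical sample: the two-line file from the question, before and
# after the carriage returns have been dropped.
source_bytes = b"111111\r\n222222\r\n"
stored_bytes = b"111111\n222222\n"

print(count_line_endings(source_bytes))
print(count_line_endings(stored_bytes))
```

To run it on a real file, read the file with `open(path, "rb")` so that Python does not translate the line endings itself.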

Is this normal Splunk Enterprise behavior? Or do I have some setting that causes it? I can't figure this out...

Thanks in advance!

jkat54 · ‎09-13-2016

You need to study what is called event breaking.

By default Splunk breaks your data into individual events. The default line breaker is ([\r\n]+), and everything matched by the capture group is discarded.

If you want to preserve the format you need to customize your props.conf. Something like this might work:

[sourcetypeName]
SHOULD_LINEMERGE=true
MUST_BREAK_AFTER= Randomstring
TRUNCATE=9999999

I think there is also a specific props.conf setting that uses end of file as the line breaker or must-break-before point.
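The "capture group is discarded" behaviour described above can be modelled in a few lines of Python. This is only a simplified sketch of the idea, not Splunk's actual implementation: `break_events` is a made-up helper, and it covers the line-breaking step alone (no line merging, no truncation).

```python
import re

# Same pattern as the default quoted above, compiled for bytes.
DEFAULT_LINE_BREAKER = re.compile(rb"([\r\n]+)")


def break_events(raw: bytes, breaker: "re.Pattern[bytes]") -> list:
    """Split raw data at each match and drop the text matched by group 1."""
    events = []
    start = 0
    for match in breaker.finditer(raw):
        # Keep everything before the capture group; skip the group itself.
        events.append(raw[start:match.start(1)])
        start = match.end(1)
    if start < len(raw):
        events.append(raw[start:])
    return events


raw = b"111111\r\n222222\r\n"
print(break_events(raw, DEFAULT_LINE_BREAKER))
# A breaker that never matches leaves the data as one untouched piece
# in this model.
print(break_events(raw, re.compile(rb"(Randomstring)")))
```

In this model the CR and LF bytes disappear only because they sit inside the capture group. Real indexing has more stages than this, so treat it as an illustration of the capture-group rule only.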

hannus · ‎09-15-2016

I just can't make this work.

I understand what LINE_BREAKER and its default value mean. It is clearly meant for log files where every line ends with some combination of CR and LF, and it removes them so that only the real data is kept, not the end-of-line characters. So I thought that if I set LINE_BREAKER to some random string, Splunk would never find it and would keep the CR and LF instead of replacing them. But no, it changes them anyway.

So I guess there is something (actually a lot, since I'm a newbie) that I'm not understanding. I wonder if anyone could solve this.

jkat54 · ‎09-15-2016

Where is the props.conf and how are you ingesting the data?

hannus · ‎09-15-2016

While working on this problem I'm using the "Add Data" wizard from the main UI. I'm on Splunk Enterprise version 6.4.1.

props.conf:

[A_MyTestSourcetype]
NO_BINARY_CHECK = false
category = Custom
description = Testing CR
pulldown_type = 1
disabled = false
SHOULD_LINEMERGE = false
LINE_BREAKER = (ABabABab)
TRUNCATE = 9999999

File (for example):

111 cr 11111 crlf
22 cr 222222 crlf
33333333 crlf
444 lf 44444

In this, the best case so far, the CRs inside a line are kept but the ones at the end of a line are removed.

Thanks for taking the time to help!

jkat54 · ‎09-15-2016

How about this:

[mysinglefilesourcetype]
SHOULD_LINEMERGE = false
LINE_BREAKER = ((*FAIL))
TRUNCATE = 99999999

https://answers.splunk.com/answers/106075/each-file-as-one-single-splunk-event.html

hannus · ‎09-16-2016

That LINE_BREAKER = ((*FAIL)) seems to do the trick. Splunk now indexes the imported file correctly (as seen in the "0" file in the "rawdata" folder).
I suppose you don't know how to export data exactly as it is in the index file...? My export tests (from the GUI) show:
CR -> CR (ok)
LF -> LF (ok)
LF+CR -> LF+CR (ok) BUT
CR+LF -> LF (fail)
I need to do some more digging on exporting data... I suppose I will create a whole new question for this. Thank you very much for your help on this!
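A quick way to pin down where an export starts to differ from the original is to compare the two byte-for-byte. The following is a minimal Python sketch; `first_difference` and the sample bytes are hypothetical and only mirror the CR+LF -> LF case from the list above.

```python
def first_difference(original: bytes, exported: bytes, context: int = 4):
    """Return (offset, original_hex, exported_hex) around the first
    differing byte, or None if the two byte strings are identical."""
    limit = min(len(original), len(exported))
    for i in range(limit):
        if original[i] != exported[i]:
            break
    else:
        # No mismatch in the common prefix: either equal or one is longer.
        if len(original) == len(exported):
            return None
        i = limit
    lo = max(0, i - context)
    hi = i + context
    return i, original[lo:hi].hex(" "), exported[lo:hi].hex(" ")


# Hypothetical sample for the failing case: CR+LF in, LF out.
print(first_difference(b"33333333\r\n", b"33333333\n"))
```

Read both files with `open(path, "rb")` before comparing, otherwise Python's own newline handling can hide the difference you are looking for.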

jkat54 · ‎09-16-2016

Yeah, unfortunately I claim no expertise on the file export issue. Can you open another question with just that in it, and I'll upvote / me-too it? I think you'll want to submit a ticket for that. You might also consider exporting via the API to see if the behavior is different.
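As a rough sketch of what "exporting via the API" could look like, the Python below builds a request for the REST export endpoint (`/services/search/jobs/export` on the management port). The base URL, search string and credentials are placeholders, and `build_export_request` is a made-up helper. Whether `output_mode=raw` keeps CR+LF intact is exactly the open question, so this is something to try, not a confirmed fix.

```python
import base64
import urllib.parse
import urllib.request


def build_export_request(base_url, search, username, password):
    """Build (but do not send) a POST request for the search export endpoint."""
    body = urllib.parse.urlencode(
        {"search": search, "output_mode": "raw"}
    ).encode("ascii")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/services/search/jobs/export",
        data=body,
        method="POST",
    )
    request.add_header("Authorization", "Basic " + token)
    return request


# Placeholder values: adjust host, index name and credentials to your setup.
request = build_export_request(
    "https://localhost:8089", "search index=cr_test", "admin", "changeme"
)
print(request.get_method(), request.full_url)
```

Sending it would be `urllib.request.urlopen(request)` and reading the response as bytes. A default install typically uses a self-signed certificate, so you may need to pass an `ssl` context for the call to succeed.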

Do you mind marking my answer as the answer to your main question?

Thanks,
Michael

hannus · ‎09-16-2016

I am currently evaluating the product without a paid license, so I guess I'm not in a position to submit a ticket...

jkat54 · ‎09-15-2016

Also, LINE_BREAKER must have a capture group that will be discarded. For example:

(RandomStringThatDoesntOccurInYourData)

hannus · ‎09-13-2016

And while this is out in the open: why does Splunk add a newline (0x0a) character at the end of the export (at least from the GUI)? I need to get data in and out unchanged!

Why indexing removes carriage return characters (0x0d)?
