Understanding LINE_BREAKER regexes

stevesq
Explorer

I'm trying to wrap my head around LINE_BREAKER regexes, especially WRT whitespace handling and wildcard matching.

Given a file containing:

y z
xx1
xx2
y z
xx3
xx4
y y
xx5
xx6

And applying either of:

LINE_BREAKER = ([\r\n]+)(?:y\s+z)

LINE_BREAKER = ([\r\n]+)(?:y.*?z)

Splunk will make a new event at "y y", even though I don't want it to. In other words,

I expect:

y z
xx1
xx2

y z
xx3
xx4
y y
xx5
xx6

But Splunk actually produces:

y z
xx1
xx2

y z
xx3
xx4

y y
xx5
xx6

Presumably it's matching the "y\s+" / "y.*?" and deciding to break on that line. What am I missing? How can I get it to recognize the "z" in the regex?

woodcock
Esteemed Legend

Did you set SHOULD_LINEMERGE = false? This should work:

LINE_BREAKER = ([\r\n]+)y\s+z
SHOULD_LINEMERGE = false

hexx
Splunk Employee

This appears to be one of those elusive cases where LINE_BREAKER fails where BREAK_ONLY_BEFORE succeeds...

I was able to reproduce the problem you report from your test data with LINE_BREAKER. However, using:

BREAK_ONLY_BEFORE = y\s+z

...I get the expected results. Give BREAK_ONLY_BEFORE a try and let us know if it works. Remember to remove the ([\r\n]+) capture group as BREAK_ONLY_BEFORE doesn't need it.

From props.conf.spec :

BREAK_ONLY_BEFORE = 
* When set, Splunk creates a new event only if it encounters a new line that matches the
regular expression.
* Defaults to empty.
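
To make the spec wording concrete, here is a minimal Python sketch of the "new event only at a matching line" rule, run against the test data from the question. It is only an illustration of the rule as quoted above, not Splunk's implementation: the function name merge_lines is made up, Python's re module stands in for Splunk's regex engine, and every other line-merging setting is ignored.

```python
import re

def merge_lines(lines, break_only_before):
    """Hypothetical helper: start a new event only at lines matching the regex."""
    pattern = re.compile(break_only_before)
    events = []
    for line in lines:
        # The very first line always opens an event; afterwards only a
        # matching line does. Everything else joins the current event.
        if not events or pattern.search(line):
            events.append([line])
        else:
            events[-1].append(line)
    return ["\n".join(event) for event in events]

# Test data from the question.
lines = "y z\nxx1\nxx2\ny z\nxx3\nxx4\ny y\nxx5\nxx6".splitlines()
events = merge_lines(lines, r"y\s+z")

for event in events:
    print(repr(event))
```

Under these assumptions the "y y" line does not match y\s+z, so it stays inside the second event.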

I have opened a bug (SPL-41430) to have our developers take a look at this issue.

UPDATE: As Masa stated, if you are using LINE_BREAKER, you must use SHOULD_LINEMERGE = false. The test file is properly line-broken with the following configuration:


LINE_BREAKER = ([\r\n]+)y\s+z
SHOULD_LINEMERGE = false

Masa
Splunk Employee

In general, I would use the same solution that hexx suggested.

(I could not add a comment, so I'm using another answer field.)


Additional Info:

Splunk processes a stream of data as follows:

  1. Break the stream into single lines.
    LINE_BREAKER is used here.
    (At this point, Splunk does not know whether an event is a single line or not.)

  2. Check whether multiple lines need to be merged into one event.
    SHOULD_LINEMERGE, BREAK_ONLY_BEFORE, etc. work here.
    (At this point, Splunk recognizes each event as either multi-line or single-line.)
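
Step 1 can be approximated in a few lines of Python to check what the regex alone does with the test data. This is a rough sketch, not Splunk's code: break_events is a hypothetical name, Python's re module stands in for Splunk's regex engine, and step 2 (line merging) is left out entirely. It relies on the props.conf.spec description of LINE_BREAKER: the previous event ends where the first capture group starts, the next event begins where that group ends, and the group's text is discarded.

```python
import re

# Test data from the question.
SAMPLE = "y z\nxx1\nxx2\ny z\nxx3\nxx4\ny y\nxx5\nxx6\n"

def break_events(text, line_breaker):
    """Hypothetical helper: cut text at capture group 1 of each match."""
    events = []
    start = 0
    for match in re.finditer(line_breaker, text):
        events.append(text[start:match.start(1)])  # event ends before group 1
        start = match.end(1)                       # next event starts after it
    tail = text[start:].rstrip("\r\n")
    if tail:
        events.append(tail)
    return events

events = break_events(SAMPLE, r"([\r\n]+)y\s+z")
for event in events:
    print(repr(event))
```

In this sketch the regex produces only two events and never breaks before "y y", which fits the idea that the extra break comes from the merge step rather than from the regex.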

I think it's possible that, in your case, the problem occurred at line-merge time.
Also, a lookahead (?=) would be more appropriate here than a non-capturing group (?:).

So, there is an alternative solution:

LINE_BREAKER = ([\r\n]+)(?=y\s+z)
SHOULD_LINEMERGE = false
LEARN_MODEL = false

I did a quick test with this, and it worked for me.
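
For anyone wondering what the lookahead changes, here is a small Python comparison of the two forms. It uses Python's re module as a stand-in for Splunk's regex engine, which is an assumption, and the variable names are made up for the example. Both constructs behave the same way in either engine for patterns this simple.

```python
import re

text = "xx2\ny z\nxx3"

# Non-capturing group: "y z" is consumed as part of the overall match.
consuming = re.search(r"([\r\n]+)(?:y\s+z)", text)

# Lookahead: "y z" must follow, but it is not part of the match.
lookahead = re.search(r"([\r\n]+)(?=y\s+z)", text)

print(repr(consuming.group(0)), consuming.span(1))
print(repr(lookahead.group(0)), lookahead.span(1))
```

The first capture group, which is what LINE_BREAKER uses to place the break, covers the same newline in both cases. The difference is only that the lookahead leaves "y z" unconsumed, so the overall match stops at the break itself.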

If this does not work, the learned app may contain an automatically generated props.conf configuration for this data.
In that case, delete that part from $SPLUNK_HOME/etc/apps/learned/local/props.conf.
