Why does BREAK_ONLY_BEFORE work while LINE_BREAKER doesn't?

jeffland
SplunkTrust

I'm trying to work out some sourcetype settings. The events look like this:

2015.07.13 08:38:47: system,DEBUG: <<SomeListener>> Thread-foo/host.something.Listener$Method$1@somewhere setting set to false
2015.07.13 08:38:47: system,DEBUG: <<SomeRule>> [ID(ID_aswell)] .method() config=[. fooDO: OID=digits
2015.07.13 08:38:47: system,DEBUG+ . GATEWAYID=. . GatewayDO: OID=digits
2015.07.13 08:38:47: system,DEBUG+ . . SHORTNAME=foo
2015.07.13 08:38:47: system,DEBUG+ . . LONGNAME=FOObar

Those are multiline events, so I only want to break when there's a : after the DEBUG, which led me to the following regex:

(\r\n)*\d{4}\.\d{2}\.\d{2}\s\d{2}\:\d{2}\:\d{2}\:\s[^\,]+\,[^+:]+[\:]

When I use this with BREAK_ONLY_BEFORE, it works like a charm. However, I don't see the need to first break lines and then merge them if the same can also be accomplished by a linebreak alone. So I also tried the above regex with LINE_BREAKER, but then the data in the preview and after indexing becomes this:

015.07.13 08:38:47: system,DEBUG: <<SomeListener>> Thread-foo/host.something.Listener$Method$1@somewhere setting set to false
015.07.13 08:38:47: system,DEBUG: <<SomeRule>> [ID(ID_aswell)] .method() config=[. fooDO: OID=digits
015.07.13 08:38:47: system,DEBUG+ . GATEWAYID=. . GatewayDO: OID=digits
015.07.13 08:38:47: system,DEBUG+ . . SHORTNAME=foo
015.07.13 08:38:47: system,DEBUG+ . . LONGNAME=FOObar

with obviously wrong timestamps and some weird markup in the preview, but otherwise ok data (especially the multiline events are still preserved).

What's happening here?
On a side note, is my understanding correct that LINE_BREAKER is generally preferable to BREAK_ONLY_BEFORE from a processing load perspective? If so, why does the "Add Data" wizard in Splunk always use the latter, with no way to set LINE_BREAKER explicitly under "Advanced", so that you have to edit props.conf directly if you want to use it?

woodcock
Esteemed Legend

I cannot speak to why the wizard does what it does but I can explain what confuses most people about LINE_BREAKER. Once you redefine LINE_BREAKER from the default, it now has nothing to do with newlines, which means that "line" doesn't mean what you think it means, and so SHOULD_LINEMERGE doesn't, either. Generally, use LINE_BREAKER= and SHOULD_LINEMERGE = false together.
Splunk processes every stream of input data as follows:

•Break the stream into "lines" using LINE_BREAKER. The default LINE_BREAKER ([\r\n]+) never leaves a newline inside a "line", but yours probably does.
•Check whether we are done (SHOULD_LINEMERGE = false) or whether multiple "lines" must be merged into one event using BREAK_ONLY_BEFORE, etc. At this point, Splunk treats each event as either multi-"line" or single-"line" as defined by LINE_BREAKER, not as defined by newline character boundaries (as you are used to thinking).
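
For anyone following along, the two steps above can be sketched in a few lines of Python. This is only a simplified model of the idea, not Splunk's implementation: break_lines and merge_lines are hypothetical helper names, the sample lines are shortened copies of the ones in the question, and Python's re module stands in for Splunk's regex engine.

```python
import re

# Trimmed-down copies of the sample lines from the question.
LINES = [
    "2015.07.13 08:38:47: system,DEBUG: <<SomeListener>> setting set to false",
    "2015.07.13 08:38:47: system,DEBUG: <<SomeRule>> config=[. fooDO: OID=digits",
    "2015.07.13 08:38:47: system,DEBUG+ . GATEWAYID=. . GatewayDO: OID=digits",
    "2015.07.13 08:38:47: system,DEBUG+ . . SHORTNAME=foo",
]
STREAM = "\n".join(LINES)

# Timestamp, then "system,DEBUG:" (a colon, not a plus, after the level).
HEADER = r"\d{4}\.\d{2}\.\d{2}\s\d{2}:\d{2}:\d{2}:\s[^,]+,[^+:]+:"

def break_lines(stream, line_breaker):
    """Step 1 (hypothetical helper): cut the stream at every match.

    The text matched by the first capturing group is thrown away; what
    precedes it ends one "line" and what follows it starts the next.
    """
    pieces, start = [], 0
    for m in re.finditer(line_breaker, stream):
        pieces.append(stream[start:m.start(1)])
        start = m.end(1)
    pieces.append(stream[start:])
    return pieces

def merge_lines(lines, break_only_before):
    """Step 2 (hypothetical helper): start a new event only at matching lines."""
    events = []
    for line in lines:
        if events and not re.match(break_only_before, line):
            events[-1] += "\n" + line  # continuation of the current event
        else:
            events.append(line)
    return events

# Variant A: default line breaking, then merging with BREAK_ONLY_BEFORE.
events_a = merge_lines(break_lines(STREAM, r"([\r\n]+)"), HEADER)

# Variant B: a custom LINE_BREAKER does all the work, no merge step.
events_b = break_lines(STREAM, r"([\r\n]+)" + HEADER)

print(len(events_a), len(events_b))
```

In this model both variants end up with the same two events; variant B simply gets there in one pass, which is the reason for pairing a custom LINE_BREAKER with SHOULD_LINEMERGE = false.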

So the problem you are having is probably that you were using BOTH LINE_BREAKER= AND SHOULD_LINEMERGE = true (which is the default), which is why you needed to add BREAK_ONLY_BEFORE. If you use ONLY LINE_BREAKER= and SHOULD_LINEMERGE = false, then you should not need BREAK_ONLY_BEFORE. You should always set SHOULD_LINEMERGE explicitly, so that you are not relying on a mis-remembered default and so that the setting documents your intent; otherwise people may try to "help" you later by "fixing" it, which will break everything.

jeffland
SplunkTrust

Perhaps I didn't make it clear enough, but I used BREAK_ONLY_BEFORE and LINE_BREAKER separately, never together, and I also added SHOULD_LINEMERGE = false to the LINE_BREAKER version, because that defaults to true if I'm not mistaken. So I had these two configurations in my props.conf:

# A
[sourcetype]
NO_BINARY_CHECK = true
BREAK_ONLY_BEFORE = (\r\n)?\d{4}\.\d{2}\.\d{2}\s\d{2}\:\d{2}\:\d{2}\:\s[^\,]+\,[^+:]+[\:]

and

# B
[sourcetype]
NO_BINARY_CHECK = true
LINE_BREAKER = (\r\n)?\d{4}\.\d{2}\.\d{2}\s\d{2}\:\d{2}\:\d{2}\:\s[^\,]+\,[^+:]+[\:]
SHOULD_LINEMERGE = false

with the former working like it should and the second somehow removing the initial "2" of the year in the timestamp, thus messing everything up.

woodcock
Esteemed Legend

I see what could be the problem: you have a literal string instead of a character class for your line breaks. Why are you using (\r\n) instead of ([\r\n]+)? And why did you make the group optional with the question mark? I think you probably need this:

# A
[sourcetype]
NO_BINARY_CHECK = true
BREAK_ONLY_BEFORE = ([\r\n]+)\d{4}\.\d{2}\.\d{2}\s\d{2}\:\d{2}\:\d{2}\:\s[^\,]+\,[^+:]+[\:]

and

# B
[sourcetype]
NO_BINARY_CHECK = true
LINE_BREAKER = ([\r\n]+)\d{4}\.\d{2}\.\d{2}\s\d{2}\:\d{2}\:\d{2}\:\s[^\,]+\,[^+:]+[\:]
SHOULD_LINEMERGE = false
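
As a quick illustration of the difference between the two groups, here is a minimal sketch in Python. It assumes the log file might use LF-only line endings (the thread never states this), uses made-up sample data, and relies on Python's re module rather than Splunk's regex engine. It only shows what the first capturing group captures in each case; it does not reproduce what Splunk then does with that capture.

```python
import re

HEADER = r"\d{4}\.\d{2}\.\d{2}\s\d{2}:\d{2}:\d{2}:\s[^,]+,[^+:]+:"
OPTIONAL_CRLF = r"(\r\n)?" + HEADER   # group from the original settings
CHAR_CLASS = r"([\r\n]+)" + HEADER    # group from the suggested settings

# Hypothetical data: the same two events with LF and with CRLF endings.
lf_data = "first event\n2015.07.13 08:38:47: system,DEBUG: second event"
crlf_data = lf_data.replace("\n", "\r\n")

def captured(pattern, data):
    """Return what the first capturing group matched (None if it took no part)."""
    m = re.search(pattern, data)
    return m.group(1) if m else "no match"

for name, pattern in (("optional CRLF", OPTIONAL_CRLF), ("char class", CHAR_CLASS)):
    print(name, repr(captured(pattern, lf_data)), repr(captured(pattern, crlf_data)))
```

With LF-only data the optional (\r\n) group still lets the overall pattern match, but the group itself captures nothing, so there is no line-ending text for a line breaker to discard. The character class captures the line ending in both cases.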

jeffland
SplunkTrust

Ah! Yes, the issue was caused by not using a character class... well, more precisely, by the fact that my initial settings required exactly a carriage return followed by a newline. I feel pretty dumb for not noticing that myself. I don't exactly understand why this led to the described behavior, though.

I made the capturing group optional because I've had cases where two events weren't separated by a return/newline, and making the capturing group optional still made Splunk break them into two events. It became a habit; is there any downside to it?

By the way, I've further "simplified" the regex by using a group for the date and time pattern instead of the explicit two-digit-one-separator expression, so for anyone following this it now looks like this:

([\r\n]+)\d{2}(?:\d{2}.){6}\s[^\,]+\,[^+:]+[\:]
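
A small sanity check of the compact pattern against the explicit one, again as a Python sketch with shortened sample lines and Python's re module standing in for Splunk's regex engine. The last input is a made-up string, included only to show that the unescaped dot in the compact version accepts any separator character.

```python
import re

EXPLICIT = r"([\r\n]+)\d{4}\.\d{2}\.\d{2}\s\d{2}\:\d{2}\:\d{2}\:\s[^\,]+\,[^+:]+[\:]"
COMPACT = r"([\r\n]+)\d{2}(?:\d{2}.){6}\s[^\,]+\,[^+:]+[\:]"

# Each candidate starts with the newline that would precede it in the stream.
header = "\n2015.07.13 08:38:47: system,DEBUG: <<SomeRule>> config=["
continuation = "\n2015.07.13 08:38:47: system,DEBUG+ . . SHORTNAME=foo"
odd_separators = "\n2015x07x13x08x38x47x system,DEBUG: made-up line"

def breaks_before(pattern, text):
    """True if the pattern would place an event break at the start of text."""
    return re.match(pattern, text) is not None

for label, text in (("header", header),
                    ("continuation", continuation),
                    ("odd separators", odd_separators)):
    print(label, breaks_before(EXPLICIT, text), breaks_before(COMPACT, text))
```

Both patterns agree on the lines from the question. The compact one is looser, because `.` matches any character, so it also accepts the made-up line with "x" separators; escaping is not an option there since the same position has to match ".", " " and ":".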