Splunk Search

How can I set a custom linemerge sourcetype event-break regex to ensure that an entire event is ready to be captured?

bestpa
Explorer

Hi everyone,

I have a monitored file that is appended to by a cron job. Sometimes Splunk reads the file in the middle of a >> append and cuts an event off before it's complete. This results in a split event.
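
Roughly, the setup looks like this (the script path is simplified; the five-minute schedule matches the sample timestamps below):

 # crontab entry: the poller writes its 7 lines over 1-10 seconds,
 # so Splunk's file monitor can catch the event half-written
 */5 * * * * /opt/scripts/poller.sh >> /data/poller/poller_v1.log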

To get around this, I'm trying to write a regex that will only "fire" an event break once I know the last line of the multi-line event has been written to the file.

Here's an example with 2 raw entries:

Start: Mon, 11 Apr 2016 13:26:01 -0400
Starting up poller script.  
Database connection established.
Garbage line found, likely the header of the csv.  I'm not going to insert.
Affected Rows: 5 
Parsed total lines: 1009
End: Mon, 11 Apr 2016 13:26:02 -0400
Start: Mon, 11 Apr 2016 13:31:01 -0400
Starting up poller script.  
Database connection established.
Garbage line found, likely the header of the csv.  I'm not going to insert.
Affected Rows: 5 
Parsed total lines: 1009
End: Mon, 11 Apr 2016 13:31:01 -0400

If I take that data and use the regex /(?s)Start:.*?End:/gm at regex101, I get exactly what I want. However, I can't get an ungreedy multiline match to work in Splunk's event-break regex pattern.

And no, I won't accept answers like "output your cron job to a temp location, then append it all at once to the file you are monitoring".

0 Karma
1 Solution

bestpa
Explorer

I resolved this and have tested it for over 75 polls now.

The trick is a mysterious on/off inputs.conf setting:

multiline_event_extra_waittime = true

There's really no documentation on it other than the comments in the shipped inputs.conf spec file:

multiline_event_extra_waittime = [true|false]
* By default, Splunk Enterprise sends an event delimiter when (1) it reaches EOF of a file it monitors and (2) the last character it reads is a newline.
* In some cases, it takes time for all lines of a multiple-line event to arrive.
* Set to true to delay sending an event delimiter until the time that Splunk Enterprise closes the file, as defined by the time_before_close attribute, to allow all event lines to arrive.
* Defaults to false.

In practice this pairs with time_before_close: Splunk only treats the file as closed after time_before_close seconds with no new data, and with the extra wait time enabled it holds back the final event delimiter until then, so a half-written event gets up to 15 quiet seconds to finish. And so, my final solution for my data stream was:

props.conf

 root@server1:/opt/splunk/etc/apps/search/local# more props.conf 
 [custom_source_type_poller_v1]
 SHOULD_LINEMERGE = true
 LINE_BREAKER = ([\r\n]+)(?=Start: \w+, \d+)
 TIME_PREFIX = ^Start:\s
 TIME_FORMAT = %a, %d %b %Y %H:%M:%S %z
 MAX_TIMESTAMP_LOOKAHEAD = 31
 NO_BINARY_CHECK = true
 category = Custom
 pulldown_type = 1
 disabled = false

inputs.conf

 root@server1:/opt/splunk/etc/apps/search/local# more inputs.conf 
 [monitor:///data/poller/poller_v1.log]
 disabled = false
 index = poller_index
 sourcetype = custom_source_type_poller_v1
 time_before_close = 15
 multiline_event_extra_waittime = true
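
For anyone adapting this: in LINE_BREAKER, the first capturing group is the text Splunk discards as the boundary between events, and the lookahead matches without consuming anything, so each "Start:" line stays at the head of its event. An annotated breakdown (comments mine):

 LINE_BREAKER = ([\r\n]+)(?=Start: \w+, \d+)
 #  ([\r\n]+)            required capture group: the run of newlines
 #                       consumed as the boundary between events
 #  (?=Start: \w+, \d+)  zero-width lookahead: break only where the next
 #                       text looks like "Start: Mon, 11", leaving the
 #                       Start: line inside the new event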

somesoni2
SplunkTrust

Try something like this:

inputs.conf

[monitor://.......]
...other settings...
sourcetype = yoursourcetype
time_before_close = 15      <-- time in seconds

props.conf

[yoursourcetype]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)(?=Start: \w+, \d+)
TIME_PREFIX = ^Start:\s
TIME_FORMAT = %a, %d %b %Y %H:%M:%S %z
MAX_TIMESTAMP_LOOKAHEAD = 31
0 Karma

bestpa
Explorer

No dice. My events are still getting split.

props.conf

root@server1:/opt/splunk/etc/apps/search/local# more props.conf 
[custom_source_type_poller_v1]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)(?=Start: \w+, \d+)
TIME_PREFIX = ^Start:\s
TIME_FORMAT = %a, %d %b %Y %H:%M:%S %z
MAX_TIMESTAMP_LOOKAHEAD = 31
NO_BINARY_CHECK = true
category = Custom
pulldown_type = 1
disabled = false

inputs.conf

root@server1:/opt/splunk/etc/apps/search/local# more inputs.conf 

[monitor:///data/poller/poller_v1.log]
disabled = false
index = poller_index
sourcetype = custom_source_type_poller_v1
time_before_close = 15
0 Karma

yoshispendiff
Explorer

Is it every event that's getting split, or just the first one? I've run into a problem using time_before_close where the first event after an arbitrary period of no messages would always split, no matter what I did. Messages shortly after that would work fine, though, even with exactly the same message or rule (i.e., if I triggered the cron job a second time with the same output, it would work fine).

There's definitely some strangeness going on in the way long-running events get sent/indexed in Splunk. I ended up just fixing it at search time instead, as I couldn't find a fix for it.
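
One sketch of such a search-time fix, using the Start:/End: markers from this thread (index and sourcetype names as posted above), is to stitch the split pieces back together with transaction:

 index=poller_index sourcetype=custom_source_type_poller_v1
 | transaction startswith="Start:" endswith="End:" maxspan=1m

It's a workaround rather than a fix, though, since transaction is relatively expensive at search time.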

0 Karma

javiergn
SplunkTrust

Hi, not sure if this is what you are trying to achieve, but here we go.

| stats count
| eval _raw = "
Start: Mon, 11 Apr 2016 13:26:01 -0400
Starting up poller script.  
Database connection established.
Garbage line found, likely the header of the csv.  I'm not going to insert.
Affected Rows: 5 
Parsed total lines: 1009
End: Mon, 11 Apr 2016 13:26:02 -0400
Start: Mon, 11 Apr 2016 13:31:01 -0400
Starting up poller script.  
Database connection established.
Garbage line found, likely the header of the csv.  I'm not going to insert.
Affected Rows: 5 
Parsed total lines: 1009
End: Mon, 11 Apr 2016 13:31:01 -0400
"
| rex max_match=0 "(?msi)Start:(?<my_event>.*?)End:"
| table my_event

Output:

my_event
--------------------------------------------------------------------------------------------------------
Mon, 11 Apr 2016 13:26:01 -0400 Starting up poller script. Database connection established. Garbage line found, likely the header of the csv. I'm not going to insert. Affected Rows: 5 Parsed total lines: 1009

Mon, 11 Apr 2016 13:31:01 -0400 Starting up poller script. Database connection established. Garbage line found, likely the header of the csv. I'm not going to insert. Affected Rows: 5 Parsed total lines: 1009 

Let me know if that helps.

Thanks,
J

0 Karma

javiergn
SplunkTrust

Ah, I see.
In that case, go to the Timestamp submenu instead and do the following:

Extraction: Advanced
Timestamp prefix (without the quotes but including the space after the colon): "Start: "

Leave everything else as default.
See picture below.

Hope that helps

[Screenshot: the Advanced timestamp settings on the Set Source Type page]
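
For reference, those wizard settings correspond to props.conf attributes like the ones already used in this thread:

 TIME_PREFIX = ^Start:\s
 TIME_FORMAT = %a, %d %b %Y %H:%M:%S %z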

0 Karma

bestpa
Explorer

I understand that this is the best way to do it on a static file, but my problem is in the context of a monitored file.

When the cron job's output is being appended to the monitored file (the cron job can take 1, 2, or up to 10 seconds to finish dumping those 7 lines of output), Splunk can "look" at the file midway through the event output.

If it finds a match at that exact moment, it cuts my event short.

By regex-matching the event's complete output (to ensure both the beginning and the end have been written in full), I wanted the event-break regex to prevent Splunk from deciding an event has occurred until the regex comes back "true".

That is why I am trying to tell it to only consider that an event break has occurred when it sees "Start:" at the top and "End:" at the bottom.

Here's an example of how splunk chops up my event using the normal timestamp method:

I can't seem to post pics yet, so here's the listed output of the indexed events. See how the event was broken up into 2) and 3)? My script had not yet fully written its output before Splunk picked up the "new event". When Splunk next found a match for an event break, it broke correctly, but it lumped the chunk of text between the last match and this match into the event between 2) and 4). Ergo, events 2) and 3) were split. Events 5 and 6 are perfect because I lucked out on when Splunk read the file and discovered the new (complete) events.

2   4/13/16
12:51:01.000 PM 
Start: Wed, 13 Apr 2016 12:51:01 -0400
Starting up poller script.  
Database connection established.
Garbage line found, likely the header of the csv.  I'm not going to insert.
host = server1 source = /data/poller/poller_v1.log sourcetype = custom_source_type_poller
3   4/13/16
12:46:01.000 PM 
Affected Rows: 5 
Parsed total lines: 1009
End: Wed, 13 Apr 2016 12:46:01 -0400
host = server1 source = /data/poller/poller_v1.log sourcetype = custom_source_type_poller
4   4/13/16
12:46:01.000 PM 
Start: Wed, 13 Apr 2016 12:46:01 -0400
Starting up poller script.  
Database connection established.
Garbage line found, likely the header of the csv.  I'm not going to insert.
host = server1 source = /data/poller/poller_v1.log sourcetype = custom_source_type_poller
5   4/13/16
12:41:01.000 PM 
Start: Wed, 13 Apr 2016 12:41:01 -0400
Starting up poller script.  
Database connection established.
Garbage line found, likely the header of the csv.  I'm not going to insert.
Affected Rows: 5 
Parsed total lines: 1009
End: Wed, 13 Apr 2016 12:41:02 -0400
host = server1 source = /data/poller/poller_v1.log sourcetype = custom_source_type_poller
6   4/13/16
12:36:01.000 PM 
Start: Wed, 13 Apr 2016 12:36:01 -0400
Starting up poller script.  
Database connection established.
Garbage line found, likely the header of the csv.  I'm not going to insert.
Affected Rows: 5 
Parsed total lines: 1009
End: Wed, 13 Apr 2016 12:36:02 -0400
host = server1 source = /data/poller/poller_v1.log sourcetype = custom_source_type_poller
0 Karma

bestpa
Explorer

Found some other attributes I can play with to hopefully solve the mystery...

http://docs.splunk.com/Documentation/Splunk/6.2.2/Data/Indexmulti-lineevents

MUST_BREAK_AFTER = <regex>
* When set and the regular expression matches the current line, Splunk Enterprise always creates a new event for the next input line.
* Splunk Enterprise might still break before the current line if another rule matches.
* Default: empty string

MUST_NOT_BREAK_AFTER = <regex>
* When set and the current line matches the regular expression, Splunk Enterprise does not break on any subsequent lines until the MUST_BREAK_AFTER expression matches.
* Default: empty string

MUST_NOT_BREAK_BEFORE = <regex>
* When set and the current line matches the regular expression, Splunk Enterprise does not break the last event before the current line.
* Default: empty string
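
For this thread's Start:/End: format, those could in principle combine like so (a sketch I have not tested; note these line-merging attributes only apply when SHOULD_LINEMERGE = true):

 [custom_source_type_poller_v1]
 SHOULD_LINEMERGE = true
 # after a Start: line, keep merging lines into the event...
 MUST_NOT_BREAK_AFTER = ^Start:
 # ...until an End: line has been consumed
 MUST_BREAK_AFTER = ^End: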

0 Karma

javiergn
SplunkTrust
SplunkTrust

That's good.
Please post your answer here if you find it, to help other users.

If you are still having problems, let us know.

Thanks,
J

0 Karma

bestpa
Explorer

Yup, I've been trying various combinations of data source settings, but with no success. The difficulty in playing with the settings is that what appears to work in the Add Data wizard's preview won't necessarily work in the wild, when Splunk intrudes midway through a cron job's output.

There have to be a zillion people with my problem. I will continue to try new things and look for related articles.

I stumbled upon another article that died without resolution as well. I tried the method there of setting the "time_before_close" attribute to 5, but still no luck.

https://answers.splunk.com/answers/188776/events-are-not-properly-split.html

0 Karma

bestpa
Explorer

Thanks for getting into this with me, J.

Your answer was in the context of a search in the Search app, where the raw string is provided inline. I tried your regex,

(?msi)Start:(?<my_event>.*?)End:

as well. However,

I'm stuck at the data-input Set Source Type step, where you can specify a line of regex to tell Splunk where the event break resides.

I tried your variation of the regular expression there; however, the result is the same. The preview shows a hit at the line where the match begins, but it doesn't stop when it hits the first "End:". I get about 257 lines in that event (each event should contain only 7 lines). Presumably this is because it keeps matching toward the bottom of my growing file and gets capped by MAX_EVENTS (which defaults to 256 lines, matching what I see).

I played with setting the greediness mode ((?U) inverts quantifier greediness, switching the question mark's function):
(?ms)Start:.*?End:
(?Ums)Start:.*End:
but it seems to be greedy either way.

I can visually see what I'm after on the regex101 website with my originally posted regex, but it doesn't seem to behave the same in Splunk's data-source regex field.

For the record, here is where I am entering the regex:

Add Data > Monitor > Files & Directories > Index Once (file.log) > Set Source Type > Event Breaks submenu > Regex... > the field to fill in the regex, with a preview on the right.

0 Karma