Getting Data In

Multiline XML extraction...

Path Finder

So I have an XML-formatted log added as a source, with sourcetype WSE_audit, and I'm trying to get it to split on the "event" element and do field extraction on each event. Here's some sample data:

<event rev="1.2">
<date>2010-05-12-15:17:27.279-07:00I-----</date>
<outcome status="0">0</outcome>
<originator blade="webseald" instance="clogin">
<component rev="1.4">authn</component>
<event_id>117</event_id>
<action>0</action>
<location>myhostname</location>
</originator>
<target resource="5">
<object></object>
</target>
<data>
<audit event="Start"/> </data>
</event>

in props.conf:

[WSE_audit]
LINE_BREAKER = .*(\<event\s+[^\>]*\>)
SHOULD_LINEMERGE = false
REPORT-wseaudit = xml-extr

in transforms.conf:

[xml-extr]
REGEX = \<(\w+)\>([^\<]+)\<\1\>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true

When I have these in place, these events simply never show up. I take them out, and the events start coming in. I've been staring at it for a while, and am sure I'm missing something stupid... What am I missing here?
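A side note on the transforms.conf REGEX as posted: it has no "/" before the backreference, so it looks for `<tag>value<tag>` rather than `<tag>value</tag>`. The slash may simply have been lost when the post was formatted. The sketch below uses Python's `re` module rather than Splunk itself, and the corrected pattern is an assumption about what was intended, not a confirmed fix for the missing events:

```python
import re

# Part of the sample event from the question.
sample = """<event rev="1.2">
<date>2010-05-12-15:17:27.279-07:00I-----</date>
<outcome status="0">0</outcome>
<event_id>117</event_id>
<action>0</action>
<location>myhostname</location>
</event>"""

# Pattern as posted (backslash escapes dropped; they are not needed here).
as_posted = re.compile(r'<(\w+)>([^<]+)<\1>')
# Assumed intent: a "/" before the backreference so it matches the closing tag.
with_slash = re.compile(r'<(\w+)>([^<]+)</\1>')

posted_pairs = as_posted.findall(sample)
fixed_pairs = with_slash.findall(sample)

print(posted_pairs)
print([name for name, _ in fixed_pairs])
```

Elements that carry attributes, such as `<outcome status="0">`, are skipped by both patterns, because `<(\w+)>` only matches a bare tag name.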

1 Solution

Splunk Employee

Your LINE_BREAKER is not a good regex to use. In particular, it's a bad idea to start it with a greedy .* match, because it will run all the way ahead to the end of the entire input stream (up to the limit, if there is one), and only then backtrack to the last <event before the end.

So you could just drop the .* from the regex.
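To see the greedy-match effect in isolation, here is a small sketch using Python's `re` module, not Splunk's own regex engine. The single-line `stream` is made up for illustration; in Python, `.` stops at newlines unless a flag says otherwise, so the sketch keeps everything on one line:

```python
import re

# Hypothetical single-line stream holding three events.
stream = (
    '<event rev="1.2"><action>0</action></event>'
    '<event rev="1.2"><action>1</action></event>'
    '<event rev="1.2"><action>2</action></event>'
)

# The original pattern, with its leading greedy .*
greedy = re.compile(r'.*(<event\s+[^>]*>)')
# The same pattern with the .* dropped.
plain = re.compile(r'(<event\s+[^>]*>)')

# Offsets at which each pattern finds an <event ...> tag.
greedy_starts = [m.start(1) for m in greedy.finditer(stream)]
plain_starts = [m.start(1) for m in plain.finditer(stream)]

print(greedy_starts)  # a single offset: the last <event ...> tag
print(plain_starts)   # one offset per <event ...> tag
```

With the leading .* the pattern matches only once, at the last tag; without it, every tag is found.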

BUT the contents of the first capture group ($1) in LINE_BREAKER are discarded, which I don't think you want. So

LINE_BREAKER = [\>\s]((?=\<event\s+[^\>]*\>))

is probably more like it. The [\>\s] should actually be unnecessary (and you might replace it with [.\s]), but there is what I would consider a bug (at least in 4.0): Splunk doesn't seem to break when the overall regex match ($0) is zero length, even if the match succeeds.
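A rough way to picture the difference is to simulate the "text matched by the first capture group is thrown away" rule in Python. The `split_events` helper below is hypothetical and only mimics the behavior described above; it is not Splunk's implementation, and the two-event stream is made up:

```python
import re

def split_events(stream, line_breaker):
    """Simulate breaking: text matched by group 1 is discarded, text before
    it stays with the previous event, text after it starts the next one."""
    events, pos = [], 0
    for m in re.finditer(line_breaker, stream):
        events.append(stream[pos:m.start(1)])
        pos = m.end(1)
    events.append(stream[pos:])
    return [e for e in events if e]

# Hypothetical stream: two events separated by a newline.
stream = (
    '<event rev="1.2"><event_id>117</event_id></event>\n'
    '<event rev="1.2"><event_id>118</event_id></event>\n'
)

# The tag sits inside the capture group, so it is thrown away.
lost = split_events(stream, r'(<event\s+[^>]*>)')
# The capture group holds only a lookahead, so it is empty and nothing is lost.
kept = split_events(stream, r'[>\s]((?=<event\s+[^>]*>))')

print(lost)
print(kept)
```

Because `<event\s+` requires whitespace after the tag name, `<event_id>` is not mistaken for an event boundary in either case.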



Path Finder

Using BREAK_ONLY_BEFORE and MUST_BREAK_AFTER, it works properly. And there were indeed line breaks in the data, and each event is usually somewhere between 10-20 lines.
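For anyone picturing how these two settings interact, here is a simplified Python sketch of line merging. The `merge_lines` helper is hypothetical and the two regex values (`<event\s` and `</event>`) are assumptions, since the exact values used are not shown above; real Splunk line merging has more rules than this:

```python
import re

def merge_lines(lines, break_only_before, must_break_after):
    """Merge lines into events: start a new event before a line matching
    break_only_before, and close the event after a line matching
    must_break_after."""
    events, current = [], []
    for line in lines:
        if current and re.search(break_only_before, line):
            events.append(current)
            current = []
        current.append(line)
        if re.search(must_break_after, line):
            events.append(current)
            current = []
    if current:
        events.append(current)
    return events

# Hypothetical input: two shortened multi-line events.
raw = """<event rev="1.2">
<event_id>117</event_id>
<action>0</action>
</event>
<event rev="1.2">
<event_id>118</event_id>
<action>0</action>
</event>"""

events = merge_lines(raw.splitlines(), r'<event\s', r'</event>')
print(len(events))
```

The closing-tag rule means an event can be completed as soon as `</event>` arrives, without waiting for the next `<event ...>` line to show up.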

Thanks for all your help!
Steve


Splunk Employee

Can you also clarify whether your original data is actually broken into lines? That is, is the XML all on one line and you're just showing it split up for clarity, or are there actual newline characters breaking up the XML? And are there newlines between events? It would certainly help to see more than one event, exactly as it is in your file (use the "code" formatting button in the editor and the preview), since we're trying to split between events.


Splunk Employee

How much data are you sending through when you test? Keep in mind that with line merging rules, Splunk will hold on to events until it knows it can end one, so if SHOULD_LINEMERGE is on, it will hang on to things until it sees the BREAK_ONLY_BEFORE regex. If you don't specify a BREAK_ONLY_BEFORE, it splits on BREAK_ONLY_BEFORE_DATE, which isn't what you want here and is more of a crapshoot.


Path Finder

If I do get data through, it's only the line that it split on ()...


Path Finder

I did that, but I'm still not seeing the events trickle through. I've seen some date/time parsing errors, but nothing consistent in splunkd.log. I've tried turning SHOULD_LINEMERGE both false and true, to no avail.


Splunk Employee

Well, it might be easier (if your XML is line-broken) to just leave LINE_BREAKER at its default and use BREAK_ONLY_BEFORE = <event instead.


Path Finder

Thanks - I had put that .* in as a desperate test to get it to match anything :). I made this change, but the behavior is still the same - the event never seems to get indexed. There's no way to have any debug information show up for this stuff, is there?
