Multiline XML extraction...

Steve_Litras
Path Finder

So I have an xml formatted log added as a source, sourcetype'd as WSE_audit, and I'm trying to get it to basically split on the "event" element and do field extraction on each event. Here's some sample data:

<event rev="1.2">
<date>2010-05-12-15:17:27.279-07:00I-----</date>
<outcome status="0">0</outcome>
<originator blade="webseald" instance="clogin">
<component rev="1.4">authn</component>
<event_id>117</event_id>
<action>0</action>
<location>myhostname</location>
</originator>
<target resource="5">
<object></object>
</target>
<data>
<audit event="Start"/> </data>
</event>

in props.conf:

[WSE_audit]
LINE_BREAKER = .*(\<event\s+[^\>]*\>)
SHOULD_LINEMERGE = false
REPORT-wseaudit = xml-extr

in transforms.conf:

[xml-extr]
REGEX = \<(\w+)\>([^\<]+)\<\/\1\>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true

When I have these in place, these events simply never show up. I take them out, and the events start coming in. I've been staring at it for a while, and am sure I'm missing something stupid... What am I missing here?

1 Solution

gkanapathy
Splunk Employee

Your LINE_BREAKER is not a good regex to use. In particular, it's a bad idea to start it with a greedy .* match, because it will run all the way ahead to the end of the entire input stream (up to the limit, if there is one) and only then backtrack to the last <event before the end.

So you could just drop the .* from the regex.

BUT the contents of the first capture group ($1) in LINE_BREAKER are discarded, which I don't think you want. So

LINE_BREAKER = [\>\s]((?=\<event\s+[^\>]*\>))

is probably more like it. The [\>\s] should actually be unnecessary (and you might replace it with [.\s]), but there was what I would consider a bug (at least in 4.0) where Splunk doesn't seem to break if the overall regex match ($0) is zero length, even if the match succeeds.
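
To make the difference between the two patterns concrete, here is a rough Python sketch. It is only an illustration of the regex behavior, not how Splunk works internally: Python's re module stands in for Splunk's regex engine, split_on_breaker is a hypothetical helper that mimics the "group 1 is discarded" rule described above, and the three one-line events are made up.

```python
import re

def split_on_breaker(text, breaker):
    """Mimic the rule from the post: capture group 1 is thrown away,
    text before it ends one event, text after it starts the next.
    (Hypothetical helper, not a Splunk API.)"""
    events, pos = [], 0
    for m in breaker.finditer(text):
        events.append(text[pos:m.start(1)])
        pos = m.end(1)
    events.append(text[pos:])
    return [e.strip() for e in events if e.strip()]

# Three made-up events on a single line, separated by spaces.
EVENTS = ['<event rev="1.2"><event_id>%d</event_id></event>' % n for n in (1, 2, 3)]
STREAM = " ".join(EVENTS)

# Original pattern: the greedy .* runs to the end, then backtracks to the LAST <event tag.
GREEDY = re.compile(r'.*(<event\s+[^>]*>)')
# Suggested pattern: group 1 is an empty lookahead, so no event text is discarded.
LOOKAHEAD = re.compile(r'[>\s]((?=<event\s+[^>]*>))')

greedy_events = split_on_breaker(STREAM, GREEDY)
lookahead_events = split_on_breaker(STREAM, LOOKAHEAD)

print(len(greedy_events), len(lookahead_events))
```

In this toy run the greedy pattern glues the first two events together and swallows the opening tag of the third, while the lookahead pattern yields the three events intact.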

Steve_Litras
Path Finder

Using BREAK_ONLY_BEFORE and MUST_BREAK_AFTER, it works properly. And there were indeed line breaks in the data, and each event is usually somewhere between 10-20 lines.

Thanks for all your help!
Steve
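
For anyone landing here later, the line-merging approach can be pictured with a small Python simulation. This is a sketch under assumptions, not Splunk's implementation: the two patterns are guesses (the exact BREAK_ONLY_BEFORE and MUST_BREAK_AFTER values used were not posted), merge_lines is a hypothetical helper, and the sample lines are trimmed from the event in the question.

```python
import re

# Assumed patterns; the thread does not show the exact values that were used.
BREAK_ONLY_BEFORE = re.compile(r'<event\s')
MUST_BREAK_AFTER = re.compile(r'</event>')

def merge_lines(lines):
    """Group single lines into multi-line events (hypothetical helper)."""
    events, current = [], []
    for line in lines:
        # A line matching BREAK_ONLY_BEFORE starts a new event.
        if current and BREAK_ONLY_BEFORE.search(line):
            events.append("\n".join(current))
            current = []
        current.append(line)
        # A line matching MUST_BREAK_AFTER closes the event immediately.
        if MUST_BREAK_AFTER.search(line):
            events.append("\n".join(current))
            current = []
    if current:
        events.append("\n".join(current))
    return events

ONE_EVENT = [
    '<event rev="1.2">',
    '<date>2010-05-12-15:17:27.279-07:00I-----</date>',
    '<outcome status="0">0</outcome>',
    '<data>',
    '<audit event="Start"/> </data>',
    '</event>',
]

merged = merge_lines(ONE_EVENT + ONE_EVENT)
print(len(merged))
```

The point of the closing pattern is that each event is complete as soon as its </event> line arrives, without waiting for the next <event line.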

gkanapathy
Splunk Employee

Can you also clarify: is your original data actually broken into lines or not? That is, is the XML all on one line and you're just showing it split up above for clarity, or are there actual newline characters breaking up the XML? And are there newlines between events? It would certainly help to see more than one event, exactly as it is in your file (use the "code" formatting button in the editor and the preview), since we're trying to split between events.

gkanapathy
Splunk Employee

How much data are you sending through when you test? Keep in mind that with line merging rules, Splunk will hold on to events until it knows it can end one, so if SHOULD_LINEMERGE is on, it will hang on to things until it sees the BREAK_ONLY_BEFORE regex. If you don't specify a BREAK_ONLY_BEFORE, it splits on BREAK_ONLY_BEFORE_DATE, which isn't what you want here and is more of a crapshoot.
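
As a rough mental model of that "hold on until it can end one" behavior, here is a tiny Python sketch. It only models the statement above and is not Splunk code: LineMerger is a hypothetical class, and the <event\s pattern is an assumed BREAK_ONLY_BEFORE value.

```python
import re

class LineMerger:
    """Toy model: an event is released only when the next break line shows up."""

    def __init__(self, break_only_before):
        self.break_only_before = re.compile(break_only_before)
        self.pending = []  # lines of the event still being held

    def feed(self, line):
        """Return a finished event, or None while still waiting."""
        finished = None
        if self.pending and self.break_only_before.search(line):
            finished = "\n".join(self.pending)
            self.pending = []
        self.pending.append(line)
        return finished

merger = LineMerger(r'<event\s')  # assumed pattern
first = ['<event rev="1.2">', '<event_id>117</event_id>', '</event>']

# Feeding one complete event releases nothing yet.
emitted = [merger.feed(line) for line in first]
# Only the start of the next event releases the first one.
released = merger.feed('<event rev="1.2">')
print(emitted, released is not None)
```

So a test file containing a single event can look like nothing is being indexed, simply because nothing has arrived yet to end that event.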

Steve_Litras
Path Finder

If I do get data through, it's only the line that it split on ()...

Steve_Litras
Path Finder

I did that, but I'm still not seeing the events trickle through. I've seen some date/time parsing errors, but nothing consistent in splunkd.log. I've tried turning SHOULD_LINEMERGE both false and true, to no avail.

gkanapathy
Splunk Employee

Well, it might be easier (if your XML is line-broken) to just leave LINE_BREAKER at its default and use BREAK_ONLY_BEFORE = <event instead.

Steve_Litras
Path Finder

Thanks - I had put that .* in as a desperate test to get it to match anything :). I made this change, but the behavior is still the same - the event never seems to get indexed. Is there any way to have debug information show up for this stuff?
