<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Mulltiline XML extraction... in Getting Data In</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14542#M1506</link>
    <description>&lt;P&gt;So I have an xml formatted log added as a source, sourcetype'd as WSE_audit, and I'm trying to get it to basically split on the "event" element and do field extraction on each event. Here's some sample data:&lt;/P&gt;

&lt;P&gt;&amp;lt;event rev="1.2"&amp;gt;&lt;BR /&gt;
&amp;lt;date&amp;gt;2010-05-12-15:17:27.279-07:00I-----&amp;lt;/date&amp;gt;&lt;BR /&gt;
&amp;lt;outcome status="0"&amp;gt;0&amp;lt;/outcome&amp;gt;&lt;BR /&gt;
&amp;lt;originator blade="webseald" instance="clogin"&amp;gt;&lt;BR /&gt;&amp;lt;component rev="1.4"&amp;gt;authn&amp;lt;/component&amp;gt;&lt;BR /&gt;&amp;lt;event_id&amp;gt;117&amp;lt;/event_id&amp;gt;&lt;BR /&gt;
&amp;lt;action&amp;gt;0&amp;lt;/action&amp;gt;&lt;BR /&gt;
&amp;lt;location&amp;gt;myhostname&amp;lt;/location&amp;gt;&lt;BR /&gt;
&amp;lt;/originator&amp;gt;&lt;BR /&gt;
&amp;lt;target resource="5"&amp;gt;&lt;BR /&gt;&amp;lt;object&amp;gt;&amp;lt;/object&amp;gt;&lt;BR /&gt;&amp;lt;/target&amp;gt;&lt;BR /&gt;
&amp;lt;data&amp;gt;&lt;BR /&gt;
&amp;lt;audit event="Start"/&amp;gt;
&amp;lt;/data&amp;gt;&lt;BR /&gt;
&amp;lt;/event&amp;gt;&lt;/P&gt;

&lt;P&gt;in props.conf:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[WSE_audit]
LINE_BREAKER = .*(\&amp;lt;event\s+[^\&amp;gt;]*\&amp;gt;)
SHOULD_LINEMERGE = false
REPORT-wseaudit = xml-extr
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;in transforms.conf:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[xml-extr]
REGEX = \&amp;lt;(\w+)\&amp;gt;([^\&amp;lt;]+)\&amp;lt;/\1\&amp;gt;
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;When I have these in place, these events simply never show up. I take them out, and the events start coming in. I've been staring at it for a while, and am sure I'm missing something stupid... What am I missing here?&lt;/P&gt;</description>
    <pubDate>Sat, 29 May 2010 05:49:36 GMT</pubDate>
    <dc:creator>Steve_Litras</dc:creator>
    <dc:date>2010-05-29T05:49:36Z</dc:date>
    <item>
      <title>Mulltiline XML extraction...</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14542#M1506</link>
      <description>&lt;P&gt;So I have an xml formatted log added as a source, sourcetype'd as WSE_audit, and I'm trying to get it to basically split on the "event" element and do field extraction on each event. Here's some sample data:&lt;/P&gt;

&lt;P&gt;&amp;lt;event rev="1.2"&amp;gt;&lt;BR /&gt;
&amp;lt;date&amp;gt;2010-05-12-15:17:27.279-07:00I-----&amp;lt;/date&amp;gt;&lt;BR /&gt;
&amp;lt;outcome status="0"&amp;gt;0&amp;lt;/outcome&amp;gt;&lt;BR /&gt;
&amp;lt;originator blade="webseald" instance="clogin"&amp;gt;&lt;BR /&gt;&amp;lt;component rev="1.4"&amp;gt;authn&amp;lt;/component&amp;gt;&lt;BR /&gt;&amp;lt;event_id&amp;gt;117&amp;lt;/event_id&amp;gt;&lt;BR /&gt;
&amp;lt;action&amp;gt;0&amp;lt;/action&amp;gt;&lt;BR /&gt;
&amp;lt;location&amp;gt;myhostname&amp;lt;/location&amp;gt;&lt;BR /&gt;
&amp;lt;/originator&amp;gt;&lt;BR /&gt;
&amp;lt;target resource="5"&amp;gt;&lt;BR /&gt;&amp;lt;object&amp;gt;&amp;lt;/object&amp;gt;&lt;BR /&gt;&amp;lt;/target&amp;gt;&lt;BR /&gt;
&amp;lt;data&amp;gt;&lt;BR /&gt;
&amp;lt;audit event="Start"/&amp;gt;
&amp;lt;/data&amp;gt;&lt;BR /&gt;
&amp;lt;/event&amp;gt;&lt;/P&gt;

&lt;P&gt;in props.conf:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[WSE_audit]
LINE_BREAKER = .*(\&amp;lt;event\s+[^\&amp;gt;]*\&amp;gt;)
SHOULD_LINEMERGE = false
REPORT-wseaudit = xml-extr
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;in transforms.conf:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[xml-extr]
REGEX = \&amp;lt;(\w+)\&amp;gt;([^\&amp;lt;]+)\&amp;lt;/\1\&amp;gt;
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;When I have these in place, these events simply never show up. I take them out, and the events start coming in. I've been staring at it for a while, and am sure I'm missing something stupid... What am I missing here?&lt;/P&gt;</description>
      <pubDate>Sat, 29 May 2010 05:49:36 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14542#M1506</guid>
      <dc:creator>Steve_Litras</dc:creator>
      <dc:date>2010-05-29T05:49:36Z</dc:date>
    </item>
    <item>
      <title>Re: Mulltiline XML extraction...</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14543#M1507</link>
      <description>&lt;P&gt;Your &lt;CODE&gt;LINE_BREAKER&lt;/CODE&gt; is not a good regex to use. In particular, it's a bad idea to start it with a greedy &lt;CODE&gt;.*&lt;/CODE&gt; match, because it will run all the way ahead to the end of the entire input stream (up to the limit, if there is one) and only then backtrack to the last &lt;CODE&gt;&amp;lt;event&lt;/CODE&gt; before the end.&lt;/P&gt;

&lt;P&gt;So you could just drop the &lt;CODE&gt;.*&lt;/CODE&gt; from the regex.&lt;/P&gt;
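
&lt;P&gt;As a sketch of that suggestion only (this is simply the stanza from the question with the leading &lt;CODE&gt;.*&lt;/CODE&gt; removed, not a tested configuration), props.conf would look something like this. Note that it still has the capture-group problem described next:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[WSE_audit]
# Assumption: same stanza as in the question, minus the greedy ".*" prefix.
# Caveat: whatever the first capture group matches is dropped from the data,
# so written this way the opening event tag itself would be thrown away.
LINE_BREAKER = (\&amp;lt;event\s+[^\&amp;gt;]*\&amp;gt;)
SHOULD_LINEMERGE = false
&lt;/CODE&gt;&lt;/PRE&gt;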

&lt;P&gt;BUT the contents of the first capture group ($1) in LINE_BREAKER are discarded, which I don't think you want. So&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;LINE_BREAKER = [\&amp;gt;\s]((?=\&amp;lt;event\s+[^\&amp;gt;]*\&amp;gt;))
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;is probably more like it. The &lt;CODE&gt;[\&amp;gt;\s]&lt;/CODE&gt; should actually be unnecessary (and you might replace it with &lt;CODE&gt;[.\s]&lt;/CODE&gt;), but there was what I would consider a bug (at least in 4.0): Splunk doesn't seem to break if the overall regex match ($0) is zero-length, even when the match succeeds.&lt;/P&gt;</description>
      <pubDate>Sat, 29 May 2010 07:30:13 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14543#M1507</guid>
      <dc:creator>gkanapathy</dc:creator>
      <dc:date>2010-05-29T07:30:13Z</dc:date>
    </item>
    <item>
      <title>Re: Mulltiline XML extraction...</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14544#M1508</link>
      <description>&lt;P&gt;Thanks - I had put that .* in as a desperate test to get it to match anything :). I made this change, but the behavior is still the same - the event never seems to get indexed. There's no way to have any debug information show up for this stuff, is there?&lt;/P&gt;</description>
      <pubDate>Sat, 29 May 2010 10:37:57 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14544#M1508</guid>
      <dc:creator>Steve_Litras</dc:creator>
      <dc:date>2010-05-29T10:37:57Z</dc:date>
    </item>
    <item>
      <title>Re: Mulltiline XML extraction...</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14545#M1509</link>
      <description>&lt;P&gt;Well, it might be easier (if your XML is line-broken) to just leave LINE_BREAKER at its default and use BREAK_ONLY_BEFORE = &amp;lt;event instead (which needs SHOULD_LINEMERGE = true to take effect).&lt;/P&gt;</description>
      <pubDate>Mon, 28 Sep 2020 09:13:06 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14545#M1509</guid>
      <dc:creator>gkanapathy</dc:creator>
      <dc:date>2020-09-28T09:13:06Z</dc:date>
    </item>
    <item>
      <title>Re: Mulltiline XML extraction...</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14546#M1510</link>
      <description>&lt;P&gt;I did that, but I'm still not seeing the events trickle through. I've seen some date/time parsing errors, but nothing consistent in splunkd.log. I've tried turning SHOULD_LINEMERGE both false and true, to no avail.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Jun 2010 05:19:01 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14546#M1510</guid>
      <dc:creator>Steve_Litras</dc:creator>
      <dc:date>2010-06-02T05:19:01Z</dc:date>
    </item>
    <item>
      <title>Re: Mulltiline XML extraction...</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14547#M1511</link>
      <description>&lt;P&gt;If I do get data through, it's only the line that it split on (&amp;lt;event rev="1.2"&amp;gt;)...&lt;/P&gt;</description>
      <pubDate>Wed, 02 Jun 2010 05:21:50 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14547#M1511</guid>
      <dc:creator>Steve_Litras</dc:creator>
      <dc:date>2010-06-02T05:21:50Z</dc:date>
    </item>
    <item>
      <title>Re: Mulltiline XML extraction...</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14548#M1512</link>
      <description>&lt;P&gt;How much data are you sending through when you test? Keep in mind that with line-merging rules, Splunk holds on to an event until it knows it can end it, so if SHOULD_LINEMERGE is on, it will hang on to things until it sees the BREAK_ONLY_BEFORE regex. If you don't specify a BREAK_ONLY_BEFORE, it splits on BREAK_ONLY_BEFORE_DATE, which isn't what you want here and is more of a crapshoot.&lt;/P&gt;</description>
      <pubDate>Mon, 28 Sep 2020 09:13:15 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14548#M1512</guid>
      <dc:creator>gkanapathy</dc:creator>
      <dc:date>2020-09-28T09:13:15Z</dc:date>
    </item>
    <item>
      <title>Re: Mulltiline XML extraction...</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14549#M1513</link>
      <description>&lt;P&gt;Can you also clarify whether your original data is actually broken into lines? I.e., is the XML all on one line and you're just showing it split up above for clarity, or are there actual newline characters breaking up the XML? And are there newlines between events? It would certainly help to see more than one event, exactly as it is in your file (use the "code" formatting button in the editor and the preview), since we're trying to split &lt;EM&gt;between&lt;/EM&gt; events.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Jun 2010 05:47:22 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14549#M1513</guid>
      <dc:creator>gkanapathy</dc:creator>
      <dc:date>2010-06-02T05:47:22Z</dc:date>
    </item>
    <item>
      <title>Re: Mulltiline XML extraction...</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14550#M1514</link>
      <description>&lt;P&gt;Using BREAK_ONLY_BEFORE and MUST_BREAK_AFTER, it works properly. There were indeed line breaks in the data, and each event is usually somewhere between 10 and 20 lines.&lt;/P&gt;
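
&lt;P&gt;For anyone landing here later, a minimal sketch of what that combination might look like in props.conf. The exact regexes used aren't shown in this thread, so the two patterns below are assumptions based on the sample event in the original question; both settings only take effect when SHOULD_LINEMERGE is true, and LINE_BREAKER is left at its default:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[WSE_audit]
# Line merging must be on for the two settings below to apply.
SHOULD_LINEMERGE = true
# Assumed pattern: start a new event at each opening event tag.
BREAK_ONLY_BEFORE = \&amp;lt;event\s
# Assumed pattern: close the current event after the closing event tag.
MUST_BREAK_AFTER = \&amp;lt;/event\&amp;gt;
&lt;/CODE&gt;&lt;/PRE&gt;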

&lt;P&gt;Thanks for all your help!&lt;BR /&gt;
Steve&lt;/P&gt;</description>
      <pubDate>Mon, 28 Sep 2020 09:13:36 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Mulltiline-XML-extraction/m-p/14550#M1514</guid>
      <dc:creator>Steve_Litras</dc:creator>
      <dc:date>2020-09-28T09:13:36Z</dc:date>
    </item>
  </channel>
</rss>

