Getting Data In

Splunk: why must XML sources be so complicated?

zenSplunk
Explorer

I have searched Splunkbase for a complete example of how to set up an XML file as an input and split it into events at the right elements. There does not appear to be a clean method for doing this in Splunk. Every example I have found is a regular-expression hack that stops working when I copy it and replace the XML element names with the names of the elements that contain my events.

The best practice at http://wiki.splunk.com/Deploy:HowToWorkWithXMLLogFiles does not appear to work when I change the tag in the example to my tag.

Why can't I find better examples of how to use XML as a source input for Splunk? It seems I have to be a Splunk expert to build any kind of input.

For example, some XML outputs have no namespaces.
.........

Then there are outputs that do have namespaces.
.........

I am not able to derive the regular expressions and work out the right combination of line break, break only before, line merge, and max events allowed, while also injecting a made-up line break to keep the magic parsing engine happy.

Where is the XPath-style syntax that would let me say: Splunk, if you execute this expression, each result will be an individual event that you can use?

end rant...

How does one define, in props.conf, a way to input XML into Splunk?

-------- more details
Let's say I have files generated by an application that contain XML-format logs. Each file contains a collection of events that Splunk should process as individual events.

For example


<!-- log data .... -->

<!-- log data .... -->

<!-- log data .... -->


<!-- log data .... -->


<!-- log data .... -->

Each EventLogItem should be an event in Splunk.

Where XML differs from flat files is that whitespace between XML elements should be ignored. The file may or may not contain pretty-printed XML.

<!-- log data .... --><!-- log data .... --><!-- log data .... --><!-- log data .... --><!-- log data .... -->

The XML examples also do not account for cases where elements carry a namespace prefix instead of being declared under the default namespace.

Example.

This is where Splunk's use of regular expressions to parse XML becomes difficult. Regular expressions are not a form of parsing that everyone can pick up and use quickly.

The regular expression for XML should not assume XML tags begin at the start of a line.
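To make the requirement concrete, here is a minimal Python sketch of the kind of pattern being asked for. It is not Splunk configuration and not how Splunk parses input; it only shows a regular expression that finds each EventLogItem element wherever it starts on a line, with or without a namespace prefix. The sample documents, the `Logs` root name, the `log` prefix, and the function name `split_events` are all made up for illustration.

```python
import re

# Matches one EventLogItem element, with an optional namespace prefix
# (e.g. <log:EventLogItem>). The backreference \1 requires the closing tag
# to use the same prefix as the opening tag. DOTALL lets an element span lines.
EVENT_RE = re.compile(
    r"<((?:\w+:)?)EventLogItem\b[^>]*>.*?</\1EventLogItem>",
    re.DOTALL,
)


def split_events(xml_text):
    """Return each EventLogItem element as its own string.

    Assumes EventLogItem elements are not nested and not self-closing.
    The root element and the whitespace between elements are never matched,
    so they are dropped.
    """
    return [m.group(0) for m in EVENT_RE.finditer(xml_text)]


# Hypothetical sample: everything on one line, no pretty-printing.
one_line = "<Logs><EventLogItem>a</EventLogItem><EventLogItem>b</EventLogItem></Logs>"

# Hypothetical sample: pretty-printed, with a namespace prefix.
pretty = """<log:Logs xmlns:log="urn:example">
  <log:EventLogItem id="1">
    a
  </log:EventLogItem>
  <log:EventLogItem id="2">b</log:EventLogItem>
</log:Logs>"""

for event in split_events(one_line) + split_events(pretty):
    print(repr(event))
```

Both samples yield two events each, and neither result includes the root element.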

zenSplunk
Explorer

The solution should also omit the root element. It is only there to group the events in the XML document and does not need to be in Splunk as data.


lguinn2
Legend

Assuming that your events look something like this:

<root><event>....</event><event>.....</event></root>

And assuming the sourcetype of the input has been defined as "myxml", you could put the following in $SPLUNK_HOME/etc/system/local/props.conf:

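[myxml]
# Merge lines into multi-line events
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE_DATE = false
# Maximum number of lines per event (the default is 256)
MAX_EVENTS = 1024
# Break after any line that contains the closing root tag
MUST_BREAK_AFTER = \</root>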

This says "Break after any line that contains the terminating root element"

Note that Splunk assigns whole lines to events, not partial lines. You cannot break an event in the middle of a line. Also note that Splunk only allows 256 lines max per event, by default. I have set that higher here.
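The point about whole lines can be illustrated with a toy Python model. This is a deliberate simplification for illustration, not Splunk's actual algorithm: lines are accumulated into an event, and the event is closed after any line that matches the break pattern. The function name `merge_lines` and the sample inputs are hypothetical.

```python
import re


def merge_lines(text, must_break_after):
    """Toy model of line merging: group whole lines into events and close
    the current event after any line that matches must_break_after.

    A simplification for illustration only, not Splunk's implementation.
    """
    pattern = re.compile(must_break_after)
    events, current = [], []
    for line in text.splitlines():
        current.append(line)
        if pattern.search(line):
            events.append("\n".join(current))
            current = []
    if current:  # trailing lines after the last match
        events.append("\n".join(current))
    return events


pretty = "<root>\n<event>a</event>\n<event>b</event>\n</root>"
one_line = "<root><event>a</event><event>b</event></root>"

# Breaking after </root> keeps the whole document as a single event.
print(len(merge_lines(pretty, r"</root>")))     # 1

# Breaking after </event> separates the events when each is on its own line,
# but the opening root tag sticks to the first event and the closing root tag
# is left over as its own fragment.
print(merge_lines(pretty, r"</event>"))

# With everything on one line, there is no line boundary to break at.
print(len(merge_lines(one_line, r"</event>")))  # 1
```

In this model, a file that is not pretty-printed always comes out as one event, whichever element the break pattern names.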

There is no XPath processing at the time events are indexed. I don't know why, but I suspect that it would simply take too long. Regex is much, much faster.

Once you have the events in Splunk, the spath and xmlkv commands make working with XML data easier.

lguinn2
Legend

Try setting it this way

MUST_BREAK_AFTER=\

The backslash at the beginning is essential.


lguinn2
Legend

The MAX_EVENTS setting does not control the maximum number of events; it controls the maximum number of lines allowed in multi-line events. So perhaps it is not needed here, but I usually mention it for XML inputs because the default is often too small. I doubt that it is causing the problems.


zenSplunk
Explorer

This will not work, as it does not pull the nodes out as individual Splunk events.

When MUST_BREAK_AFTER is set this way, things become really strange in the events, possibly due to the MAX_EVENTS setting, which is awkward to manage when dealing with XML.


zenSplunk
Explorer

Would it work to set MUST_BREAK_AFTER=</event> so that Splunk receives each event XML fragment as a Splunk event?

I will give this a test and see what happens.


lguinn2
Legend

I might have found a typo in the wiki.

The community would be happy to help, if we understood a little bit more about your data.

I normally would not say "read this: http://docs.splunk.com/Documentation/Splunk/latest/Data/Indexmulti-lineevents", but I have little information on which to base other suggestions.
