I have searched Splunkbase for a complete example of how to use an XML file as an input and split it into events at the right elements. There does not appear to be a clean method for doing this in Splunk. Every example has been a regular expression hack that does not work when I copy it and replace the XML element names with the names of the elements that contain the events I want to extract from my XML file.
The best practice at http://wiki.splunk.com/Deploy:HowToWorkWithXMLLogFiles does not appear to work when I change the tag in the example to my tag.
Why can't I find better examples of how to use XML as a source input for Splunk? It seems that I have to be a Splunk expert to build any kind of input.
For example, in XML there are outputs that have no namespaces.
Then there are outputs that do have namespaces.
I am not able to derive the regular expressions and figure out the right combination of line break, break only before, line merge, and max events allowed, all while injecting a made-up line break to keep the magic parsing engine happy.
Where is the XPath-style syntax that would let me say: Splunk, if you execute this expression, each result will be an individual event that you can use?
end rant...
How does one configure props.conf to input XML into Splunk?
-------- more details
Let's say I have files generated by an application containing XML format logs. Each file contains a collection of events that should be processed as individual events by splunk.
For example
<!-- log data .... -->
<!-- log data .... -->
<!-- log data .... -->
<!-- log data .... -->
<!-- log data .... -->
Each EventLogItem should be an event in splunk.
Where XML differs from flat files is that the whitespace should be ignored between the XML elements. The file may or may not have pretty formatted XML.
The XML examples also do not account for cases where the XML might be prefixed instead of declared under the default namespace.
Example.
This is where Splunk's use of regular expressions to parse XML becomes difficult. Regular expressions are not always the easiest form of parsing for everyone to pick up and use quickly.
The regular expression for XML should not assume XML tags begin at the start of a line.
The solution should also omit the root element. It is only there to group the events in the XML document and does not need to be in Splunk as data.
Assuming that your events look something like this:
<root><event>....</event><event>.....</event></root>
And the sourcetype of the input has been defined as "myxml" - you could put the following in $SPLUNK_HOME/etc/system/local/props.conf
[myxml]
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE_DATE = false
MAX_EVENTS = 1024
MUST_BREAK_AFTER = \</root>
This says "Break after any line that contains the terminating root element"
Note that Splunk assigns whole lines to events, not partial lines. You cannot break an event in the middle of a line. Also note that Splunk only allows 256 lines max per event, by default. I have set that higher here.
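The whole-line restriction above applies when line merging is on. One alternative, offered here as an untested sketch rather than a confirmed fix for this data, is to turn line merging off and let LINE_BREAKER mark the boundaries directly. It is matched against the raw stream, so it should not matter whether a tag starts a line. The stanza name myxml and the <event> tag come from the example above; the regex itself is my assumption.

```
[myxml]
# Let LINE_BREAKER alone decide where events start and end
SHOULD_LINEMERGE = false
# Start a new event before every <event>; whatever the first capture
# group matches (here, optional whitespace) is dropped from the data
LINE_BREAKER = (\s*)<event>
```

With this sketch the opening <root> would still arrive as a tiny first event, and the closing </root> would stay attached to the last event, so the root element is not fully omitted. It is worth checking the result on a test index before relying on it.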
There is no xpath processing at the time events are indexed. I don't know why, but I suspect that it would simply take too long. Regex is much much faster.
Once you have the events in Splunk, there is an spath command and an xmlkv command to make working with XML data easier.
Try setting it this way
MUST_BREAK_AFTER=\
The backslash at the beginning is essential
The MAX_EVENTS setting does not control the maximum number of events; it controls the maximum number of lines allowed in multi-line events. So perhaps it is not needed here, but I usually mention it for XML inputs because the default is often too small. I doubt that it is causing the problems.
This will not work as it does not pull
When MUST_BREAK_AFTER is set to use things become really strange in the events possibly due to the MAX_EVENTS setting which becomes strange to manage when dealing with XML.
Would it work to set MUST_BREAK_AFTER=</event> so that Splunk receives each event XML fragment as a separate Splunk event?
I will give this a test and see what happens.
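The variant asked about here might look like the following sketch. It reuses the myxml stanza from the answer and only swaps the closing tag. Since line merging assigns whole lines to events (as noted in the answer), it assumes every </event> ends a line in the file, which has not been verified for this data.

```
[myxml]
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE_DATE = false
# Maximum number of lines per merged event, not a count of events
MAX_EVENTS = 1024
# Close the current event after any line containing </event>
# (leading backslash kept as in the answer's MUST_BREAK_AFTER example)
MUST_BREAK_AFTER = \</event>
```

If several events share one physical line, this setting alone would not separate them, because the break happens only after the whole line.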
I might have found a typo in the wiki.
The community would be happy to help, if we understood a little bit more about your data.
I normally would not say "read this: http://docs.splunk.com/Documentation/Splunk/latest/Data/Indexmulti-lineevents" but I have little information on which to base other suggestions.