Hello everybody,
I am facing some challenges with a custom log file that contains blocks of xml, each wrapped in a header line and a trailer line.
The file looks something like this:
[1][DATA]BEGIN --- - 06:03:09[012]
<xml>
<tag1>value</tag1>
<nestedTag>
<tag2>another value</tag2>
</nestedTag>
</xml>
[1][DATA]END --- - 06:03:09[012]
[1][DATA]BEGIN --- - 07:03:09[123]
<xml>
<tag1>some stuff</tag1>
<nestedTag>
<tag2>other stuff</tag2>
</nestedTag>
</xml>
[1][DATA]END --- - 07:03:09[123]
[1][DATA]BEGIN --- - 08:03:09[456]
<xml>
<tag1>some more data</tag1>
<nestedTag>
<tag2>fooband a bit more</tag2>
</nestedTag>
</xml>
[1][DATA]END --- - 08:03:09[456]
It is worth noting that the xml parts can be very large.
I would like to take advantage of Splunk's automatic xml parsing, since extracting the fields manually is not realistic in this case. However, the square bracket lines around each xml block seem to prevent the xml parser from doing its job, and I get no field extraction.
So, what I would like to do is:
- Convert the "data begin" line (the one with the square brackets before each xml block) into an xml formatted line, so that I can use it for the time of the event (the date itself is encoded in the filename...) and let Splunk parse the rest of the xml data automatically
- Strip out the "data end" line after each block of xml. These lines are not useful, as they carry the same time as the "data begin" line.
- Aggregate the xml lines of the same block into one event
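To make the three goals above concrete, here is a minimal Python sketch of the intended result, run outside Splunk. It is only an illustration of the target output, not a Splunk configuration: the names `SAMPLE`, `BLOCK` and `to_events` are made up for this example, and the sample is a shortened version of the log shown earlier.

```python
import re

# Shortened sample in the same shape as the log above.
SAMPLE = """[1][DATA]BEGIN --- - 06:03:09[012]
<xml>
<tag1>value</tag1>
</xml>
[1][DATA]END --- - 06:03:09[012]
[1][DATA]BEGIN --- - 07:03:09[123]
<xml>
<tag1>some stuff</tag1>
</xml>
[1][DATA]END --- - 07:03:09[123]
"""

# One block = BEGIN header line, xml body, END trailer line.
BLOCK = re.compile(
    r"^\[1\]\[DATA\]BEGIN --- - (\d{2}:\d{2}:\d{2})\[\d+\]\n"  # header: capture the time
    r"(.*?)"                                                   # xml body (lazy, spans lines)
    r"^\[1\]\[DATA\]END[^\n]*\n?",                             # trailer: matched but dropped
    re.MULTILINE | re.DOTALL,
)


def to_events(text):
    """Return one string per block: a <time> element followed by the xml body."""
    return [f"<time>{m.group(1)}</time>\n{m.group(2)}" for m in BLOCK.finditer(text)]


events = to_events(SAMPLE)
for event in events:
    print(event)
    print("-" * 20)
```

Each resulting event starts with a `<time>` element, contains the xml lines of exactly one block, and no longer contains any square bracket line.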
What I have tried with props.conf and transforms.conf:
props.conf
[my_sourcetype]
BREAK_ONLY_BEFORE_DATE =
DATETIME_CONFIG =
KV_MODE = xml
LINE_BREAKER = \]([\r\n]+)\[1\]\[DATA\]BEGIN
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = false
category = Custom
pulldown_type = true
# only with transforms.conf v1
TRANSFORMS-full=my_transform
# only with transforms.conf v2
TRANSFORMS-begin=begin
TRANSFORMS-end=end
transforms.conf (version 1):
[my_transform]
REGEX = (?m)\[1\]\[DATA\]BEGIN --- - (\d{2}:\d{2}:\d{2}).*([\r\n]+)([^\[]*)\[1\]\[DATA\]END.*$[\r\n]*
FORMAT = <time>$1</time>$2$3
WRITE_META = true
DEST_KEY = _raw
transforms.conf (version 2):
[begin]
REGEX = (?m)^\[1\]\[DATA\]BEGIN --- - (\d{2}:\d{2}:\d{2}).*$
FORMAT = <time>$1</time>
WRITE_META = true
DEST_KEY = _raw
[end]
REGEX = (?m)^\[1\]\[DATA\]END.*$
DEST_KEY = queue
FORMAT = nullQueue
With the various combinations listed here, I got all sorts of results:
- well separated events but with square brackets left over
- one big block with all events aggregated together and no override of the square bracket lines
- one event with the begin square bracket line truncated at 10k characters
- 4 events with one "time" xml tag but nothing else...
Could anybody help me out with this use case?
Many thanks,
Alex
Hi @Alex_LC,
You can try the following:
props.conf
[my_sourcetype]
LINE_BREAKER = (\[1\]\[DATA\]BEGIN[-\s]+)
SHOULD_LINEMERGE = false
TRANSFORMS-transform2xml = transform2xml
KV_MODE = xml
transforms.conf
[transform2xml]
REGEX = ([^\[]+)(\[\d+\][\r\n]+<xml>)([^\[]+)(<\/xml>[^$]+)
FORMAT = <xml><time>$1</time>$3</xml>
DEST_KEY = _raw
It should create a separate event for each block, with a time field, like this:
<xml><time>08:03:09</time>
<tag1>some more data</tag1>
<nestedTag>
<tag2>fooband a bit more</tag2>
</nestedTag>
</xml>
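One detail of the REGEX above is worth illustrating: `[^$]+` is a negated character class for the literal `$` character, so it consumes everything up to the end of the input rather than up to the end of a line. The Python sketch below is an offline emulation, not Splunk itself; it assumes that LINE_BREAKER discards the text matched by its capture group and that, with DEST_KEY = _raw, the FORMAT result replaces the whole event. The names `BLOCK`, `RAW`, `matches_whole` and `rewritten` are made up for this example.

```python
import re

# The REGEX from transforms.conf, unchanged.
TRANSFORM = re.compile(r"([^\[]+)(\[\d+\][\r\n]+<xml>)([^\[]+)(<\/xml>[^$]+)")

# Two shortened blocks in the same shape as the sample log.
BLOCK = (
    "[1][DATA]BEGIN --- - {t}[{n}]\n"
    "<xml>\n<tag1>{v}</tag1>\n</xml>\n"
    "[1][DATA]END --- - {t}[{n}]\n"
)
RAW = BLOCK.format(t="06:03:09", n="012", v="value") + BLOCK.format(
    t="07:03:09", n="123", v="some stuff"
)

# Case 1: the regex applied to the whole file at once (first header prefix stripped).
# Group 4 ends in [^$]+, which swallows every remaining character, so only the
# first block can ever match.
whole = RAW[len("[1][DATA]BEGIN --- - "):]
matches_whole = TRANSFORM.findall(whole)
print(len(matches_whole))

# Case 2: the regex applied per event, after emulating LINE_BREAKER.
events = [e for e in re.split(r"\[1\]\[DATA\]BEGIN[-\s]+", RAW) if e]
rewritten = [
    TRANSFORM.search(e).expand(r"<xml><time>\1</time>\3</xml>") for e in events
]
for event in rewritten:
    print(event)
```

Applied per event, the greedy tail only removes the trailing "data end" line, which is the intended effect. Applied to the whole file in a regex tester, the same tail consumes all later blocks, so a single match is reported.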
Hi @scelikok
Thanks a lot for your reply, it was most helpful and it helped me find a solution.
However, I realised that the snippet I had provided differed subtly from the actual data, so I had to adapt your solution slightly. That being said, I was under the impression that your regex was not quite right either: I ran it through regex101 first and it only matched the first xml block (I stripped the beginning of the square bracket line to emulate the line breaker in props.conf).
So, to recap, here is a more accurate example of the log:
[1][DATA]BEGIN --- - 06:03:09[012]
<?xml version="1.0" encoding="UTF-8"?>
<root>
<tag1>value</tag1>
<nestedTag>
<tag2>another value</tag2>
</nestedTag>
</root>
[1][DATA]END --- - 06:03:09[012]
[1][DATA]BEGIN --- - 07:03:09[123]
<?xml version="1.0" encoding="UTF-8"?>
<root>
<tag1>some stuff</tag1>
<nestedTag>
<tag2>other stuff</tag2>
</nestedTag>
</root>
[1][DATA]END --- - 07:03:09[123]
[1][DATA]BEGIN --- - 08:03:09[456]
<?xml version="1.0" encoding="UTF-8"?>
<root>
<tag1>some more data</tag1>
<nestedTag>
<tag2>fooband a bit more</tag2>
</nestedTag>
</root>
[1][DATA]END --- - 08:03:09[456]
Here is the props.conf I ended up using (as per @scelikok's suggestion):
[my_sourcetype]
LINE_BREAKER = (\[1\]\[DATA\]BEGIN[-\s]+)
SHOULD_LINEMERGE = false
TRANSFORMS-transform2xml = transform2xml
KV_MODE = xml
And here is the corresponding transforms.conf, slightly tweaked - I ended up being a bit more explicit on the end of the event and removed some of the capturing groups:
[transform2xml]
REGEX = ^([^\[]+)\[\d+\][\r\n]+(<\?xml.*>[^\[]+)\[1\]\[DATA\]END --- - [\d:]+\[\d+\][\r\n]*
FORMAT = <time>$1</time>$2
DEST_KEY = _raw
The result may not be perfectly well-formed xml, but it works as expected and the xml is now parsed automatically.
Thanks again for your help @scelikok !
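For anyone who wants to check the props.conf and transforms.conf above without indexing data, here is a rough offline emulation in Python. It is a sketch under assumptions, not Splunk's actual pipeline: it assumes LINE_BREAKER discards the text matched by its capture group, and that with DEST_KEY = _raw the FORMAT result replaces the whole event. The names `RAW`, `rewrite` and `events` are made up for this example, and the sample is shortened.

```python
import re

# Shortened sample in the same shape as the more accurate log above.
RAW = (
    "[1][DATA]BEGIN --- - 06:03:09[012]\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<root>\n<tag1>value</tag1>\n</root>\n"
    "[1][DATA]END --- - 06:03:09[012]\n"
    "[1][DATA]BEGIN --- - 07:03:09[123]\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<root>\n<tag1>some stuff</tag1>\n</root>\n"
    "[1][DATA]END --- - 07:03:09[123]\n"
)

# Emulates LINE_BREAKER: the matched text marks an event boundary and is dropped.
LINE_BREAKER = re.compile(r"\[1\]\[DATA\]BEGIN[-\s]+")

# The REGEX from transforms.conf, unchanged (split over two lines for readability).
TRANSFORM = re.compile(
    r"^([^\[]+)\[\d+\][\r\n]+(<\?xml.*>[^\[]+)"
    r"\[1\]\[DATA\]END --- - [\d:]+\[\d+\][\r\n]*"
)


def rewrite(event):
    """Emulate FORMAT = <time>$1</time>$2; leave the event as-is if REGEX misses."""
    m = TRANSFORM.search(event)
    return m.expand(r"<time>\1</time>\2") if m else event


events = [rewrite(e) for e in LINE_BREAKER.split(RAW) if e]
for event in events:
    print(event)
    print("-" * 20)
```

Each event comes out as a `<time>` element followed by the xml declaration and the `<root>` block, with no square bracket lines left. Because the xml declaration no longer sits at the very start of the text, a strict xml parser would reject the result, which is the "not perfectly well-formed" caveat mentioned above.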