Solved: Transforms on multiline event containing some xml

Alex_LC · ‎12-19-2024

Hello everybody,

I am facing some challenges with a custom log file that contains blocks of XML surrounded by header and footer lines.

The file looks something like this:

[1][DATA]BEGIN ---  - 06:03:09[012]
<xml>
    <tag1>value</tag1>
    <nestedTag>
        <tag2>another value</tag2>
    </nestedTag>
</xml>
[1][DATA]END   ---  - 06:03:09[012]

[1][DATA]BEGIN ---  - 07:03:09[123]
<xml>
    <tag1>some stuff</tag1>
    <nestedTag>
        <tag2>other stuff</tag2>
    </nestedTag>
</xml>
[1][DATA]END   ---  - 07:03:09[123]

[1][DATA]BEGIN ---  - 08:03:09[456]
<xml>
    <tag1>some more data</tag1>
    <nestedTag>
        <tag2>fooband a bit more</tag2>
    </nestedTag>
</xml>
[1][DATA]END   ---  - 08:03:09[456]

It is worth noting that the xml parts can be very large.

I would like to take advantage of Splunk's automatic XML parsing, as it is not realistic to do it manually in this case, but the square-bracket lines around each XML block seem to prevent the XML parser from doing its job, and I get no field extraction.

So, what I would like to do is:

Convert the "data begin" line with the square brackets, before each XML block, into an XML-formatted line, so that I can use it for the time of the event (the date itself is encoded in the filename...) and let Splunk parse the rest of the XML data automatically.
Strip out the "data end" line after each block of XML. These lines are not useful, as they carry the same time as the "data begin" line.
Aggregate the XML lines of the same block into one event.

What I have tried with props.conf and transforms.conf:

props.conf

[my_sourcetype]
BREAK_ONLY_BEFORE_DATE =
DATETIME_CONFIG =
KV_MODE = xml
LINE_BREAKER = \]([\r\n]+)\[1\]\[DATA\]BEGIN
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = false
category = Custom
pulldown_type = true
# only with transforms.conf v1
TRANSFORMS-full=my_transform
# only with transforms.conf v2
TRANSFORMS-begin=begin
TRANSFORMS-end=end

transforms.conf (version 1):

[my_transform]
REGEX = (?m)\[1\]\[DATA\]BEGIN ---  - (\d{2}:\d{2}:\d{2}).*([\r\n]+)([^\[]*)\[1\]\[DATA\]END.*$[\r\n]*
FORMAT = <time>$1</time>$2$3
WRITE_META = true
DEST_KEY = _raw

transforms.conf (version 2):

[begin]
REGEX = (?m)^\[1\]\[DATA\]BEGIN ---  - (\d{2}:\d{2}:\d{2}).*$
FORMAT = <time>$1</time>
WRITE_META = true
DEST_KEY = _raw

[end]
REGEX = (?m)^\[1\]\[DATA\]END.*$
DEST_KEY = queue
FORMAT = nullQueue

With the various combinations listed here, I got all sorts of results:

well separated events but with square brackets left over
one big block with all events aggregated together and no override of the square bracket lines
one event with the begin square bracket line truncated at 10k characters
4 events with one "time" xml tag but nothing else...

Could anybody help me out with this use case?

Many thanks,

Alex

Alex_LC · ‎12-20-2024

Hi @scelikok

Thanks a lot for your reply, it was most helpful, and it helped me find a solution.

However, I realised that the snippet I had provided differed subtly from the actual data, so I had to adapt your solution slightly. That said, I was under the impression that your regex was not quite right either: I ran it through regex101 first and it only matched the first XML block (I stripped the beginning of the square-bracket line to emulate the line breaker in props.conf).
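One possible explanation for the regex101 result (an assumption, not confirmed in this thread): the trailing `[^$]+` in the suggested regex is a negated character class, so it matches every character except a literal `$`, newlines included. When several blocks are pasted as one text, the first match therefore runs to the end of the input. The Python sketch below uses `re` as a stand-in for the PCRE engine and a shortened version of the sample from the question.

```python
import re

# REGEX from the suggested [transform2xml] stanza.
PATTERN = re.compile(r"([^\[]+)(\[\d+\][\r\n]+<xml>)([^\[]+)(<\/xml>[^$]+)")

# Two shortened blocks with the "[1][DATA]BEGIN ---  - " prefix already
# stripped, pasted as ONE text the way one might do in a regex tester.
PASTED = (
    "06:03:09[012]\n"
    "<xml>\n"
    "    <tag1>value</tag1>\n"
    "</xml>\n"
    "[1][DATA]END   ---  - 06:03:09[012]\n"
    "\n"
    "07:03:09[123]\n"
    "<xml>\n"
    "    <tag1>some stuff</tag1>\n"
    "</xml>\n"
    "[1][DATA]END   ---  - 07:03:09[123]\n"
)

matches = list(PATTERN.finditer(PASTED))
match_count = len(matches)

# The last group of the only match contains the whole second block.
swallowed_second_block = "07:03:09[123]" in matches[0].group(4)

print(match_count, swallowed_second_block)
```

Inside Splunk this should matter less, because the transform runs on each event separately once LINE_BREAKER has split the file.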

So, to recap, here is a more accurate example of the log:

[1][DATA]BEGIN --- - 06:03:09[012]
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <tag1>value</tag1>
  <nestedTag>
    <tag2>another value</tag2>
  </nestedTag>
</root>
[1][DATA]END --- - 06:03:09[012]

[1][DATA]BEGIN --- - 07:03:09[123]
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <tag1>some stuff</tag1>
  <nestedTag>
    <tag2>other stuff</tag2>
  </nestedTag>
</root>
[1][DATA]END --- - 07:03:09[123]

[1][DATA]BEGIN --- - 08:03:09[456]
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <tag1>some more data</tag1>
  <nestedTag>
    <tag2>fooband a bit more</tag2>
  </nestedTag>
</root>
[1][DATA]END --- - 08:03:09[456]

Here is the props.conf I ended up using (as per @scelikok's suggestion):

[my_sourcetype]
LINE_BREAKER = (\[1\]\[DATA\]BEGIN[-\s]+)
SHOULD_LINEMERGE = false
TRANSFORMS-transform2xml = transform2xml
KV_MODE = xml

And here is the corresponding transforms.conf, slightly tweaked - I ended up being a bit more explicit on the end of the event and removed some of the capturing groups:

[transform2xml]
REGEX = ^([^\[]+)\[\d+\][\r\n]+(<\?xml.*>[^\[]+)\[1\]\[DATA\]END --- - [\d:]+\[\d+\][\r\n]*
FORMAT = <time>$1</time>$2
DEST_KEY = _raw

It may not be perfect XML, but it works as expected and the XML is now automatically parsed.
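For anyone who wants to check the regex outside Splunk, here is a minimal Python sketch of what this transform does to a single event. It assumes the event looks like the sample above once LINE_BREAKER has removed the `[1][DATA]BEGIN --- - ` prefix, and it uses Python's `re` and `xml.etree.ElementTree` as stand-ins, not Splunk's own engine or parser.

```python
import re
import xml.etree.ElementTree as ET

# REGEX from the [transform2xml] stanza above, split over two lines for readability.
TRANSFORM = re.compile(
    r"^([^\[]+)\[\d+\][\r\n]+(<\?xml.*>[^\[]+)"
    r"\[1\]\[DATA\]END --- - [\d:]+\[\d+\][\r\n]*"
)

# One event as it is assumed to look after line breaking.
EVENT = (
    "06:03:09[012]\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<root>\n"
    "  <tag1>value</tag1>\n"
    "  <nestedTag>\n"
    "    <tag2>another value</tag2>\n"
    "  </nestedTag>\n"
    "</root>\n"
    "[1][DATA]END --- - 06:03:09[012]\n"
    "\n"
)

m = TRANSFORM.search(EVENT)

# Equivalent of FORMAT = <time>$1</time>$2 written to _raw.
new_raw = "<time>{}</time>{}".format(m.group(1), m.group(2))
print(new_raw)

# A strict parser rejects the result: the XML declaration is no longer at
# the start and there are two top-level elements (<time> and <root>).
try:
    ET.fromstring(new_raw)
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)
```

The strict parser's complaint is the "not perfect XML" part; as noted above, the search-time extraction still copes with it.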

Thanks again for your help @scelikok !


scelikok · ‎12-19-2024

Hi @Alex_LC,

You can try the following:

props.conf

[my_sourcetype]
LINE_BREAKER = (\[1\]\[DATA\]BEGIN[-\s]+)
SHOULD_LINEMERGE = false
TRANSFORMS-transform2xml = transform2xml
KV_MODE = xml

transforms.conf

[transform2xml]
REGEX = ([^\[]+)(\[\d+\][\r\n]+<xml>)([^\[]+)(<\/xml>[^$]+)
FORMAT = <xml><time>$1</time>$3</xml>
DEST_KEY = _raw

It should create a separate event for each block, with a time field, like below:

<xml><time>08:03:09</time>
    <tag1>some more data</tag1>
    <nestedTag>
        <tag2>fooband a bit more</tag2>
    </nestedTag>
</xml>
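To make the mechanics easier to follow, here is a rough Python emulation of the two steps, run on two blocks of the sample from the question. It is only a sketch: `break_events` and `transform` are hypothetical helper names, Python's `re` stands in for Splunk's regex engine, and the line breaking is approximated by discarding whatever the first capturing group of LINE_BREAKER matches.

```python
import re

# Two blocks from the question's sample.
SAMPLE = (
    "[1][DATA]BEGIN ---  - 06:03:09[012]\n"
    "<xml>\n"
    "    <tag1>value</tag1>\n"
    "    <nestedTag>\n"
    "        <tag2>another value</tag2>\n"
    "    </nestedTag>\n"
    "</xml>\n"
    "[1][DATA]END   ---  - 06:03:09[012]\n"
    "\n"
    "[1][DATA]BEGIN ---  - 07:03:09[123]\n"
    "<xml>\n"
    "    <tag1>some stuff</tag1>\n"
    "    <nestedTag>\n"
    "        <tag2>other stuff</tag2>\n"
    "    </nestedTag>\n"
    "</xml>\n"
    "[1][DATA]END   ---  - 07:03:09[123]\n"
)

# Same patterns as in props.conf and transforms.conf above.
LINE_BREAKER = re.compile(r"(\[1\]\[DATA\]BEGIN[-\s]+)")
TRANSFORM = re.compile(r"([^\[]+)(\[\d+\][\r\n]+<xml>)([^\[]+)(<\/xml>[^$]+)")


def break_events(text):
    """Split text into events; the first capturing group is dropped at each break."""
    events, start = [], 0
    for m in LINE_BREAKER.finditer(text):
        events.append(text[start:m.start(1)])
        start = m.end(1)
    events.append(text[start:])
    return [e for e in events if e.strip()]  # skip the empty leading chunk


def transform(event):
    """Rebuild the event the way FORMAT = <xml><time>$1</time>$3</xml> would."""
    m = TRANSFORM.search(event)
    if m is None:
        return event  # no match: the event is left unchanged
    return "<xml><time>{}</time>{}</xml>".format(m.group(1), m.group(3))


events = [transform(e) for e in break_events(SAMPLE)]
for e in events:
    print(e)
    print("-----")
```

Because DEST_KEY = _raw replaces the whole event with the FORMAT result, anything the regex does not capture (the `[012]` counter and the END line here) disappears from the event, which is why the XML body has to sit in a capturing group.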

If this reply helps you, an upvote and "Accept as Solution" is appreciated.

