[Puzzles] Solve, Learn, Repeat: Dereferencing XML to Fixed-length events

ITWhisperer

This challenge was first posted on Slack #puzzles channel

For a previous puzzle, I needed a set of fixed-length pipe-delimited events, so I took a public domain data set, which happened to be in XML format and converted it to the required format. The overall aim of this puzzle is to convert XML event to fixed-length events, and it has been split into multiple parts. This first part is about preparing the field template so that it can be used to place the data in the correct order in the fixed-length (and pipe-delimited) events. To that end, the challenge for this part is to take some XML events and determine the correct order that the fields appear in. The approach requires determining, for each event, where each field exists in the sequence and comparing it with the position of every other field by dereferencing the field names to find their positions. An example would be the following set of events:

<row num="1600"><Mercury>0</Mercury><Venus>0</Venus><Earth>1</Earth></row>
<row num="1625"><Jupiter>97</Jupiter></row>
<row num="1675"><Saturn>274</Saturn></row>
<row num="1800"><Saturn>274</Saturn><Uranus>29</Uranus></row>
<row num="1850"><Saturn>274</Saturn><Neptune>16</Neptune></row>
<row num="1875"><Uranus>29</Uranus></row>
<row num="1900"><Earth>1</Earth><Mars>2</Mars><Jupiter>97</Jupiter><Saturn>274</Saturn></row>
<row num="1950"><Jupiter>97</Jupiter><Uranus>29</Uranus><Neptune>16</Neptune></row>
<row num="1975"><Jupiter>97</Jupiter><Saturn>274</Saturn></row>
<row num="2000"><Jupiter>97</Jupiter><Saturn>274</Saturn><Uranus>29</Uranus><Neptune>16</Neptune></row>

For example, Mercury has a lower index in the sequence, i.e. is to the left of Venus, which has a lower index in the sequence, i.e. is to the left of Earth, and, in a different sequence, Jupiter has a lower index, i.e. is to the left of Saturn.

Create a tilde-delimited template for all the fields in this set of XML events.

For a partial example

Mercury~Venus~Earth

An optional bonus question is to explain what the data represents.

This article contains spoilers!

In fact, most of this article is a spoiler as it contains partial solutions to the puzzle. If you are trying to solve the puzzle yourself and just want some pointers to get you started, stop reading when you have enough, and return if you get stuck again, or just want to compare your solution to mine!

Preparing the field template

This puzzle has been split into multiple parts. This first part is about preparing the field template so that it can be used to place the data in the correct order in the fixed-length (and pipe-delimited) events. To that end, the challenge for this part is to take some XML events and determine the correct order that the fields appear in. A simple example would be the following set of events:

<row num="1"><john>L</john><paul>M</paul><ringo>S</ringo></row>
<row num="2"><john>L</john><george>H</george><ringo>S</ringo></row>
<row num="3"><john>L</john><paul>M</paul><george>H</george></row>

Here, john has a lower index in the sequence, i.e. is to the left of paul, and paul has a lower index in the sequence, i.e. is to the left of ringo.

A more complex example would be the following set of events:

<row num="1600"><Mercury>0</Mercury><Venus>0</Venus><Earth>1</Earth></row>
<row num="1625"><Jupiter>97</Jupiter></row>
<row num="1675"><Saturn>274</Saturn></row>
<row num="1800"><Saturn>274</Saturn><Uranus>29</Uranus></row>
<row num="1850"><Saturn>274</Saturn><Neptune>16</Neptune></row>
<row num="1875"><Uranus>29</Uranus></row>
<row num="1900"><Earth>1</Earth><Mars>2</Mars><Jupiter>97</Jupiter><Saturn>274</Saturn></row>
<row num="1950"><Jupiter>97</Jupiter><Uranus>29</Uranus><Neptune>16</Neptune></row>
<row num="1975"><Jupiter>97</Jupiter><Saturn>274</Saturn></row>
<row num="2000"><Jupiter>97</Jupiter><Saturn>274</Saturn><Uranus>29</Uranus><Neptune>16</Neptune></row>

For example, Mercury has a lower index in the sequence, i.e. is to the left of Venus which has a lower index in the sequence, i.e. is to the left of Earth, and Jupiter has a lower index in the sequence, i.e. is to the left of Saturn. Create a tilde-delimited template for the fields in these sets of XML events.

Field names

By inspecting the data set, we can see that all the first level fields that we are interested in have simple word character names. This means that they can be extracted using a regular expression.

| rex max_match=0 "\<(?<fields>\w+)\>"

This creates a multi-value field (fields) with the fieldnames in the order they are found in the event. By joining these values (using a neutral delimiter, I chose a tilde (~) so it does not cause complications with regular expressions further down the line), we can create an initial template for each event.

| eval fieldnames=mvjoin(fields,"~")

Looking at these fieldname templates, you can see that they are not all the same. This is because not all fields are present in all the events.

Fortunately, fieldnames are not repeated in any single template, and there are enough examples of different templates to be able to determine the full and correct order. This is not always guaranteed as duplicated fieldnames and/or a smaller data set could present problems. For example, consider the following two events:

<row num="1"><john>L</john><paul>M</paul><ringo>S</ringo></row>
<row num="2"><john>L</john><george>H</george><ringo>S</ringo></row>

These would have the following templates:

john~paul~ringo
john~george~ringo

Trying to combine these sequence, which of these would represent the full template?

john~paul~george~ringo
john~george~paul~ringo

With these two events, there is insufficient information to determine this. But add a further event example:

<row num="3"><john>L</john><paul>M</paul><george>H</george></row>

With the following template:

john~paul~george

The full template can now be determined, even when none of the individual events has this complete template:

john~paul~george~ringo

Planets in order

Turning to Splunk now, by taking the planet data as an example, we can create a template for the fields present in each event:

| makeresults format=csv data="row
<row num=\"1600\"><Mercury>0</Mercury><Venus>0</Venus><Earth>1</Earth></row>
<row num=\"1625\"><Jupiter>97</Jupiter></row>
<row num=\"1675\"><Saturn>274</Saturn></row>
<row num=\"1800\"><Saturn>274</Saturn><Uranus>29</Uranus></row>
<row num=\"1850\"><Saturn>274</Saturn><Neptune>16</Neptune></row>
<row num=\"1875\"><Uranus>29</Uranus></row>
<row num=\"1900\"><Earth>1</Earth><Mars>2</Mars><Jupiter>97</Jupiter><Saturn>274</Saturn></row>
<row num=\"1950\"><Jupiter>97</Jupiter><Uranus>29</Uranus><Neptune>16</Neptune></row>
<row num=\"1975\"><Jupiter>97</Jupiter><Saturn>274</Saturn></row>
<row num=\"2000\"><Jupiter>97</Jupiter><Saturn>274</Saturn><Uranus>29</Uranus><Neptune>16</Neptune></row>"
``` Find all relevant field names ```
| rex field=row max_match=0 "\<(?<fields>\w+)\>"
``` Create a sequence of fields used in the event ```
| eval fieldnames=mvjoin(fields,"~")

This contains the following information:

fieldnames	fields
Mercury~Venus~Earth	Mercury Venus Earth
Jupiter	Jupiter
Saturn	Saturn
Saturn~Uranus	Saturn Uranus
Saturn~Neptune	Saturn Neptune
Uranus	Uranus
Earth~Mars~Jupiter~Saturn	Earth Mars Jupiter Saturn
Jupiter~Uranus~Neptune	Jupiter Uranus Neptune
Jupiter~Saturn	Jupiter Saturn
Jupiter~Saturn~Uranus~Neptune	Jupiter Saturn Uranus Neptune

To get the names of the planets as fields, place this information in a chart with an initial value of zero:

``` Chart the fields used in the fieldname sequences ```
| eval minimum=0
| chart max(minimum) by fieldnames fields

fieldnames	Earth	Jupiter	Mars	Mercury	Neptune	Saturn	Uranus	Venus
Earth~Mars~Jupiter~Saturn	0	0	0			0
Jupiter		0
Jupiter~Saturn		0				0
Jupiter~Saturn~Uranus~Neptune		0			0	0	0
Jupiter~Uranus~Neptune		0			0		0
Mercury~Venus~Earth	0			0				0
Saturn						0
Saturn~Neptune					0	0
Saturn~Uranus						0	0
Uranus							0

Find the index of each named field in each field sequence:

``` Find the index of each field named in each sequence ```
| eval fields=split(fieldnames,"~")
| rename fieldnames as _fieldnames, fields as _fields
| foreach *
    [| eval <<FIELD>>=mvfind(_fields,"<<FIELD>>")]

Note the rename of fieldnames and fields, so that they do not get picked up by the * on the foreach command.

Earth	Jupiter	Mars	Mercury	Neptune	Saturn	Uranus	Venus
0	2	1			3
	0
	0				1
	0			3	1	2
	0			2		1
2			0				1
					0
				1	0
					0
						0

Notice that some planets have multiple entries and that not all the entries for a planet are the same. This is because the planet appears in more than one sequence and that there may be different numbers of planets appearing before it in the sequence. So, which is right? In fact, nearly all of them are wrong. How do we find the correct order?

Finding the right order

To find the right order, we can start with the maximum value for each planet, i.e. the one representing where it is furthest to the right in any sequence.

``` Find the maximum index of each field named in each sequence i.e. the furthest to the right that the field appears in any sequence ```
| eventstats max(*) as *

Earth	Jupiter	Mars	Mercury	Neptune	Saturn	Uranus	Venus
2	2	1	0	3	3	2	1

Rows removed from tables to save space.

So far, we have been dealing with the sequences as a whole; now we need to deal with each field in each sequence separately. What we would really like to do is find the value of the index for the first field named in the sequence, and add 1 to it for each of the remaining fields listed in the sequence.

Reverse dereferencing

Unfortunately, SPL does not have a direct way to dereference the field name to find the corresponding field value. So, we will have to try a different approach. Start by expanding out the field names used in each sequence so we can process them separately:

``` Split out the fieldnames used in each sequence ```
| rename _fieldnames as fieldnames, _fields as fields
| mvexpand fields

fieldnames	Earth	Jupiter	Mars	Mercury	Neptune	Saturn	Uranus	Venus	fields
Earth~Mars~Jupiter~Saturn	2	2	1	0	3	3	2	1	Earth
Earth~Mars~Jupiter~Saturn	2	2	1	0	3	3	2	1	Mars
Earth~Mars~Jupiter~Saturn	2	2	1	0	3	3	2	1	Jupiter
Earth~Mars~Jupiter~Saturn	2	2	1	0	3	3	2	1	Saturn

Find the position of each field in its sequence (as an index):

``` Find the current position of the field in its sequence ```
| streamstats count as _position by fieldnames
``` Convert to index ```
| eval _position=_position-1

Earth	Jupiter	Mars	Mercury	Neptune	Saturn	Uranus	Venus	fields	position
2	2	1	0	3	3	2	1	Earth	0
2	2	1	0	3	3	2	1	Mars	1
2	2	1	0	3	3	2	1	Jupiter	2
2	2	1	0	3	3	2	1	Saturn	3

For each sequence, find the current rightmost position of the starting field (effectively, dereferencing the first named planet field):

``` For each sequence, find the current rightmost position of the start ```
| rename fieldnames as _fieldnames, fields as _fields
| fields - _start
| foreach *
    [| eval _start=if("<<FIELD>>"=_fields AND _position=0,<<FIELD>>,_start)]
| eventstats min(_start) as _start by _fieldnames

Earth	Jupiter	Mars	Mercury	Neptune	Saturn	Uranus	Venus	fields	position	start
2	2	1	0	3	3	2	1	Earth	0	2
2	2	1	0	3	3	2	1	Mars	1	2
2	2	1	0	3	3	2	1	Jupiter	2	2
2	2	1	0	3	3	2	1	Saturn	3	2

Calculate the new position based on the relative position in the sequence and the rightmost position of the first field

``` Calculate the new position based on the relative position in the sequence and the rightmost position of the first field ```
| eval _new_position=_start+_position

Earth	Jupiter	Mars	Mercury	Neptune	Saturn	Uranus	Venus	fields	new position	position	start
2	2	1	0	3	3	2	1	Earth	2	0	2
2	2	1	0	3	3	2	1	Mars	3	1	2
2	2	1	0	3	3	2	1	Jupiter	4	2	2
2	2	1	0	3	3	2	1	Saturn	5	3	2

When the new position is greater than the field's current rightmost position, update the field's position:

``` When the new position is greater than the field's current rightmost position, update the field's position ```
| foreach *
    [| eval <<FIELD>>=if("<<FIELD>>"=_fields AND _new_position><<FIELD>>,_new_position,<<FIELD>>)]

Earth	Jupiter	Mars	Mercury	Neptune	Saturn	Uranus	Venus	fields	new position	position	start
2	2	1	0	3	3	2	1	Earth	2	0	2
2	2	3	0	3	3	2	1	Mars	3	1	2
2	4	1	0	3	3	2	1	Jupiter	4	2	2
2	2	1	0	3	5	2	1	Saturn	5	3	2

Find the new maximum index of each field named in each sequence

``` Find the new maximum index of each field named in each sequence ```
| eventstats max(*) as *

Earth	Jupiter	Mars	Mercury	Neptune	Saturn	Uranus	Venus	fields	new position	position	start
2	4	3	0	5	5	4	1	Earth	2	0	2
2	4	3	0	5	5	4	1	Mars	3	1	2
2	4	3	0	5	5	4	1	Jupiter	4	2	2
2	4	3	0	5	5	4	1	Saturn	5	3	2

Are we done yet?

There are a couple of ways to tell whether we have found the correct order yet (which can be done visually, or as a stretch exercise). Firstly, more than one planet has the same position as another, and secondly, the maximum index position is less than 7. Since we do not appear to have finished, I will leave you to work out which steps need to be repeated, and by how many times.

Generating the final template?

Once all the necessary repetitions have been done, we need to convert the information into a template. This can be done by finding the final position for each field and sorting them.

``` Find the final position for each field and sort them ```
| rename _fieldnames as fieldnames
| untable fieldnames field position
| stats values(position) as position by field
| sort 0 position

Create a template from the sorted list of fields

``` Create a template from the sorted list of fields ```
| stats list(field) as fields
| eval fields=mvjoin(fields,"~")

You should now have a process for generating the template required for the planets data set.

Have questions or thoughts? Comment on this article or in Slack #puzzles channel. Whichever you prefer.

[Puzzles] Solve, Learn, Repeat: Dereferencing XML to Fixed-length events

Preparing the field template

Field names

Planets in order

Finding the right order

Reverse dereferencing

Are we done yet?

Generating the final template?

Accelerating Observability as Code with the Splunk AI Assistant

Integrating Splunk Search API and Quarto to Create Reproducible Investigation ...

Congratulations to the 2025-2026 SplunkTrust!

Join the Conversation

[Puzzles] Solve, Learn, Repeat: Dereferencing XML to Fixed-length events

Preparing the field template

Field names

Planets in order

Finding the right order

Reverse dereferencing

Are we done yet?

Generating the final template?

Accelerating Observability as Code with the Splunk AI Assistant

Integrating Splunk Search API and Quarto to Create Reproducible Investigation ...

Congratulations to the 2025-2026 SplunkTrust!