This challenge was first posted in the Slack #puzzles channel.
For a previous puzzle, I needed a set of fixed-length pipe-delimited events, so I took a public domain data set, which happened to be in XML format, and converted it to the required format. This puzzle has been split into multiple parts. The first part was about preparing the field template by dereferencing the field names so that their positions could be compared. The second part was about using nested loops to process each sequence segment against all the other sequences until the whole sequence is determined. This third part is about determining how wide each field should be (just wide enough to hold the widest value) and formatting the data with the correct justification: numerics are right-justified (space-padded) and non-numerics are left-justified. For example:
<row _id="row-57xu.68ky~aurc" ><hlrslicensenumber>17550</hlrslicensenumber><communityname>Quebec Terrace, 1015</communityname><streetaddress>1015 QUEBEC TER</streetaddress><city>SILVER SPRING</city><zipcode>20903</zipcode><casenumber>191048</casenumber><longitude>-76.988449097</longitude><latitude>39.000106812</latitude><firstinspectiondate>2025-08-20T00:00:00</firstinspectiondate><nextinspectiondate>2026-03-20T00:00:00</nextinspectiondate><inspectionfrequency>1</inspectionfrequency><compliantind>1</compliantind><unitcount>4</unitcount><unitsinspected>5</unitsinspected><averageviolationsperunit>4.4</averageviolationsperunit><severityindex>2.22</severityindex><noviolationsobserved>0</noviolationsobserved><infestedunitspercentage>0.2</infestedunitspercentage><units_with_mold>0</units_with_mold><rating>Troubled</rating></row>

Should become something like:
| 17550|Quebec Terrace, 1015 |1015 QUEBEC TER |SILVER SPRING|20903|191048|-76.988449097|39.000106812|2025-08-20T00:00:00|2026-03-20T00:00:00|1|1| 4|5|4.4|2.22|0|0.2|0|Troubled |

The data set I chose came from Montgomery County, Maryland, specifically the Troubled Properties Analysis. You can retrieve an export from here. There are currently just under 700 rows in the data set.
You will need to remove the headers and ignore (filter out) extraneous elements and attributes.
This article contains spoilers!
In fact, most of this article is a spoiler as it contains partial solutions to the puzzle. If you are trying to solve the puzzle yourself and just want some pointers to get you started, stop reading when you have enough, and return if you get stuck again, or just want to compare your solution to mine!
This puzzle has been split into multiple parts. This third part is about preparing the data in the fields so that it can be used in a fixed-length (and pipe-delimited) event format. To that end, the challenge for this part is to determine the minimum width required for each field to hold all the values for that field, and to determine whether the field should be left- or right-justified. Finally, format the values in each field to the required width and justification.
The exported file from Montgomery will be a complete XML document, with a header and footer. These need to be removed before attempting to upload the file to Splunk. For example, the header might look like this:
<?xml version="1.0" encoding="UTF-8"?><response><rows>

And the footer might look like this:
</rows></response>

Load the file into an editor, such as Notepad++, and remove the header and footer. The file is now ready to upload into a Splunk environment. I would recommend that you do NOT upload the data to a production instance; instead, use a personal or development environment. I uploaded the data to a new index (“montgomery”) with a new sourcetype (“troubled_properties”).
You should now have around 700 events, each with a “_raw” field containing a single XML element called “row”. The “row” element contains a flat structure, i.e. single-level elements.
The recommended way to parse XML is with the spath command. For this data set, spath will not only extract the single-level elements that we are interested in, it will also extract the attributes of fields (including those of the top-level field, “row”), and, depending on which version of the Montgomery data set you are using, you may get some “computed” fields (identified by the prefix “:-”). By inspecting the data set, we can see that all the first-level fields that we are interested in, i.e. those without attributes and computed values, have simple word-character names. This means they can be selected with simple wildcard patterns, so we can use something like this:
| spath
| fields - row*{@*} row.:-*
| fields row.*

All the fields have a “row.” prefix, which makes them easy to refer to.
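To see the filtering in action in isolation, you can run something like the following self-contained sketch against a single hand-built event (the XML here is an invented, cut-down row, not the real data):

```spl
| makeresults
| eval _raw="<row _id=\"row-1\"><city>SILVER SPRING</city><unitcount>4</unitcount></row>"
| spath
| fields - row*{@*} row.:-*
| fields row.*
```

Only row.city and row.unitcount should survive; the attribute field extracted from _id is matched by the row*{@*} pattern and discarded.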
In order to create an event which is fixed length and justified, we need to determine the maximum width of each field, and whether all the data is numeric or not. So, for each of our events, we can get the length of the data in each field and check whether the data is numeric, something like this:
| foreach row.*
[| eval len_<<FIELD>>=len('<<FIELD>>')
| eval padding_<<FIELD>>=if(isnum('<<FIELD>>'),"","-")]

For the non-numerics, we save a padding value of “-”. This will become clear shortly.
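To experiment with the length and padding logic without uploading the full data set, here is a minimal sketch that builds one synthetic event (the field values are invented) and runs the same foreach over it:

```spl
| makeresults
| eval _raw="<row><city>SILVER SPRING</city><unitcount>4</unitcount></row>"
| spath
| foreach row.*
    [| eval len_<<FIELD>>=len('<<FIELD>>')
     | eval padding_<<FIELD>>=if(isnum('<<FIELD>>'),"","-")]
| table len_* padding_*
```

Since len("SILVER SPRING") is 13, you should see len_row.city=13, with len_row.unitcount=1.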
Now we need to determine the overall width and padding value for each field. For the width, this is simply the maximum value of the corresponding len_ field across all the events, and for the padding, it is simply the minimum value of the corresponding padding_ field across all the events.
| eventstats max(len_*) as len_* min(padding_*) as padding_*
| fillnull value=""

The default, an empty padding string, is to assume numeric, i.e. right-justified.
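As a quick sanity check of the eventstats step on its own, here is a self-contained sketch using invented city values:

```spl
| makeresults count=3
| streamstats count AS n
| eval city=case(n=1, "SILVER SPRING", n=2, "ROCKVILLE", n=3, "BETHESDA")
| eval len_city=len(city)
| eval padding_city="-"
| eventstats max(len_*) AS len_* min(padding_*) AS padding_*
```

After the eventstats, every event should carry len_city=13, the length of the longest value (“SILVER SPRING”), rather than its own per-event length.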
We now have sufficient information in the len_* and padding_* fields to be able to reformat the data fields with the correct width and justification.
| foreach len_*
[| eval <<FIELD>>="%".'padding_<<MATCHSEG1>>'.'<<FIELD>>'."s"
| eval <<MATCHSEG1>>=printf('<<FIELD>>','<<MATCHSEG1>>')]

With the printf function, when the width specifier is preceded by a minus (-), the field is left-justified, hence the reason for saving “-” as the padding value earlier.
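If you want to check the printf behaviour in isolation, a minimal sketch (the width of 13 and the value are invented for illustration):

```spl
| makeresults
| eval left=printf("%-13s", "ROCKVILLE")
| eval right=printf("%13s", "ROCKVILLE")
| eval display="[" . left . "] [" . right . "]"
```

The left value should be padded with trailing spaces out to 13 characters, and the right value with leading spaces, matching the left- and right-justification used for the final events.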
You should now have all the data in the required format to put into the final events.
Have questions or thoughts? Comment on this article or in the Slack #puzzles channel, whichever you prefer.