
[Puzzles] Solve, Learn, Repeat: Unmerging HTML Tables

ITWhisperer
SplunkTrust


For a previous puzzle, I needed some sample data. While researching, I found the data I was interested in, but it was in an HTML table, which inspired this puzzle. The challenge is to take an HTML table and convert it to a Splunk table. The HTML (your input) looks like this:

<table><tbody>
  <tr><td colspan="5">1980s</td></tr>
  <tr><th>Imaged</th><th>Published</th><th>Name</th><th>Designation</th><th>Discoverers</th></tr>
  <tr><td>30 December 1985</td><td>9 January 1986</td><td>Puck</td><td>Uranus XV</td><td rowspan="7">Synnott, Voyager 2</td></tr>
  <tr><td rowspan="2">3 January 1986</td><td rowspan="6">16 January 1986</td><td>Juliet</td><td>Uranus XI</td></tr>
  <tr><td>Portia</td><td>Uranus XII</td></tr>
  <tr><td>9 January 1986</td><td>Cressida</td><td>Uranus IX</td></tr>
  <tr><td rowspan="3">13 January 1986</td><td>Desdemona</td><td>Uranus X</td></tr>
  <tr><td>Rosalind</td><td>Uranus XIII</td></tr>
  <tr><td>Belinda</td><td>Uranus XIV</td></tr>
  <tr><td rowspan="2">20 January 1986</td><td rowspan="3">27January 1986</td><td>Cordelia</td><td>Uranus VI</td><td rowspan="2">Terrile, Voyager 2</td></tr>
  <tr><td>Ophelia</td><td>Uranus VII</td></tr>
  <tr><td>23 January 1986</td><td>Bianca</td><td>Uranus VIII</td><td>Smith, Voyager 2</td></tr>
  <tr><th>Imaged</th><th>Published</th><th>Name</th><th>Designation</th><th>Discoverers</th></tr>
</tbody></table>

This can be crudely rendered like this:

+------------------+-----------------+-----------+-------------+--------------------+
|                                         1980s                                     |
+==================+=================+===========+=============+====================+
| Imaged           | Published       | Name      | Designation | Discoverers        |
+==================+=================+===========+=============+====================+
| 30 December 1985 | 9 January 1986  | Puck      | Uranus XV   | Synnott, Voyager 2 |
+------------------+-----------------+-----------+-------------+                    +
| 3 January 1986   | 16 January 1986 | Juliet    | Uranus XI   |                    |
+                  +                 +-----------+-------------+                    +
|                  |                 | Portia    | Uranus XII  |                    |
+------------------+                 +-----------+-------------+                    +
| 9 January 1986   |                 | Cressida  | Uranus IX   |                    |
+------------------+                 +-----------+-------------+                    +
| 13 January 1986  |                 | Desdemona | Uranus X    |                    |
+                  +                 +-----------+-------------+                    +
|                  |                 | Rosalind  | Uranus XIII |                    |
+                  +                 +-----------+-------------+                    +
|                  |                 | Belinda   | Uranus XIV  |                    |
+------------------+-----------------+-----------+-------------+--------------------+
| 20 January 1986  | 27 January 1986 | Cordelia  | Uranus VI   | Terrile, Voyager 2 |
+                  +                 +-----------+-------------+                    +
|                  |                 | Ophelia   | Uranus VII  |                    |
+------------------+                 +-----------+-------------+--------------------+
| 23 January 1986  |                 | Bianca    | Uranus VIII | Smith, Voyager 2   |
+==================+=================+===========+=============+====================+
| Imaged           | Published       | Name      | Designation | Discoverers        |
+==================+=================+===========+=============+====================+

The challenge is to parse the HTML data, unmerge the cells and produce a table like this (using the header data for the field names):

+------------------+-----------------+-----------+-------------+--------------------+
| Imaged           | Published       | Name      | Designation | Discoverers        |
+------------------+-----------------+-----------+-------------+--------------------+
| 1980s            | 1980s           | 1980s     | 1980s       | 1980s              |
| 30 December 1985 | 9 January 1986  | Puck      | Uranus XV   | Synnott, Voyager 2 |
| 3 January 1986   | 16 January 1986 | Juliet    | Uranus XI   | Synnott, Voyager 2 |
| 3 January 1986   | 16 January 1986 | Portia    | Uranus XII  | Synnott, Voyager 2 |
| 9 January 1986   | 16 January 1986 | Cressida  | Uranus IX   | Synnott, Voyager 2 |
| 13 January 1986  | 16 January 1986 | Desdemona | Uranus X    | Synnott, Voyager 2 |
| 13 January 1986  | 16 January 1986 | Rosalind  | Uranus XIII | Synnott, Voyager 2 |
| 13 January 1986  | 16 January 1986 | Belinda   | Uranus XIV  | Synnott, Voyager 2 |
| 20 January 1986  | 27 January 1986 | Cordelia  | Uranus VI   | Terrile, Voyager 2 |
| 20 January 1986  | 27 January 1986 | Ophelia   | Uranus VII  | Terrile, Voyager 2 |
| 23 January 1986  | 27 January 1986 | Bianca    | Uranus VIII | Smith, Voyager 2   |
+------------------+-----------------+-----------+-------------+--------------------+

This article contains spoilers!

In fact, the whole article is a spoiler, as it contains solutions to the puzzle. If you are trying to solve the puzzle yourself and just want some pointers to get you started, stop reading when you have enough, and come back if you get stuck or simply want to compare your solution to mine!

HTML Table data

As you probably already know, HTML table data is structured hierarchically:

<table>
	<tbody>
		<tr>
			<td>
				Cell data
			</td>
		</tr>
	</tbody>
</table>

The tr (table row) elements can be repeated multiple times, and the td (table data) elements can be repeated multiple times and/or replaced by th (table header) elements.

The td (and th) elements can have optional spanning attributes to declare how many columns or rows the data is spread across (similar to merging cells in a spreadsheet).

Given that the target format is a table with rows and columns, we need to extract the rows from the HTML data, then extract the cell data for each column from the row data.

| makeresults
| fields - _time
| eval _raw="<table><tbody>
  <tr><td colspan=\"5\">1980s</td></tr>
  <tr><th>Imaged</th><th>Published</th><th>Name</th><th>Designation</th><th>Discoverers</th></tr>
  <tr><td>30 December 1985</td><td>9 January 1986</td><td>Puck</td><td>Uranus XV</td><td rowspan=\"7\">Synnott, Voyager 2</td></tr>
  <tr><td rowspan=\"2\">3 January 1986</td><td rowspan=\"6\">16 January 1986</td><td>Juliet</td><td>Uranus XI</td></tr>
  <tr><td>Portia</td><td>Uranus XII</td></tr>
  <tr><td>9 January 1986</td><td>Cressida</td><td>Uranus IX</td></tr>
  <tr><td rowspan=\"3\">13 January 1986</td><td>Desdemona</td><td>Uranus X</td></tr>
  <tr><td>Rosalind</td><td>Uranus XIII</td></tr>
  <tr><td>Belinda</td><td>Uranus XIV</td></tr>
  <tr><td rowspan=\"2\">20 January 1986</td><td rowspan=\"3\">27 January 1986</td><td>Cordelia</td><td>Uranus VI</td><td rowspan=\"2\">Terrile, Voyager 2</td></tr>
  <tr><td>Ophelia</td><td>Uranus VII</td></tr>
  <tr><td>23 January 1986</td><td>Bianca</td><td>Uranus VIII</td><td>Smith, Voyager 2</td></tr>
  <tr><th>Imaged</th><th>Published</th><th>Name</th><th>Designation</th><th>Discoverers</th></tr>
</tbody></table>"
``` Extract each row and expand into separate events ```
| rex max_match=0 "(?ms)(?<row>\<tr.*?\<\/tr\>)"
| fields - _raw
| mvexpand row

Next, give each row a unique identifier

``` Give each row a unique identifier ```
| streamstats count as rownum

Now, extract the cell data from each row event

``` Extract the cell data from the row and expand into separate events ```
| rex field=row max_match=0 "(?ms)(?<data>\<(th|td).*?\<\/\2\>)"
| fields - row
| mvexpand data

There could be spanning attributes, so extract those

``` Extract the spanning information ```
| rex field=data "colspan=\"(?<colspan>\d+)\""
| rex field=data "rowspan=\"(?<rowspan>\d+)\""

Unmerging columns

In order to unmerge a column, we simply have to replicate the column data the requisite number of times.

``` Replicate cells which span multiple columns ```
| eval expand=mvrange(0, colspan)
| mvexpand expand
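To see the replication mechanic in isolation, here is a minimal standalone sketch (separate from the main search) using the 1980s cell, which has colspan=5: mvrange(0, 5) produces the values 0 to 4, and mvexpand then emits five copies of the event.

``` Standalone sketch: replicate an event once per spanned column ```
| makeresults
| fields - _time
| eval data="1980s", colspan=5
| eval expand=mvrange(0, colspan)
| mvexpand expand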

Now, give each column within the row a unique identifier

``` Give each column in the row a unique identifier ```
| streamstats count as colnum by rownum

Note that this identifies the column within the HTML table row; it may not be the final column number in the target Splunk table, because cells spanned down from rows above may appear before this cell.

Unmerging rows

This is where things start to get tricky. When cells span multiple rows in HTML tables, there is no corresponding td element in the HTML tr structure in the spanned rows. This means that the unique identifier which is based on the count of td elements in the original row may not accurately reflect the desired column position. When rows are unmerged, cells are added to subsequent rows, but the columns in the rows have already been unmerged and tagged with unique identifiers! How do we know which order the td elements should appear in the row when the rows are unmerged?

The first thing to do is to replicate the cell data the requisite number of times

``` Replicate cells which span multiple rows ```
| eval expand=mvrange(0, rowspan)
| mvexpand expand

Calculate a target row number based on original row number and whether it is part of a row span expansion

``` Calculate target row number ```
| eval new_rownum=rownum+expand
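For example, the Synnott, Voyager 2 cell sits in the third tr (rownum=3) with rowspan="7", so expand takes the values 0 to 6 and new_rownum covers target rows 3 to 9, i.e. the rows from Puck down to Belinda. A minimal standalone sketch of that calculation:

``` Standalone sketch: target row numbers for a cell spanning seven rows ```
| makeresults
| fields - _time
| eval data="Synnott, Voyager 2", rownum=3, rowspan=7
| eval expand=mvrange(0, rowspan)
| mvexpand expand
| eval new_rownum=rownum+expand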

Collating the row data

To collate the row data, we need to know how many column fields are needed

``` Find the maximum number of columns ```
| eventstats max(colnum) as max_columns

Create a multi-value field with indexes for each of the column fields

``` Create a multi-value field with indexes for each column ```
| eval max_columns=mvrange(0,max_columns)

Duplicate the multi-value field and expand it to produce an event for each column in each row

``` Duplicate and expand column indexes ```
| eval columns=max_columns
| mvexpand columns

Since we need the fields to be named from the th elements, extract these names and create a list. If there are multiple rows with th elements, only names from the first row will be used, because the column index applied later never reaches past the first header row's entries in the list.

``` Extract header names from the data and list them for each cell ```
| rex field=data "\<th>(?<th>[^<]+)"
| eventstats list(th) as column by columns

Now that we have extracted the field names from the th elements, these events are no longer required

``` Remove header rows as they are now redundant ```
| where isnull(th)

Find the corresponding column name using the unique identifier for the column as an index into the list of possible column field names

``` Find the corresponding column name based on the unique identifier for the column ```
| eval colname=mvindex(column,columns)
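As a quick illustration of the indexing (a standalone sketch with hard-coded header names, not the state of the main search, where column is built by eventstats): a column index of 2 picks the third name, Name.

``` Standalone sketch: pick a header name by zero-based column index ```
| makeresults
| fields - _time
| eval column=split("Imaged,Published,Name,Designation,Discoverers", ",")
| eval columns=2
| eval colname=mvindex(column, columns)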

Create a new field using the unique identifier for the column (to maintain the order of the fields) and the discovered column field name

``` Create a new field based on the unique identifier and the column name holding the column index ```
| eval f_{columns}_{colname}=columns

Note that this works because there are no more than 10 fields (indexed 0 - 9). Any more than this, and the unique identifier would need to be zero-padded to maintain the order lexicographically.
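For a wider table, here is a hedged sketch of one way to zero-pad the identifier (an assumption about how the solution could be extended, not something this puzzle needs); the value assigned stays the unpadded index, as before.

``` Hypothetical variant: zero-pad the column identifier (two digits, so up to 100 columns) so lexicographic order matches numeric order ```
| eval padded=substr("0" . columns, -2)
| eval f_{padded}_{colname}=columns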

Compress the cell information by row and column, gathering the cell data, the column field name fields (which hold the column identifiers), and the multi-value field of all column indexes

``` Compress the cell information ```
| stats values(data) as data values(f_*) as f_* values(max_columns) as max_columns by rownum colnum new_rownum

Sort the events by original row number and column number to maintain correct precedence, i.e. the order in which the original rows (tr) were defined in the input

``` Sort by original row identifier and column identifier to ensure correct precedence when unmerging cells ```
| sort 0 rownum colnum

Compress the row information by target row number, using lists for the column numbers and data, and values for the column indexes and the column field name fields

``` Compress the row information ```
| stats list(colnum) as colnum list(data) as data values(max_columns) as max_columns values(f_*) as f_* by new_rownum

Notice that the rows are sorted lexicographically, so re-sort to fix that

``` Notice that the rows are sorted lexicographically so let us fix that ```
| sort 0 new_rownum

Building the table

We now have sufficient information about each row to be able to build the table. For each column field, we need to find the left-most remaining piece of data whose original column position does not lie to the right of the column we are filling, including cells which have been spanned down from prior rows. For example, when filling the Imaged field in the row containing Cressida, the spanned-down Discoverers and Published cells (original columns 5 and 2) are skipped before the 9 January 1986 cell (original column 1) is claimed.

Using nested loops (as previously used in another puzzle solution), we can search through the list of column numbers until we find one which does not exceed the column number we are looking for.

When a suitable column is found, update the field (removing HTML tags in the process); remove the found data from the list of remaining (unclaimed) data; and, remove the corresponding original column number from the list of remaining (unclaimed) column numbers.

``` For each of the final fields ```
| foreach f_*
    [
    ``` Create an ordered list of data where data for a column from a prior row takes precedence ```
    ``` Set a flag to start the process of finding the correct data for the field ```
    | eval found=0
    ``` For each column index ```
    | foreach mode=multivalue max_columns
        [
        ``` Since multivalue mode only allows one command, there is a lot going on here:
        If still looking for field data:
            If current field index + 1 >= original column number:
                Update flag to say we have found the data index we are looking for
            If found the data index:
                Update field with the data we found, removing HTML tags at the same time
                Remove the found data from the remaining (unclaimed) data list
                Remove the found original column number from the remaining (unclaimed) column number list
                Update flag to say we have finished processing for this field
        ```
        | eval found=if(found=0 AND <<FIELD>>+1>=tonumber(mvindex(colnum,<<ITEM>>)),1,found), <<FIELD>>=if(found=1,replace(mvindex(data,<<ITEM>>),"(\<td[^>]*>)(.*)(\<\/td>)","\2"),<<FIELD>>), data=if(found=1,if(<<ITEM>>=0,mvindex(data,1,-1),mvappend(mvindex(data,0,<<ITEM>>-1),mvindex(data,<<ITEM>>+1,-1))),data), colnum=if(found=1,if(<<ITEM>>=0,mvindex(colnum,1,-1),mvappend(mvindex(colnum,0,<<ITEM>>-1),mvindex(colnum,<<ITEM>>+1,-1))),colnum), found=if(found=1,2,found)
        ]
    ]

Now, all the data is in the correct column field, so we can discard all the other fields

``` Since all the data has been moved, reduce the events to just the fields we are interested in ```
| table f_*

Tidying up with double transpose

The fields we are left with contain the column names from the th elements, but they also contain the column numbers, so we want to rename them. Since we know how many fields we have, we could hard-code each rename

| rename f_0_* as *, f_1_* as *, …

However, this should be done without hard-coding. We could try copying the fields to fields with just the header name part (which is in the second match segment)

| foreach f_*_*
    [| eval _name="<<MATCHSEG2>>"
    | eval {_name}=<<FIELD>>]
| fields - f_*

Unfortunately, this does not quite work because the fields in the table are re-ordered lexicographically.

Instead, we can transpose the events twice, modifying the column name in between

``` Instead, we can use transpose twice, modifying the column field name in between ```
| transpose 0 column_name=header
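``` header now holds the original field names, such as f_0_Imaged; keep just the header name after the second underscore ```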
| eval header=mvindex(split(header,"_"),2)
| transpose 0 header_field=header
| fields - column

This gives the renamed fields whilst maintaining the order they were listed in the th elements.

Note that this double transpose method works because of the limited number of rows and columns.

Summary

In summary, the key to solving this puzzle is to recognise how column spanning and row spanning are encoded in HTML tables, and how precedence can be used to determine the order of the cells in each row.

Have questions or thoughts? Comment on this article or in the Slack #puzzles channel, whichever you prefer.
