For a previous puzzle, I needed some sample data, and while researching for this, I came across the data I was interested in, but it was in an HTML Table. This inspired me to create this puzzle. The challenge is to take an HTML table and convert it to a Splunk table. The HTML (your input) looks like this:
<table><tbody>
<tr><td colspan="5">1980s</td></tr>
<tr><th>Imaged</th><th>Published</th><th>Name</th><th>Designation</th><th>Discoverers</th></tr>
<tr><td>30 December 1985</td><td>9 January 1986</td><td>Puck</td><td>Uranus XV</td><td rowspan="7">Synnott, Voyager 2</td></tr>
<tr><td rowspan="2">3 January 1986</td><td rowspan="6">16 January 1986</td><td>Juliet</td><td>Uranus XI</td></tr>
<tr><td>Portia</td><td>Uranus XII</td></tr>
<tr><td>9 January 1986</td><td>Cressida</td><td>Uranus IX</td></tr>
<tr><td rowspan="3">13 January 1986</td><td>Desdemona</td><td>Uranus X</td></tr>
<tr><td>Rosalind</td><td>Uranus XIII</td></tr>
<tr><td>Belinda</td><td>Uranus XIV</td></tr>
<tr><td rowspan="2">20 January 1986</td><td rowspan="3">27January 1986</td><td>Cordelia</td><td>Uranus VI</td><td rowspan="2">Terrile, Voyager 2</td></tr>
<tr><td>Ophelia</td><td>Uranus VII</td></tr>
<tr><td>23 January 1986</td><td>Bianca</td><td>Uranus VIII</td><td>Smith, Voyager 2</td></tr>
<tr><th>Imaged</th><th>Published</th><th>Name</th><th>Designation</th><th>Discoverers</th></tr>
</tbody></table>This can be crudely rendered like this:
+------------------+-----------------+-----------+-------------+--------------------+
| 1980s |
+==================+=================+===========+=============+====================+
| Imaged | Published | Name | Designation | Discoverers |
+==================+=================+===========+=============+====================+
| 30 December 1985 | 9 January 1986 | Puck | Uranus XV | Synnott, Voyager 2 |
+------------------+-----------------+-----------+-------------+ +
| 3 January 1986 | 16 January 1986 | Juliet | Uranus XI | |
+ + +-----------+-------------+ +
| | | Portia | Uranus XII | |
+------------------+ +-----------+-------------+ +
| 9 January 1986 | | Cressida | Uranus IX | |
+------------------+ +-----------+-------------+ +
| 13 January 1986 | | Desdemona | Uranus X | |
+ + +-----------+-------------+ +
| | | Rosalind | Uranus XIII | |
+ + +-----------+-------------+ +
| | | Belinda | Uranus XIV | |
+------------------+-----------------+-----------+-------------+--------------------+
| 20 January 1986 | 27 January 1986 | Cordelia | Uranus VI | Terrile, Voyager 2 |
+ + +-----------+-------------+ +
| | | Ophelia | Uranus VII | |
+------------------+ +-----------+-------------+--------------------+
| 23 January 1986 | | Bianca | Uranus VIII | Smith, Voyager 2 |
+==================+=================+===========+=============+====================+
| Imaged | Published | Name | Designation | Discoverers |
+==================+=================+===========+=============+====================+The challenge is to parse the HTML data, unmerge the cells and produce a table like this (using the header data for the field names):
| Imaged | Published | Name | Designation | Discoverers |
| 1980s | 1980s | 1980s | 1980s | 1980s |
| 30 December 1985 | 9 January 1986 | Puck | Uranus XV | Synnott, Voyager 2 |
| 3 January 1986 | 16 January 1986 | Juliet | Uranus XI | Synnott, Voyager 2 |
| 3 January 1986 | 16 January 1986 | Portia | Uranus XII | Synnott, Voyager 2 |
| 9 January 1986 | 16 January 1986 | Cressida | Uranus IX | Synnott, Voyager 2 |
| 13 January 1986 | 16 January 1986 | Desdemona | Uranus X | Synnott, Voyager 2 |
| 13 January 1986 | 16 January 1986 | Rosalind | Uranus XIII | Synnott, Voyager 2 |
| 13 January 1986 | 16 January 1986 | Belinda | Uranus XIV | Synnott, Voyager 2 |
| 20 January 1986 | 27 January 1986 | Cordelia | Uranus VI | Terrile, Voyager 2 |
| 20 January 1986 | 27 January 1986 | Ophelia | Uranus VII | Terrile, Voyager 2 |
| 23 January 1986 | 27 January 1986 | Bianca | Uranus VIII | Smith, Voyager 2 |
This article contains spoilers!
In fact, the whole article is a spoiler as it contains solutions to the puzzle. If you are trying to solve the puzzle yourself and just want some pointers to get you started, stop reading when you have enough, and return if you get stuck again, or just want to compare your solution to mine!
As you probably already know, HTML table data is structured hierarchically:
<table>
<tbody>
<tr>
<td>
Cell data
</td>
</tr>
</tbody>
</table>The tr (table row) elements can be repeated multiple times, and the td (table data) elements can be repeated multiple times and/or replaced by th (table header) elements.
The td (and th) elements can have optional spanning attributes to declare how many columns or rows the data is spread across (similar to merging cells in a spreadsheet).
Given that the target format is a table with rows and columns, we need to extract the rows from the HTML data, then extract the cell data for each column from the row data.
| makeresults
| fields - _time
| eval _raw="<table><tbody>
<tr><td colspan=\"5\">1980s</td></tr>
<tr><th>Imaged</th><th>Published</th><th>Name</th><th>Designation</th><th>Discoverers</th></tr>
<tr><td>30 December 1985</td><td>9 January 1986</td><td>Puck</td><td>Uranus XV</td><td rowspan=\"7\">Synnott, Voyager 2</td></tr>
<tr><td rowspan=\"2\">3 January 1986</td><td rowspan=\"6\">16 January 1986</td><td>Juliet</td><td>Uranus XI</td></tr>
<tr><td>Portia</td><td>Uranus XII</td></tr>
<tr><td>9 January 1986</td><td>Cressida</td><td>Uranus IX</td></tr>
<tr><td rowspan=\"3\">13 January 1986</td><td>Desdemona</td><td>Uranus X</td></tr>
<tr><td>Rosalind</td><td>Uranus XIII</td></tr>
<tr><td>Belinda</td><td>Uranus XIV</td></tr>
<tr><td rowspan=\"2\">20 January 1986</td><td rowspan=\"3\">27 January 1986</td><td>Cordelia</td><td>Uranus VI</td><td rowspan=\"2\">Terrile, Voyager 2</td></tr>
<tr><td>Ophelia</td><td>Uranus VII</td></tr>
<tr><td>23 January 1986</td><td>Bianca</td><td>Uranus VIII</td><td>Smith, Voyager 2</td></tr>
<tr><th>Imaged</th><th>Published</th><th>Name</th><th>Designation</th><th>Discoverers</th></tr>
</tbody></table>"
``` Extract each row and expand into separate events ```
| rex max_match=0 "(?ms)(?<row>\<tr.*?\<\/tr\>)"
| fields - _raw
| mvexpand rowNext, give each row a unique identifier
``` Give each row a unique identifier ```
| streamstats count as rownumNow, extract the cell data from each row event
``` Extract the cell data from the row and expand into separate events ```
| rex field=row max_match=0 "(?ms)(?<data>\<(th|td).*?\<\/\2\>)"
| fields - row
| mvexpand dataThere could be spanning attributes, so extract those
``` Extract the spanning information ```
| rex field=data "colspan=\"(?<colspan>\d+)\""
| rex field=data "rowspan=\"(?<rowspan>\d+)\""In order to unmerge a column, we simply have to replicate the column data the requisite number of times.
``` Replicate cells which span multiple columns ```
| eval expand=mvrange(0, colspan)
| mvexpand expandNow, give each column within the row a unique identifier
``` Give each column in the row a unique identifier ```
| streamstats count as colnum by rownumNote that this identifies the column within the HTML table row, and it may not be the final column number in the target Splunk table because cells from above may have been spanned and actually appear before this cell.
This is where things start to get tricky. When cells span multiple rows in HTML tables, there is no corresponding td element in the HTML tr structure in the spanned rows. This means that the unique identifier which is based on the count of td elements in the original row may not accurately reflect the desired column position. When rows are unmerged, cells are added to subsequent rows, but the columns in the rows have already been unmerged and tagged with unique identifiers! How do we know which order the td elements should appear in the row when the rows are unmerged?
The first thing to do is to replicate the cell data the requisite number of times
``` Replicate cells which span multiple rows ```
| eval expand=mvrange(0, rowspan)
| mvexpand expandCalculate a target row number based on original row number and whether it is part of a row span expansion
``` Calculate target row number ```
| eval new_rownum=rownum+expandTo collate the row data, we need to know how many column fields are needed
``` Find the maximum number of columns ```
| eventstats max(colnum) as max_columnsCreate a multi-value field with indexes for each of the column fields
``` Create a multi--value field with indexes for each column ```
| eval max_columns=mvrange(0,max_columns)Duplicate the multi-value field and expand it to produce an event for each column in each row
``` Duplicate and expand column indexes ```
| eval columns=max_columns
| mvexpand columnsSince we need the fields to be named from the th elements, extract these names and create a list. If there are multiple rows with th elements, only names from the first row will be used.
``` Extract header names from the data and list them for each cell ```
| rex field=data "\<th>(?<th>[^<]+)"
| eventstats list(th) as column by columnsNow that we have extracted the field names from the th elements, these events are no longer required
``` Remove header rows as they are now redundant ```
| where isnull(th)Find the corresponding column name using the unique identifier for the column as an index into the list of possible column field names
``` Find the corresponding column name based on the unique identifier for the column ```
| eval colname=mvindex(column,columns)Create a new field using the unique identifier for the column (to maintain the order of the fields) and the discovered column field name
``` Create a new field based on the unique identifier and the column name holding the column index ```
| eval f_{columns}_{colname}=columnsNote that this works because there are no more than 10 fields (indexed 0 - 9). Any more than this, and the unique identifier would need to be zero-padded to maintain the order lexicographically.
Compress the cell information, gathering the cell data, the column field identifiers in the column field name fields and the multi-value field with all the column identifiers, by row and column
``` Compress the cell information ```
| stats values(data) as data values(f_*) as f_* values(max_columns) as max_columns by rownum colnum new_rownumSort the events by original row number and column number to maintain correct precedence, i.e. the order in which the original rows (tr) were defined in the input
``` Sort by original row identifier and column identifier to ensure correct precedence when unmerging cells ```
| sort 0 rownum colnumCompress the row information using lists for column numbers, and data, and values for all the column identifiers and column field name fields by target row number
``` Compress the row information ```
| stats list(colnum) as colnum list(data) as data values(max_columns) as max_columns values(f_*) as f_* by new_rownumNotice that the rows are sorted lexicographically, so re-sort to fix that
``` Notice that the rows are sorted lexicographically so let us fix that ```
| sort 0 new_rownumWe now have sufficient information about each row to be able to build the table. For each column field, we need to find the data associated with the left-most column, including those which have been spanned down from prior rows, where the column position is not too far to the right.
Using nested loops (as previously used in another puzzle solution), we can search through the list of column numbers until we find one which does not exceed the column number we are looking for.
When a suitable column is found, update the field (removing HTML tags in the process); remove the found data from the list of remaining (unclaimed) data; and, remove the corresponding original column number from the list of remaining (unclaimed) column numbers.
``` For each of the final fields ```
| foreach f_*
[
``` Create an ordered list of data where data for a column from a prior row takes precedence ```
``` Set a flag to start the process of finding the correct data for the field ```
| eval found=0
``` For each column index ```
| foreach mode=multivalue max_columns
[
``` Since multivalue mode only allows one command, there is a lot going on here:
If still looking for field data:
If current field index + 1 >= original column number:
Update flag to say we have found the data index we are looking for
If found the data index:
Update field with the data we found, removing HTML tags at the same time
Remove the found data from the remaining (unclaimed) data list
Remove the found original column number from the remaining (unclaimed) column number list
Update flag to say we have finished processing for this field
```
| eval found=if(found=0 AND <<FIELD>>+1>=tonumber(mvindex(colnum,<<ITEM>>)),1,found), <<FIELD>>=if(found=1,replace(mvindex(data,<<ITEM>>),"(\<td[^>]*>)(.*)(\<\/td>)","\2"),<<FIELD>>), data=if(found=1,if(<<ITEM>>=0,mvindex(data,1,-1),mvappend(mvindex(data,0,<<ITEM>>-1),mvindex(data,<<ITEM>>+1,-1))),data), colnum=if(found=1,if(<<ITEM>>=0,mvindex(colnum,1,-1),mvappend(mvindex(colnum,0,<<ITEM>>-1),mvindex(colnum,<<ITEM>>+1,-1))),colnum), found=if(found=1,2,found)
]
]Now, all the data is in the correct column field so we can discard all the other fields
``` Since all the data has been moved, reduce the events to just the fields we are interested in ```
| table f_*The fields we are left with contain the column names from the td elements but they also contain the column numbers so we want to rename them. Since we know how many fields we have, we could hard-code each rename
| rename f_0_* as *, f_1_* as *, …However, this should be done without hard-coding. We could try copying the fields to fields with just the header name part (which is in the second match segment)
| foreach f_*_*
[| eval _name="<<MATCHSEG2>>"
| eval {_name}=<<FIELD>>]
| fields - f_*Unfortunately, this does not quite work because the fields in the table are re-ordered lexicographically.
Instead, we can transpose the events twice, modifying the column name in between
``` Instead, we can use transpose twice, modifying the column field name in between ```
| transpose 0 column_name=header
| eval header=mvindex(split(header,"_"),2)
| transpose 0 header_field=header
| fields - columnThis gives the renamed fields whilst maintaining the order they were listed in the th elements.
Note that this double transpose method works because of the limited number of rows and columns.
In summary, the key to solving this puzzle is to recognise how column spanning and row spanning are encoded in HTML tables, and how precedence can be used to determine the order of the cells in each row.
Have questions or thoughts? Comment on this article or in Slack #puzzles channel. Whichever you prefer.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.