I agree with @richgalloway response here for this particular example. However if you want a more generalize approach to parsing out a table being ingested in Splunk with this format in _raw (capable of accounting for different headers and not having to build out a regex for each time) you may want to try the SPL below. <base_search>
``` Doing this before an mvexpand occurring later in the search pipeline can help attribute the expanded data back to it's orginal event (if needed) ```
| streamstats
count as event_count
``` Extract Info from table title (this is the one line that would need to be adjusted for each unique table depending on title format and data you would want extracted from it ```
| rex field=_raw "Disk\s+Utilization\s+for\s+(?<Server>[^\s]+)\s+in\s+(?<Region>[^\s]+)\s+(?<Environment>[^\s]+)\s+\-(?<Server_IP>[^\s]+)"
``` Parse individual rows from the table as a multivalued field ```
| spath input=_raw path=table.tr output=data
| fields - _raw
``` mvexpand each row to it's own event ```
| mvexpand data
``` replace tags with standardized breakers for parsing ```
| eval
item_data=replace(replace(replace(replace('data', "\<\/t(?:h|d)\>\<t(?:h|d)[^\>]*\>", "<BREAK>"), "^\<t(?:h|d)[^\>]*\>", "<START>"), "\<\/t(?:h|d)\>", "<END>"), "(\<START\>|\<BREAK\>)(\<BREAK\>|\<END\>)", "\1#NULL#\2")
``` set a row count for each table ```
| streamstats
count as row_number
by event_count
``` assign row type depending on the row number (assumes that the table has a header) ```
| eval
row_number='row_number'-1,
row_type=if(
'row_number'==0,
"header",
"data"
),
``` remove start and end tags ```
data_mv=split(replace(replace(item_data, "\<START\>", ""), "\<END\>", ""), "<BREAK>"),
``` trim of any leading or tailing spaces from each entry in the table ```
data_mv=case(
mvcount(data_mv)==1, trim(data_mv, " "),
mvcount(data_mv)>1, mvmap(data_mv, trim(data_mv, " "))
)
| fields - data, item_data
``` pull header fieldnames into each row of the table to stitch it all together ```
| eventstats
list(eval(case('row_type'=="header", 'data_mv'))) as header
by event_count
``` mvzip to attribute each table entry to the header field (we will loop through this field to build the table further down the search pipeline) ```
| eval
data_mv_zip=mvzip(header, data_mv, "<DELIM>")
| fields - header, data_mv
``` we no longer need the row that only holds the header info now that it has been pulled into each data entry row of the table ```
| where NOT 'row_type'=="header"
``` initialize json object that will hold KV pairs of header_fieldname/row_entry ```
| eval
table_json=json_object()
``` loop through the MV field and build the json_object containing all KV assignments ```
| foreach mode=multivalue data_mv_zip
[
| eval
key=mvindex(split('<<ITEM>>', "<DELIM>"), 0),
value=mvindex(split('<<ITEM>>', "<DELIM>"), 1),
table_json=json_set(table_json, 'key', 'value')
]
| fields - key, value, row_type, data_mv_zip
``` spath the json_object just built ```
| spath input=table_json
``` remove the json_object now that we dont need it ```
| fields - table_json
``` list out data in the order that you want it displayed ```
| table _time, event_count, Server, Region, Environment, Server_IP, "Filesystem", "Type", "Blocks", "Used %", "Available %", "Usage %", "Mounted on"
``` remove any placeholder #NULL# values with actual null() values since table has been generated ```
| foreach *
[
| eval
<<FIELD>>=if(
'<<FIELD>>'=="#NULL#",
null(),
'<<FIELD>>'
)
] You can see the output I got locally using this method below. This method should also account for null values in a the html table as well, whereas the regex shared from your original post makes the assumption that table entries always have a leading and trailing space wrapping a non-null value. Which is true for this table you shared, but for generalizing parsing a html table you may run into issues doing it that way. As a proof on concept here is some output of some simulation data I put together with 3 separate events, each with their own html table and sets of data with their own headers and piping them through this method. Below is the SPL used to generate this screenshot. | makeresults
| eval
_raw="<p align='center'><font size='4' color=blue>Example: Table 1</font></p>
<table align=center border=2>
<tr style=background-color:#2711F0 ><th>sample_field_1</th><th>sample_field_2</th><th>sample_field_3</th><th>sample_field_4</th></tr>
<tr><td>table1_row1_column1</td><td></td><td>table1_row1_column3</td><td></td></tr>
<tr><td>table1_row2_column1</td><td>table1_row2_column2</td><td>table1_row2_column3</td><td>table1_row2_column4</td></tr>
<tr><td>table1_row3_column1</td><td></td><td>table1_row3_column3</td><td>table1_row3_column4</td></tr>
<tr><td>table1_row4_column1</td><td>table1_row4_column2</td><td>table1_row4_column3</td><td>table1_row4_column4</td></tr>
<tr><td></td><td></td><td></td><td>table1_row5_column4</td></tr>
</table></body></html>"
| append
[
| makeresults
| eval
_raw="<p align='center'><font size='4' color=blue>Example: Table 2</font></p>
<table align=center border=2>
<tr style=background-color:#2711F0 ><th>sample_field_5</th><th>sample_field_6</th><th>sample_field_7</th></tr>
<tr><td></td><td>table2_row1_column2</td><td>table2_row1_column3</td></tr>
<tr><td>table2_row2_column1</td><td>table2_row2_column2</td><td>table2_row2_column3</td></tr>
<tr><td>table2_row3_column1</td><td></td><td>table2_row3_column3</td></tr>
<tr><td></td><td>table2_row4_column2</td><td></td></tr>
<tr><td>table2_row5_column1</td><td>table2_row5_column2</td><td>table2_row5_column3</td></tr>
</table></body></html>"
]
| append
[
| makeresults
| eval
_raw="<p align='center'><font size='4' color=blue>Example: Table 3</font></p>
<table align=center border=2>
<tr style=background-color:#2711F0 ><th>sample_field_1</th><th>sample_field_2</th><th>sample_field_3</th><th>sample_field_4</th><th>sample_field_5</th><th>sample_field_6</th></tr>
<tr><td>table3_row1_column1</td><td></td><td>table1_row1_column3</td><td></td><td>table1_row1_column5</td><td>table1_row1_column6</td></tr>
<tr><td>table3_row2_column1</td><td>table3_row2_column2</td><td>table1_row2_column3</td><td>table3_row2_column4</td><td>table1_row2_column5</td><td>table1_row2_column6</td></tr>
</table></body></html>"
]
``` Extract Table Titles ```
| spath input=_raw path=p.font output=table_title
| streamstats
count as event_count
| spath input=_raw path=table.tr output=data
| fields - _raw
| mvexpand data
| eval
item_data=replace(replace(replace(replace('data', "\<\/t(?:h|d)\>\<t(?:h|d)[^\>]*\>", "<BREAK>"), "^\<t(?:h|d)[^\>]*\>", "<START>"), "\<\/t(?:h|d)\>", "<END>"), "(\<START\>|\<BREAK\>)(\<BREAK\>|\<END\>)", "\1#NULL#\2")
| streamstats
count as row_number
by event_count
| eval
row_number='row_number'-1,
row_type=if(
'row_number'==0,
"header",
"data"
),
data_mv=split(replace(replace(item_data, "\<START\>", ""), "\<END\>", ""), "<BREAK>"),
data_mv=case(
mvcount(data_mv)==1, trim(data_mv, " "),
mvcount(data_mv)>1, mvmap(data_mv, trim(data_mv, " "))
)
| fields - data, item_data
| eventstats
list(eval(case('row_type'=="header", 'data_mv'))) as header
by event_count
| eval
data_mv_zip=mvzip(header, data_mv, "<DELIM>")
| fields - header, data_mv
| where NOT 'row_type'=="header"
| eval
table_json=json_object()
| foreach mode=multivalue data_mv_zip
[
| eval
key=mvindex(split('<<ITEM>>', "<DELIM>"), 0),
value=mvindex(split('<<ITEM>>', "<DELIM>"), 1),
table_json=json_set(table_json, 'key', 'value')
]
| fields - key, value, row_type, data_mv_zip
| spath input=table_json
| fields - table_json
| fields + _time, table_title, event_count, sample_field_*
| foreach *
[
| eval
<<FIELD>>=if(
'<<FIELD>>'=="#NULL#",
null(),
'<<FIELD>>'
)
]
... View more