Solved: Help with RegEx - Select only XML nodes that conta...

adcon82 · ‎03-16-2020

Hello Everyone,

I'm trying to put together a regex statement that will allow me to select only the XML nodes that contain values. In the actual data, there are tons of XML nodes, some may have data, some may not. Instead of defining all of them individually, I'd like to make it more dynamic. My thought was to use a regex in order to select only the nodes that have values, and then use a table * type command to send what's pulled back to a table. If there is a better way to do this using spath or xpath, I'm all ears!

So far, I can achieve what I want to if the XML is on individual lines, using the expression below. The problem is, the XML is streamed, and this expression will not work for streamed XML. I spent a few hours trying to get a working regex to no avail. Any help is greatly appreciated!

Regex that works for XML on individual lines. This will omit empty tags and select all other values.
(.+)>(?![<\/]).+

<root>
<Node1>Value1</Node1>
<Node2>Value2</Node2>
<Node3></Node3>
<Node4></Node4>
<Node5>Value3</Node5>
<Node6>Value4</Node6>
</root>

Actual Data is contained all on one line - Unable to get a regex that can do what is being done above.

<root><Node1>Value1</Node1><Node2>Value2</Node2><Node3></Node3><Node4></Node4><Node5>Value3</Node5><Node6>Value4</Node6></root>

manjunathmeti · ‎03-16-2020

hi @adcon82,

Use rex command with sed mode to remove nodes without any values. Try this:

| makeresults 
| eval _raw="<root><Node1>Value1</Node1><Node2>Value2</Node2><Node3></Node3><Node4></Node4><Node5>Value3</Node5><Node6>Value4</Node6></root>" | append [| makeresults 
| eval _raw="<root><Node1>Value1</Node1><Node2>Value2</Node2><Node3></Node3><Node4>Value4</Node4><Node5></Node5><Node6></Node6></root>" ] 
| rex mode=sed "s/<Node\d+><\/Node\d+>//g"

View solution in original post

adcon82 · ‎03-16-2020

I found an article that helped me narrow down my table results. Below is the code for reference if anybody comes across this article. I've only ran a few base tests, but so far it seems to narrow my table results to what was returned from the search.

|streamstats count as temp_id
|stats values() by temp_id
|rename values() as *
|table *

adcon82 · ‎03-16-2020

Well, that seemed to perform the substitute as expected. I only get back about a dozen fields that only contain data in my search. The log file itself for this entry contains about 50. However, when I run a "table *" command after the search, it outputs all 50 columns to my table, instead of just the fields showing up on the search. Not sure why it would show the full column list, when my search is omitting these columns.

manjunathmeti · ‎03-16-2020

hi @adcon82,

Use rex command with sed mode to remove nodes without any values. Try this:

| makeresults 
| eval _raw="<root><Node1>Value1</Node1><Node2>Value2</Node2><Node3></Node3><Node4></Node4><Node5>Value3</Node5><Node6>Value4</Node6></root>" | append [| makeresults 
| eval _raw="<root><Node1>Value1</Node1><Node2>Value2</Node2><Node3></Node3><Node4>Value4</Node4><Node5></Node5><Node6></Node6></root>" ] 
| rex mode=sed "s/<Node\d+><\/Node\d+>//g"

adcon82 · ‎03-16-2020

Thank you for the response! But what if we don't know the name of the node. Live data in this case contains hundreds of nodes, whose names will vary.

adcon82 · ‎03-16-2020

I think I figured it out! If you see any issues with this, please let me know.

rex mode=sed "s/<[a-zA-Z]\w+><\/[a-zA-Z]\w+> //g"

manjunathmeti · ‎03-16-2020

You can just do this. \w+ matches one or more alpha-numeric characters.

| rex mode=sed "s/<\w+><\/\w+>//g"

Please accept answer if it is working for you.

adcon82 · ‎03-16-2020

Works great, thanks for the tip. I'll remove the extraneous code.

Now it seems to be a problem with the table command ran after that. I assumed if the search I run after the regex returned only the columns that contained data, I could just run "table *". However, when running it, it's pulling all columns, even the empty ones, total of about 50. This seems strange considering my search is only pulling back the dozen tables that contain data.

manjunathmeti · ‎03-16-2020

Yes, since fields are already extracted in index, they'll show up in table even if they are empty in all rows. If you don't want see fields with no values in all rows then use stats:

| stats values(*) as * by _time

adcon82 · ‎03-16-2020

Thanks so much for all of the help manjunathmeti!!

Help with RegEx - Select only XML nodes that contain values

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Why Splunk Customers Should Attend Cisco Live 2026 Las Vegas

What Is the Name of the USB Key Inserted by Bob Smith? (BOTS Hint, Not the Answer)

Automating Threat Operations and Threat Hunting with Recorded Future

Join the Conversation