Re: Extracting XML value with <> using regex

michaelrosello · ‎07-21-2019

I need a regex that captures the field name in the first group and the field value in the second group.

The regex I created fails when the field value itself contains "<" and ">". Example below.

<Recommendation><![CDATA[<p><ul><li>Remove all backup files, binary archives, alternate versions of files, and test files from the web document root of production servers.</li><li>Amend your deployment policy to include the removal of these file types by an administrator.</li></ul></p>]]></Recommendation>

The regex and the sample data I am dealing with are here: https://regex101.com/r/Pr0Xag/2

This is the regex I am currently using:

<([^>]+)>([^<]*)<\/\1>

transforms.conf

[xml-extr11]
REGEX = <([^>]+)>([^<]*)<\/\1>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true

[setnull]
REGEX = <VulnSummary>
DEST_KEY = queue
FORMAT = nullQueue

props.conf

[nexpose_appspider]
TRANSFORMS-null= setnull
BREAK_ONLY_BEFORE = <Vuln>
NO_BINARY_CHECK = true
TIME_FORMAT = %Y-%m-%d %H:%M:%S
TIME_PREFIX = <ScanDate>
MAX_TIMESTAMP_LOOKAHEAD = 19
TRUNCATE = 0
disabled = false
pulldown_type = true
REPORT-xmlext11 = xml-extr11
KV_MODE = none
MAX_EVENTS = 400
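
To see why the value group stops working, here is a minimal Python sketch of the same pattern. The two sample strings are shortened, hypothetical stand-ins for the data in the question, and Python's `re` module is only used for illustration; it is not what Splunk runs.

```python
import re

# The pattern from the question: a tag name, then a value containing no "<".
ORIGINAL = re.compile(r"<([^>]+)>([^<]*)</\1>")

# Shortened, hypothetical samples modelled on the data in the thread.
plain = "<NETWORK_ID>2020</NETWORK_ID>"
cdata = "<Recommendation><![CDATA[<p>Remove backup files.</p>]]></Recommendation>"

plain_pairs = ORIGINAL.findall(plain)
cdata_pairs = ORIGINAL.findall(cdata)

print(plain_pairs)
print(cdata_pairs)
```

Because `[^<]*` cannot cross a `<`, the outer `Recommendation` tag never matches once its value starts with `<![CDATA[`. The only pair found in the second sample is the inner `p` tag from the HTML inside the CDATA section.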

thomasroulet · ‎07-23-2019

Hello,

Update your transforms.conf:

[xml-extr11]
REGEX = <([^>]+)>(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|([^<]*)))<\/\1>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true

It will extract the desired fields.
I corrected my previous regex.
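
The idea behind this regex (try a CDATA branch first, fall back to a plain value) can be sketched in Python. Python's `re` module has no `(?|...)` branch-reset group, so this is a rewritten approximation with two separate capture groups, not the Splunk pattern itself. The lazy `.*?`, the `re.DOTALL` flag, the helper name `extract_pairs`, and the sample string are all assumptions made for the illustration.

```python
import re

# Either a CDATA section (captured without its wrapper) or a plain value.
# Python's re lacks the (?|...) branch-reset group, so the two alternatives
# use separate capture groups (2 and 3) instead of sharing one.
CDATA_AWARE = re.compile(
    r"<([^>]+)>(?:<!\[CDATA\[(.*?)\]\]>|([^<]*))</\1>",
    re.DOTALL,
)

def extract_pairs(text):
    """Return (tag, value) pairs; the value comes from whichever branch matched."""
    pairs = []
    for m in CDATA_AWARE.finditer(text):
        value = m.group(2) if m.group(2) is not None else m.group(3)
        pairs.append((m.group(1), value))
    return pairs

# Shortened, hypothetical sample modelled on the data in the thread.
sample = (
    "<Recommendation><![CDATA[<p>Remove backup files.</p>]]></Recommendation>"
    "<NETWORK_ID>2020</NETWORK_ID>"
)
pairs = extract_pairs(sample)
print(pairs)
```

In this sketch the `Recommendation` value keeps its inner HTML, and plain tags such as `NETWORK_ID` still match through the second branch.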

phepales · ‎02-16-2021

Hi Thomas,

This regex works for my data too, but it does not work when a tag does not close on the same line.

For example:

<NETWORK_ID>2020</NETWORK_ID>
<NETBIOS>
<![CDATA[WWWW107]]>
</NETBIOS>
<OS>
<![CDATA[Windows 2003 R2]]>
</OS>

Here it works for the NETWORK_ID tag but not for the NETBIOS and OS tags.

When I remove the white space from the tags, it works.

Can you please suggest how to update the regex accordingly?

Thanks in Advance!!

scelikok · ‎02-17-2021

Hi @phepales,

You can use the regex below:

[xml-extr11]
REGEX = <([^>]+)>(?:\n)?(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|([^<]*)))(?:\n)?<\/\1>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true

If this reply helps you, an upvote and "Accept as Solution" is appreciated.
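
As a rough analogy for what `FORMAT = $1::$2` combined with `MV_ADD = true` produces, the Python sketch below takes group 1 as the field name and group 2 as the value, and keeps every value when a name repeats. This is an illustration only, not Splunk's implementation. The function name `apply_transform`, the sample event, and the `PORT` tag are hypothetical.

```python
import re
from collections import defaultdict

# Simple name/value pattern from the original question (no CDATA handling).
PAIR = re.compile(r"<([^>]+)>([^<]*)</\1>")

def apply_transform(raw):
    """Rough analogy: group 1 names the field, group 2 is its value."""
    fields = defaultdict(list)
    for name, value in PAIR.findall(raw):  # every match, not just the first
        fields[name].append(value)         # repeated names accumulate values
    return dict(fields)

# Hypothetical event; the PORT tag is invented for the illustration.
event = "<OS>Windows 2008</OS><PORT>80</PORT><PORT>443</PORT>"
fields = apply_transform(event)
print(fields)
```

Each distinct tag becomes one field, and the repeated `PORT` tag ends up as a single field holding both values.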

phepales · ‎02-18-2021

Hi scelikok,

Thanks scelikok for your help!!!

Actually, there is white space, as you can see in the screenshot, and the regex you provided does not work in this case. Can you please help me with it?

<NETWORK_ID>2050</NETWORK_ID>
<DNS>

<![CDATA[wwwwwwwww93]]>
</DNS>
<NETBIOS>
<![CDATA[WWWW93]]>
</NETBIOS>
<OS>
<![CDATA[Windows 2008]]>
</OS>

Many Thanks in Advance!!

phepales · ‎02-18-2021

Hi scelikok,

This regex worked for me after I made some changes:

<([^>]+)>(?:\n\s*)?(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|([^<]*)))(?:\n\s*)?<\/\1>

Thanks!!!
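
The difference between allowing one optional newline and allowing a run of white space around the value can be sketched in Python. As above, this is an approximation: Python's `re` has no `(?|...)` group, so the value uses two capture groups, and the lazy quantifiers, the `re.DOTALL` flag, and the names `ONE_NEWLINE`, `ANY_SPACE`, and `extract_pairs` are assumptions made for the illustration. The sample follows the data posted earlier in the thread.

```python
import re

# CDATA branch (group 2) or plain-value branch (group 3).
VALUE = r"(?:<!\[CDATA\[(.*?)\]\]>|([^<]*?))"

# At most one newline on each side of the value.
ONE_NEWLINE = re.compile(r"<([^>]+)>\n?" + VALUE + r"\n?</\1>", re.DOTALL)
# Any run of white space (blank lines, indentation) on each side.
ANY_SPACE = re.compile(r"<([^>]+)>\s*" + VALUE + r"\s*</\1>", re.DOTALL)

def extract_pairs(pattern, text):
    """Return (tag, value) pairs using whichever value branch matched."""
    pairs = []
    for m in pattern.finditer(text):
        value = m.group(2) if m.group(2) is not None else m.group(3)
        pairs.append((m.group(1), value))
    return pairs

sample = (
    "<NETWORK_ID>2050</NETWORK_ID>\n"
    "<DNS>\n"
    "\n"
    "<![CDATA[wwwwwwwww93]]>\n"
    "</DNS>\n"
    "<NETBIOS>\n"
    "<![CDATA[WWWW93]]>\n"
    "</NETBIOS>\n"
    "<OS>\n"
    "<![CDATA[Windows 2008]]>\n"
    "</OS>\n"
)

strict = extract_pairs(ONE_NEWLINE, sample)
loose = extract_pairs(ANY_SPACE, sample)
print(strict)
print(loose)
```

With the single optional newline, the `DNS` tag is skipped because a blank line sits between the opening tag and the CDATA section. The white-space-tolerant version picks up all four tags.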

thomasroulet · ‎09-05-2019

@michaelrosello
Did you solve your extraction problem?
Did the answers help? If so, don't forget to accept an answer and vote.

thomasroulet · ‎07-22-2019

Hello,

You can use this regex, which I tested with your test string on regex101:

<([^>]+)>(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|(.*)))<\/\1>

If you perform the extraction at search time, you could use this:

`| union
[| makeresults | eval xml="<![CDATA[
Remove all backup files, binary archives, alternate versions of files, and test files from the web document root of production servers.
Amend your deployment policy to include the removal of these file types by an administrator.
]]>"],
[| makeresults | eval xml="D9327888CC8545948C8D62D4FF515BDE"],
[| makeresults | eval xml="<![CDATA[
A backup file was discovered. Binary archives or application files with an alternate file extension may expose source code and application logic to an attacker. If a script's file extension does not match an application extension (such as .asp, .jsp, or .php), then the server usually considers the file equivalent to plain text. When this happens, the server presents the user with the raw source code of the file instead of executing the script and providing interpreted output.
Depending on the content of the script file, the exposure of data varies between simple function calls to database connection credentials to administration passwords.

File archives such as .tgz, .tar.gz, or .zip files should never be stored within the web application's document root. If these files contain an archive of the application's source code, then it will be trivial for an attacker to download and examine the code.]]>"]

| rex field=xml "<(?<key>[^>]+)>(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?<value>|(.*))\]\]>)|(?<value>|(.*)))<\/\1>"`

The important part is the last line (the rex command).

michaelrosello · ‎07-23-2019

I've tried this, and it works in regex101 but not in Splunk. I suspect it is because the regex takes too many steps?

thomasroulet · ‎07-23-2019

In transforms.conf:

[testxml]
SOURCE_KEY = _raw
REGEX = <([^>]+)>(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|(.*)))<\/\1>
FORMAT = $1::$2

At search time, assuming the data is in _raw:
| extract testxml

At index time, in props.conf:

[yoursourcetype]
TRANSFORMS-testxml = testxml

thomasroulet · ‎07-23-2019

Do you want to extract at index time or at search time?
Could you post a complete event as an example?

michaelrosello · ‎07-23-2019

Here is the complete event. I also updated the question with my props and transforms.
https://regex101.com/r/Pr0Xag/6
