Splunk Search

Extracting XML value with <> using regex

michaelrosello
Path Finder

I need a regex whose first capture group matches the field name and whose second capture group matches the field value.

The regex I created fails when the field value itself contains "<" and ">". Example below:

<Recommendation><![CDATA[<p><ul><li>Remove all backup files, binary archives, alternate versions of files, and test files from the web document root of production servers.</li><li>Amend your deployment policy to include the removal of these file types by an administrator.</li></ul></p>]]></Recommendation>

The regex and the sample data I am dealing with are available here: https://regex101.com/r/Pr0Xag/2

This is the regex I am currently using:

<([^>]+)>([^<]*)<\/\1>

transforms.conf

[xml-extr11]
REGEX = <([^>]+)>([^<]*)<\/\1>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true

[setnull]
REGEX = <VulnSummary>
DEST_KEY = queue
FORMAT = nullQueue

props.conf

[nexpose_appspider]
TRANSFORMS-null= setnull
BREAK_ONLY_BEFORE = <Vuln>
NO_BINARY_CHECK = true
TIME_FORMAT = %Y-%m-%d %H:%M:%S
TIME_PREFIX = <ScanDate>
MAX_TIMESTAMP_LOOKAHEAD = 19
TRUNCATE = 0
disabled = false
pulldown_type = true
REPORT-xmlext11 = xml-extr11
KV_MODE = none
MAX_EVENTS = 400
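
The failure can be reproduced outside Splunk. The following is only a minimal sketch in Python, assuming Python's `re` module behaves like Splunk's PCRE engine for this particular pattern. The `<ID>` element is hypothetical, and the `<Recommendation>` value is shortened from the example above.

```python
import re

# The regex from the question: group 1 is the tag name, group 2 the value.
PATTERN = re.compile(r"<([^>]+)>([^<]*)</\1>")

# Hypothetical sample: one plain value and one CDATA value containing markup.
sample = (
    "<ID>42</ID>\n"
    "<Recommendation><![CDATA[<p><ul><li>Remove all backup files.</li></ul></p>]]>"
    "</Recommendation>"
)

# findall returns one (name, value) tuple per match.
fields = PATTERN.findall(sample)
print(fields)
```

Because `[^<]*` stops at the first `<`, the `<Recommendation>` element never matches. The only pairs found in this sketch are `ID` and the inner `li` element, which is not a field I want.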

thomasroulet
Path Finder

Hello,

Update your transforms.conf:

[xml-extr11]
REGEX = <([^>]+)>(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|([^<]*)))<\/\1>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true

It will extract the desired fields.
I have corrected my previous regex.
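
To try the same idea outside Splunk, here is a rough Python sketch. It is an adaptation, not the regex above: Python's `re` has no branch reset `(?|...)`, so the CDATA and plain alternatives get separate groups, and the CDATA body is matched lazily (`.*?`) rather than greedily. The helper name `extract_fields` and the `<ID>` element are hypothetical.

```python
import re
from collections import defaultdict

# Try a CDATA section first, otherwise fall back to a plain value.
# Group 2 holds the CDATA body, group 3 the plain value.
PATTERN = re.compile(r"<([^>]+)>(?:<!\[CDATA\[(.*?)\]\]>|([^<]*))</\1>")


def extract_fields(event):
    """Collect every name/value pair, keeping repeated names as lists
    (roughly what REPEAT_MATCH and MV_ADD ask Splunk to do)."""
    fields = defaultdict(list)
    for match in PATTERN.finditer(event):
        name = match.group(1)
        # Exactly one of the two value groups takes part in each match.
        value = match.group(2) if match.group(2) is not None else match.group(3)
        fields[name].append(value)
    return dict(fields)


event = (
    "<ID>42</ID>"
    "<Recommendation><![CDATA[<p><ul><li>Remove all backup files.</li></ul></p>]]>"
    "</Recommendation>"
)
print(extract_fields(event))
```

In this sketch the `Recommendation` value keeps its inner markup, and the CDATA wrapper itself is not part of the captured value.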


phepales
Loves-to-Learn Everything

Hi Thomas, 

This regex works for my data too, but it does not work when the closing tag is not on the same line as the opening tag.

For example:

<NETWORK_ID>2020</NETWORK_ID>
<NETBIOS>
<![CDATA[WWWW107]]>
</NETBIOS>
<OS>
<![CDATA[Windows 2003 R2]]>
</OS>

Here, it works for the NETWORK_ID tag but not for the NETBIOS and OS tags.

When I remove the whitespace inside the tags, it works.

Could you please suggest how to update the regex accordingly?

Thanks in advance!

scelikok
SplunkTrust

Hi @phepales,

You can use the regex below:

[xml-extr11]
REGEX = <([^>]+)>(?:\n)?(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|([^<]*)))(?:\n)?<\/\1>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true

 

If this reply helps, an upvote and "Accept as Solution" are appreciated.

phepales
Loves-to-Learn Everything

Hi scelikok,

Thanks for your help!

There is actually whitespace, as you can see in the screenshot, and the regex you provided does not work in this case. Could you please help me with it?

phepales_0-1613644098358.png

<NETWORK_ID>2050</NETWORK_ID>
<DNS>

<![CDATA[wwwwwwwww93]]>
</DNS>
<NETBIOS>
<![CDATA[WWWW93]]>
</NETBIOS>
<OS>
<![CDATA[Windows 2008]]>
</OS>

 

Many thanks in advance!

phepales
Loves-to-Learn Everything

Hi scelikok,

This regex worked for me after I made some changes:

<([^>]+)>(?:\n\s*)?(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|([^<]*)))(?:\n\s*)?<\/\1>

 

Thanks!!!
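
For anyone who wants to check the whitespace handling outside Splunk, here is a small Python sketch using the sample from my earlier post. It is an adaptation under a few assumptions: Python's `re` has no branch reset, so the two alternatives use separate groups, and a plain `\s*` on each side of the value stands in for the optional `(?:\n\s*)?`.

```python
import re

# \s* on both sides of the value absorbs newlines and indentation.
# Group 2 is the CDATA body, group 3 a plain (non-CDATA) value.
PATTERN = re.compile(
    r"<([^>]+)>\s*(?:<!\[CDATA\[(.*?)\]\]>|([^<]*?))\s*</\1>",
    re.DOTALL,
)

sample = """<NETWORK_ID>2050</NETWORK_ID>
<DNS>

<![CDATA[wwwwwwwww93]]>
</DNS>
<NETBIOS>
<![CDATA[WWWW93]]>
</NETBIOS>
<OS>
<![CDATA[Windows 2008]]>
</OS>
"""

fields = {}
for m in PATTERN.finditer(sample):
    # Exactly one of the two value groups takes part in each match.
    value = m.group(2) if m.group(2) is not None else m.group(3)
    fields[m.group(1)] = value

print(fields)
```

All four tags are picked up in this sketch, including `DNS`, which has an empty line before its CDATA section.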


thomasroulet
Path Finder

@michaelrosello
Did you solve your extraction problem?
Did the answers help? If so, don't forget to accept an answer and vote.


thomasroulet
Path Finder

Hello,

You can use this regex, which I tested against your sample string on regex101:

<([^>]+)>(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|(.*)))<\/\1>

If you perform the extraction at search time, you could use this:

`| union
[| makeresults | eval xml="<![CDATA[

  • Remove all backup files, binary archives, alternate versions of files, and test files from the web document root of production servers.
  • Amend your deployment policy to include the removal of these file types by an administrator.

]]>"],
[| makeresults | eval xml="D9327888CC8545948C8D62D4FF515BDE"],
[| makeresults | eval xml="<![CDATA[

A backup file was discovered. Binary archives or application files with an alternate file extension may expose source code and application logic to an attacker. If a script's file extension does not match an application extension (such as .asp, .jsp, or .php), then the server usually considers the file equivalent to plain text. When this happens, the server presents the user with the raw source code of the file instead of executing the script and providing interpreted output.
Depending on the content of the script file, the exposure of data varies between simple function calls to database connection credentials to administration passwords.


File archives such as .tgz, .tar.gz, or .zip files should never be stored within the web application's document root. If these files contain an archive of the application's source code, then it will be trivial for an attacker to download and examine the code.]]>
"]

| rex field=xml "<(?<key>[^>]+)>(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?<value>|(.*))\]\]>)|(?<value>|(.*)))<\/\1>"`

The important part is the last line.
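
The named-group idea behind that `rex` line can be sketched in Python as follows. This is a simplified variant, not the same regex: Python spells named groups `(?P<name>...)`, and the CDATA wrapper is made optional around a single lazy `value` group instead of using branch reset. The helper `key_value` and the `<Id>` tag name are hypothetical.

```python
import re

# key: the tag name; value: the content with an optional CDATA wrapper removed.
# (?P=key) is Python's named backreference to the opening tag name.
PATTERN = re.compile(
    r"<(?P<key>[^>]+)>(?:<!\[CDATA\[)?(?P<value>.*?)(?:\]\]>)?</(?P=key)>",
    re.DOTALL,
)


def key_value(xml):
    """Return (key, value) for the first element found, or None."""
    m = PATTERN.search(xml)
    return (m.group("key"), m.group("value")) if m else None


print(key_value("<Id>D9327888CC8545948C8D62D4FF515BDE</Id>"))
print(key_value(
    "<Recommendation><![CDATA[<p>Remove all backup files.</p>]]></Recommendation>"
))
```

With a single `value` group there is no need to reuse the same group name in two alternatives, which Python's `re` would reject anyway.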


michaelrosello
Path Finder

I've tried this, and it works in regex101 but not in Splunk. I suspect that is because the regex takes too many steps?

thomasroulet
Path Finder

In transforms.conf:

[testxml]
SOURCE_KEY = _raw
REGEX = <([^>]+)>(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|(.*)))<\/\1>
FORMAT = $1::$2

At search time, assuming the data is in _raw:
| extract testxml

At index time, in props.conf:

[yoursourcetype]
TRANSFORMS-testxml = testxml

thomasroulet
Path Finder

Do you want to extract at index time or at search time?
Could you post a complete event as an example?

michaelrosello
Path Finder

Here is the complete event. I have also updated the question with my props and transforms.
https://regex101.com/r/Pr0Xag/6
