I need a regex that captures the field name in the first group and the field value in the second group.
The regex I created fails when the field value itself contains "<" and ">". Example below.
<Recommendation><![CDATA[<p><ul><li>Remove all backup files, binary archives, alternate versions of files, and test files from the web document root of production servers.</li><li>Amend your deployment policy to include the removal of these file types by an administrator.</li></ul></p>]]></Recommendation>
Here is the sample data I am dealing with, together with my current regex: https://regex101.com/r/Pr0Xag/2
This is the regex I am currently using:
<([^>]+)>([^<]*)<\/\1>
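For anyone who wants to reproduce the failure outside Splunk, here is a minimal Python sketch. It assumes Python's `re` module behaves like PCRE for this particular pattern, and the `cdata` string is a shortened version of the sample above, trimmed for illustration. Because `[^<]*` cannot cross a "<", the outer Recommendation element never matches, and the regex picks up an inner `li` element instead.

```python
import re

# The regex from the question: tag name in group 1, value in group 2.
ORIGINAL = re.compile(r"<([^>]+)>([^<]*)<\/\1>")

# A plain element with no markup inside the value.
plain = "<NETWORK_ID>2020</NETWORK_ID>"

# Shortened version of the Recommendation sample (assumption: trimmed for brevity).
cdata = (
    "<Recommendation><![CDATA[<p><ul><li>Remove all backup files.</li></ul></p>]]>"
    "</Recommendation>"
)

plain_matches = ORIGINAL.findall(plain)
cdata_matches = ORIGINAL.findall(cdata)

print(plain_matches)  # the expected name/value pair
print(cdata_matches)  # only the innermost <li> element, not Recommendation
```

With the full sample, this is the reason a field named `li` shows up instead of `Recommendation`.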
transforms.conf
[xml-extr11]
REGEX = <([^>]+)>([^<]*)<\/\1>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true
[setnull]
REGEX = <VulnSummary>
DEST_KEY = queue
FORMAT = nullQueue
props.conf
[nexpose_appspider]
TRANSFORMS-null= setnull
BREAK_ONLY_BEFORE = <Vuln>
NO_BINARY_CHECK = true
TIME_FORMAT = %Y-%m-%d %H:%M:%S
TIME_PREFIX = <ScanDate>
MAX_TIMESTAMP_LOOKAHEAD = 19
TRUNCATE = 0
disabled = false
pulldown_type = true
REPORT-xmlext11 = xml-extr11
KV_MODE = none
MAX_EVENTS = 400
Hello,
Update your transforms.conf as follows:
[xml-extr11]
REGEX = <([^>]+)>(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|([^<]*)))<\/\1>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true
It will extract the desired fields.
I have corrected my previous REGEX.
Hi Thomas,
This regex works for my data too, but it does not work when the closing tag is not on the same line as the opening tag,
for example:
<NETWORK_ID>2020</NETWORK_ID>
<NETBIOS>
<![CDATA[WWWW107]]>
</NETBIOS>
<OS>
<![CDATA[Windows 2003 R2]]>
</OS>
Here, it works for the NETWORK_ID tag but not for the NETBIOS and OS tags.
When I remove the whitespace between the tags and the values, it works.
Can you please suggest how to update the regex accordingly?
Thanks in advance!
Hi @phepales,
You can use the regex below:
[xml-extr11]
REGEX = <([^>]+)>(?:\n)?(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|([^<]*)))(?:\n)?<\/\1>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true
Hi scelikok,
Thanks for your help!
Actually, there is whitespace around the values, as you can see in the screenshot, and the regex you provided does not work in this case. Can you please help me with it?
<NETWORK_ID>2050</NETWORK_ID>
<DNS>
<![CDATA[wwwwwwwww93]]>
</DNS>
<NETBIOS>
<![CDATA[WWWW93]]>
</NETBIOS>
<OS>
<![CDATA[Windows 2008]]>
</OS>
Many Thanks in Advance!!
Hi scelikok,
This regex worked for me after I made one change (allowing whitespace after each newline):
<([^>]+)>(?:\n\s*)?(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|([^<]*)))(?:\n\s*)?<\/\1>
Thanks!!!
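For readers testing this approach outside Splunk, below is a rough Python sketch of the same idea. It is not the regex above: Python's `re` module has no branch reset group `(?|...)`, so the CDATA payload and the plain value land in two separate groups and the helper picks whichever one matched. The function name `extract_fields` and the indented sample are assumptions made for illustration.

```python
import re

# Approximate Python equivalent of the whitespace-tolerant regex.
# Group 1: tag name. Group 2: CDATA payload. Group 3: plain text value.
FIELD = re.compile(
    r"<([^>]+)>\s*"                          # opening tag, optional whitespace/newlines
    r"(?:<!\[CDATA\[(.*?)\]\]>|([^<]*?))"    # CDATA payload or a value without '<'
    r"\s*<\/\1>",                            # optional whitespace, matching closing tag
    re.DOTALL,
)


def extract_fields(xml):
    """Return {field name: [values]}, similar in spirit to MV_ADD = true."""
    fields = {}
    for m in FIELD.finditer(xml):
        # Exactly one of the two value groups participates in a match.
        value = m.group(2) if m.group(2) is not None else m.group(3)
        fields.setdefault(m.group(1), []).append(value)
    return fields


# Hypothetical sample: same elements as in the thread, with assumed indentation.
sample = """<NETWORK_ID>2050</NETWORK_ID>
<DNS>
    <![CDATA[wwwwwwwww93]]>
</DNS>
<OS>
    <![CDATA[Windows 2008]]>
</OS>"""

print(extract_fields(sample))
```

Using `\s*` instead of `(?:\n\s*)?` also covers Windows line endings and leading spaces on the same line, at the cost of trimming whitespace around plain values.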
@michaelrosello
Did you solve your extraction problem?
Did the answers help? If so, don't forget to accept an answer and vote.
Hello,
You can use this regex, which I tested with your test string on regex101:
<([^>]+)>(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|(.*)))<\/\1>
If you perform the extraction at search time, you could use this:
`| union
[| makeresults | eval xml="<![CDATA[ ]]>"],
- Remove all backup files, binary archives, alternate versions of files, and test files from the web document root of production servers.
- Amend your deployment policy to include the removal of these file types by an administrator.
[| makeresults | eval xml="D9327888CC8545948C8D62D4FF515BDE "],
[| makeresults | eval xml="<![CDATA[ "]A backup file was discovered. Binary archives or application files with an alternate file extension may expose source code and application logic to an attacker. If a script's file extension does not match an application extension (such as .asp, .jsp, or .php), then the server usually considers the file equivalent to plain text. When this happens, the server presents the user with the raw source code of the file instead of executing the script and providing interpreted output.
Depending on the content of the script file, the exposure of data varies between simple function calls to database connection credentials to administration passwords.
File archives such as .tgz, .tar.gz, or .zip files should never be stored within the web application's document root. If these files contain an archive of the application's source code, then it will be trivial for an attacker to download and examine the code.]]>
| rex field=xml "<(?<key>[^>]+)>(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?<value>|(.*))\]\]>)|(?<value>|(.*)))<\/\1>"`
The important part is the last line.
I've tried this, and it works in regex101 but not in Splunk. I suspect it is because the regex takes too many steps?
In transforms.conf:
[testxml]
SOURCE_KEY = _raw
REGEX = <([^>]+)>(?|(?=<!\[CDATA\[.*\]\]>)(?|<!\[CDATA\[(?|(.*))\]\]>)|(?|(.*)))<\/\1>
FORMAT = $1::$2
At search time, assuming the data is in _raw:
| extract testxml
At index time:
In props.conf:
[yoursourcetype]
TRANSFORMS-testxml = testxml
Do you want to extract at index time or at search time?
Could you post a complete event as an example?
Here is the complete event. I also updated the question with my props and transforms.
https://regex101.com/r/Pr0Xag/6