Multi Line Field Extraction for XML Data

SplunkDash
Motivator

Hello,

I have XML files with multi-line field values and am having trouble extracting those values. My sample field extraction regex for the first two fields and some sample events are given below. Any help will be highly appreciated. Thank you!

 

Sample code

<USER>(?P<USER>.+)<\/USER>\\n\\r<USERTYPE>(?P<USERTYPE>.+)<\/USERTYPE>

 

Sample Data

<MTData>

                                <USER>TEST05GLBC</USER>

                                <USERTYPE>Admin</USERTYPE>

                                <SUBJECT />

                                <SESSION>hp0vtlg001</SESSION>

                               <SYSTEM>DS</SYSTEM>

                                <EVENTTYPE>USER_Supervisor</EVENTTYPE>

                                <EVENTID>VIEW</EVENTID>

                                <SIP>10.210.345.254</SIP>

                                <EVENTSTATUS>120</EVENTSTATUS>

                                <EMSG />

                                <STATUS>FALSE</STATUS>

                                <STIME>2022-06-02 19:10:57.967</STIME>

                                <VADDATA>2019:00-00002; 2019:00-0000002; 2019:00-00003</VADDATA>

                                <TIMEPERIOD />

                                <CODE />

                                <RTYPE />

                                <DTFTYPE />

                                <DIP>10.225.35.45</DIP>

                           <DEVICE>Laptop</DEVICE>

                </MTData>

 

                <MTData>

                                <USER>TEST06HLDC</USER>

                                <USERTYPE>Power</USERTYPE>

                                <SUBJECT />

                                <SESSION>hp2ftlg021</SESSION>

                               <SYSTEM>Test</SYSTEM>

                                <EVENTTYPE>USER_MANAGER</EVENTTYPE>

                                <EVENTID>Update</EVENTID>

                                <SIP>10.210.345.254</SIP>

                                <EVENTSTATUS>122</EVENTSTATUS>

                                <EMSG />

                                <STATUS>TRUE</STATUS>

                                <STIME>2022-06-02 19:20:57.967</STIME>

                                <VADDATA>2019:00-00012; 2019:00-0000002; 2019:00-00024</VADDATA>

                                <TIMEPERIOD />

                                <CODE />

                                <RTYPE />

                                <DTFTYPE />

                                <DIP>10.225.35.45</DIP>

                            <DEVICE>Laptop</DEVICE>

                </MTData>

 

JacekF
Path Finder

Assuming that the XML you've provided is the raw data, you can use the following props.conf configuration:

[sourcetype_name]
BREAK_ONLY_BEFORE=\s*<MTData
CHARSET=UTF-8
KV_MODE=xml
LINE_BREAKER=\s*<MTData
MAX_EVENTS=1000
NO_BINARY_CHECK=true
SHOULD_LINEMERGE=true
TIME_PREFIX=<STIME>

and deploy it to the indexers and search heads.

The KV_MODE=xml setting in the above props.conf should then extract the field values for you at search time.

SplunkDash
Motivator

Hello,

Thank you so much. Please see the following screenshot. I was trying to do the field extraction here, but no fields were extracted. Any recommendation would be highly appreciated. Thank you again.

[Screenshot: SplunkDash_0-1657691596302.png]

 


JacekF
Path Finder

If what you are showing in the screenshot is the entire raw event, you should use the KV_MODE setting in props.conf. Writing a regular expression that handles the XML properly, especially elements that can be empty, will be difficult and inefficient from a performance standpoint.

But if you really want to go that way, you can extract the user and usertype values with the following regex (tested on regex101.com, not in Splunk):

<USER>(?<user>[^<]+)<\/USER>([\r\n]*)<USERTYPE>(?<usertype>[^<]+)
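
As a rough offline check outside Splunk, the Python sketch below runs the pattern above against two lines of the posted sample. Python's re module needs (?P<name>...) for named groups, so the pattern is transcribed accordingly. The \s* variant is an assumption, not part of the original answer, for raw events that keep the indentation shown in the sample. Neither result says anything definitive about how Splunk's regex engine behaves on the real events.

```python
import re

# Two elements from the posted sample, keeping the blank lines and indentation.
SAMPLE = (
    "<MTData>\n\n"
    "                                <USER>TEST05GLBC</USER>\n\n"
    "                                <USERTYPE>Admin</USERTYPE>\n\n"
    "</MTData>"
)

# The pattern from the answer, transcribed to Python named-group syntax.
NEWLINES_ONLY = re.compile(
    r"<USER>(?P<user>[^<]+)</USER>([\r\n]*)<USERTYPE>(?P<usertype>[^<]+)"
)

# Assumed variant: \s* also consumes the spaces that indent <USERTYPE>.
ANY_WHITESPACE = re.compile(
    r"<USER>(?P<user>[^<]+)</USER>\s*<USERTYPE>(?P<usertype>[^<]+)"
)


def extract(pattern, text):
    """Return the named groups as a dict, or None when nothing matches."""
    match = pattern.search(text)
    return match.groupdict() if match else None


strict = extract(NEWLINES_ONLY, SAMPLE)
relaxed = extract(ANY_WHITESPACE, SAMPLE)
print(strict)
print(relaxed)
```

If the raw events really are indented, the [\r\n]* version stops at the first space before <USERTYPE>, so allowing any whitespace between the two elements may be the safer choice. The same point applies to the pattern in the question, where \\n\\r would, as written, look for literal backslashes rather than line breaks.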

yuanliu
SplunkTrust

Alternatively, you can use mode=sed so you don't have to manually program all field names.

| rex mode=sed "s/\s+(<[^>\/]+)>([^<]+)<.+/\n\1=\"\2\"/g"
| kv

Just note that these regex methods don't conform to XML.  There is no guarantee that they will handle future events.
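
To illustrate both the appeal and the caveat, here is a small Python sketch (not Splunk, and not the sed expression above) that pulls every simple <TAG>value</TAG> pair with one generic pattern, so no field name is hard-coded. The sample string, the pattern, and the function name are made up for illustration.

```python
import re

# A shortened copy of one event from the question.
SAMPLE = """<MTData>
    <USER>TEST05GLBC</USER>
    <USERTYPE>Admin</USERTYPE>
    <SUBJECT />
    <SESSION>hp0vtlg001</SESSION>
</MTData>"""

# One generic pattern: a tag name, a non-empty value without '<',
# and the matching closing tag.
PAIR = re.compile(r"<([A-Za-z_][\w.-]*)>([^<]+)</\1>")


def tag_value_pairs(text):
    """Return {tag: value} for every simple <TAG>value</TAG> element."""
    return {tag: value.strip() for tag, value in PAIR.findall(text)}


pairs = tag_value_pairs(SAMPLE)
print(pairs)
```

Self-closing elements such as <SUBJECT /> produce no field here, and attributes, nested elements, CDATA sections, or escaped entities would all slip past a pattern like this. That is the sense in which a regex approach does not conform to XML.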


yuanliu
SplunkTrust

@JacekF's solution should work in Splunk Cloud (i.e., with no access to props.conf) as well. You can add BREAK_ONLY_BEFORE and KV_MODE in the "Advanced" menu.

[Screenshot: Go to "Advanced"]

The following shows the auto-extracted fields:

[Screenshot: Sample events from demonstrated data]


yuanliu
SplunkTrust

You shouldn't have to, and probably shouldn't, extract XML with regex. Use the built-in spath command. Using your data, spath gives:

Field | First MTData | Second MTData
MTData.DEVICE | Laptop | Laptop
MTData.DIP | 10.225.35.45 | 10.225.35.45
MTData.EVENTID | VIEW | Update
MTData.EVENTSTATUS | 120 | 122
MTData.EVENTTYPE | USER_Supervisor | USER_MANAGER
MTData.SESSION | hp0vtlg001 | hp2ftlg021
MTData.SIP | 10.210.345.254 | 10.210.345.254
MTData.STATUS | FALSE | TRUE
MTData.STIME | 2022-06-02 19:10:57.967 | 2022-06-02 19:20:57.967
MTData.SYSTEM | DS | Test
MTData.USER | TEST05GLBC | TEST06HLDC
MTData.USERTYPE | Admin | Power
MTData.VADDATA | 2019:00-00002; 2019:00-0000002; 2019:00-00003 | 2019:00-00012; 2019:00-0000002; 2019:00-00024

Is this something you are looking for?
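
For readers who want to see where names like MTData.USER come from, the following Python sketch flattens one event into path-style field names with the standard library XML parser. It is only an analogy for the kind of output listed above, not how spath is implemented; the shortened sample and the flatten function are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of one event from the question.
SAMPLE = """<MTData>
    <USER>TEST05GLBC</USER>
    <USERTYPE>Admin</USERTYPE>
    <SUBJECT />
    <SESSION>hp0vtlg001</SESSION>
</MTData>"""


def flatten(xml_text):
    """Map each non-empty child element to a 'Root.Child' field name."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        text = (child.text or "").strip()
        if text:  # skip empty elements such as <SUBJECT />
            fields[f"{root.tag}.{child.tag}"] = text
    return fields


fields = flatten(SAMPLE)
print(fields)
```

Because a real XML parser does the work, empty elements are handled without any special-case pattern, which is the main argument for preferring a parser-based extraction over a hand-written regex.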

SplunkDash
Motivator

Hello,

Thank you so much for your quick response. But how would I incorporate spath into an inline field extraction?


yuanliu
SplunkTrust

Do you have a field containing the XML document?  Or is the entire _raw in XML?

SplunkDash
Motivator

It's the raw event (_raw)!


yuanliu
SplunkTrust
| spath

With no arguments, spath takes _raw as input. If you are just beginning to ingest this document, also consider setting up a new sourcetype that uses XML as its base type. In that case, you don't have to do anything in search. (Have you checked the existing fields? Could they already exist?)

SplunkDash
Motivator

Hello @yuanliu,

Thank you so much. I tried it before and it worked from the search head (SH), but it didn't work in an inline field extraction (setting->fields->field extractions).


yuanliu
SplunkTrust

"setting->fields->field extractions" only works with regex, which is unsuitable for structured data like XML. Use | spath in the search itself. (I was mistaken. There's no XML base type; index-time structured extraction (INDEXED_EXTRACTIONS) covers formats such as CSV and JSON, but not XML. So, inline spath is the best choice.)

SplunkDash
Motivator

@yuanliu ,

Is there any way we can use regex? I know (as you also mentioned) that regex is not a suitable option. Thank you so much again!
