Solved: Multi Line Field Extraction for XML Data

SplunkDash · ‎07-12-2022

Hello,

I have XML files with multi-line field values and am having trouble extracting those values. Sample field extraction code for the first two values and sample data/events are given below. Any help will be highly appreciated. Thank you!

Sample code

Sample Data

<USERTYPE>Admin</USERTYPE>

<EVENTTYPE>USER_Supervisor</EVENTTYPE>

<STATUS>FALSE</STATUS>

<DEVICE>Laptop</DEVICE>

</MTData>

<USERTYPE>Power</USERTYPE>

<EVENTTYPE>USER_MANAGER</EVENTTYPE>

<EVENTID>Update</EVENTID>

<DEVICE>Laptop</DEVICE>

</MTData>

JacekF · ‎07-12-2022

If what you are showing in the screenshot is the entire raw event, you should use the KV_MODE setting in props.conf. Writing a regular expression that properly handles the XML, especially elements that can be empty, will be difficult and inefficient from a performance standpoint.

But if you really want to go that way, you can extract user and usertype values with the following regex (tested in regex101.com, not in Splunk):

<USER>(?<user>[^<]+)<\/USER>([\r\n]*)<USERTYPE>(?<usertype>[^<]+)
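
To see what this pattern does and does not capture, here is a minimal Python sketch. It is an illustration outside Splunk, not a tested Splunk setup: Python spells named groups as (?P<name>...) where Splunk's rex uses (?<name>...), and the extract helper and the three sample events are hypothetical, modeled on the field names in this thread.

```python
import re

# Same pattern as above, rewritten with Python's (?P<name>...) named-group syntax.
PATTERN = re.compile(
    r"<USER>(?P<user>[^<]+)</USER>([\r\n]*)<USERTYPE>(?P<usertype>[^<]+)"
)


def extract(raw):
    """Return the named captures as a dict, or None when the pattern does not match."""
    match = PATTERN.search(raw)
    return match.groupdict() if match else None


# Hypothetical events modeled on the thread's sample data.
adjacent = "<MTData>\n<USER>TEST05GLBC</USER>\n<USERTYPE>Admin</USERTYPE>\n</MTData>"
indented = "<MTData>\n  <USER>TEST05GLBC</USER>\n  <USERTYPE>Admin</USERTYPE>\n</MTData>"
empty_user = "<MTData>\n<USER></USER>\n<USERTYPE>Admin</USERTYPE>\n</MTData>"

print(extract(adjacent))    # both fields captured
print(extract(indented))    # None: spaces between the elements are not covered by [\r\n]*
print(extract(empty_user))  # None: [^<]+ needs at least one character
```

The last two cases show the fragility mentioned above: indentation between the elements or an empty element is enough to make the pattern miss the event.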

JacekF · ‎07-12-2022

Assuming the XML you've provided is the raw data, you can use the following props.conf configuration:

[sourcetype_name]
BREAK_ONLY_BEFORE=\s*<MTData
CHARSET=UTF-8
KV_MODE=xml
LINE_BREAKER=\s*<MTData
MAX_EVENTS=1000
NO_BINARY_CHECK=true
SHOULD_LINEMERGE=true
TIME_PREFIX=<STIME>

Deploy it to both the indexer and the search head.

The KV_MODE=xml setting in the above props.conf should properly extract values for you.
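
As a rough mental model of the event-breaking part of this configuration, the Python sketch below starts a new chunk wherever a <MTData element begins. It only illustrates the idea behind BREAK_ONLY_BEFORE; Splunk's actual line-merging pipeline is more involved, and the split_events helper and the sample stream are hypothetical.

```python
import re


def split_events(stream):
    """Split text into chunks, starting a new chunk before each '<MTData'."""
    # The lookahead matches the empty string just before '<MTData', so the
    # opening tag stays with the event that follows it (requires Python 3.7+).
    chunks = re.split(r"(?=<MTData)", stream)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


# Hypothetical two-event stream modeled on the thread's sample data.
stream = (
    "<MTData>\n<USERTYPE>Admin</USERTYPE>\n</MTData>\n"
    "<MTData>\n<USERTYPE>Power</USERTYPE>\n</MTData>\n"
)

for event in split_events(stream):
    print(repr(event))
```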

SplunkDash · ‎07-12-2022

Hello,

Thank you so much. Please see the following screenshot. I was trying to do field extraction here, but no fields were extracted. Any recommendation would be highly appreciated. Thank you again.

yuanliu · ‎07-13-2022

Alternatively, you can use mode=sed so you don't have to manually program all field names.

| rex mode=sed "s/\s+(<[^>\/]+)>([^<]+)<.+/\n\1=\"\2\"/g"
| kv

Just note that these regex methods don't conform to XML. There is no guarantee that they will handle future events.
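
To make the rewrite step concrete, here is a small Python sketch of the same substitution. It is an approximation outside Splunk: re.sub stands in for rex mode=sed, the kv_pairs helper is only a crude stand-in for the kv command, and the sample event is hypothetical.

```python
import re


def sed_to_kv(raw):
    """Rewrite '<TAG>value</TAG>' lines as '<TAG="value"', like the sed expression above."""
    return re.sub(r'\s+(<[^>/]+)>([^<]+)<.+', r'\n\1="\2"', raw)


def kv_pairs(text):
    """Crude stand-in for `| kv`: collect name="value" pairs into a dict."""
    return dict(re.findall(r'(\w+)="([^"]*)"', text))


# Hypothetical event; <DIP> is deliberately empty.
event = (
    "<MTData>\n"
    "  <USER>TEST05GLBC</USER>\n"
    "  <USERTYPE>Admin</USERTYPE>\n"
    "  <DIP></DIP>\n"
    "</MTData>"
)

rewritten = sed_to_kv(event)
print(rewritten)
print(kv_pairs(rewritten))
```

In this sketch the empty <DIP> element is left untouched by the substitution, so no DIP pair is produced, which is one example of how the regex approach can diverge from real XML handling.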

yuanliu · ‎07-12-2022

@JacekF's solution should also work in Splunk Cloud (i.e., without access to props.conf). You can add BREAK_ONLY_BEFORE and KV_MODE in the "Advanced" menu.

The following shows auto extracted fields

Sample events from demonstrated data

yuanliu · ‎07-12-2022

You shouldn't have to extract XML with regex, and probably shouldn't anyway. Use the built-in spath command. Using your data, spath gives:

MTData.DEVICE: Laptop
MTData.DIP: 10.225.35.45
MTData.EVENTID: VIEW | Update
MTData.EVENTSTATUS: 120 | 122
MTData.EVENTTYPE: USER_Supervisor | USER_MANAGER
MTData.SESSION: hp0vtlg001 | hp2ftlg021
MTData.SIP: 10.210.345.254
MTData.STATUS: FALSE | TRUE
MTData.STIME: 2022-06-02 19:10:57.967 | 2022-06-02 19:20:57.967
MTData.SYSTEM: DS | Test
MTData.USER: TEST05GLBC | TEST06HLDC
(field name missing in source; values match <USERTYPE>): Admin | Power
(field name missing in source): 2019:00-00002; 2019:00-0000002; 2019:00-00003 | 2019:00-00012; 2019:00-0000002; 2019:00-00024

Is this something you are looking for?

SplunkDash · ‎07-12-2022

Hello,

Thank you so much for your quick response. But, how would I incorporate spath into inline field extraction?

yuanliu · ‎07-12-2022

Do you have a field containing the XML document? Or is the entire _raw in XML?

SplunkDash · ‎07-12-2022

It's the raw event!

yuanliu · ‎07-12-2022

| spath

assumes _raw as input. If you are just beginning to ingest this document, also consider setting up a new sourcetype that uses XML as its base type. In that case, you don't have to do anything at search time. (Have you checked the existing fields? They may already be extracted.)
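
As a loose analogy for why a parser is the better fit here (this is not Splunk's implementation), the Python sketch below parses a hypothetical event with the standard-library XML parser and builds dotted names in the same style as the MTData.* fields shown earlier. The flatten helper and the sample event are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def flatten(raw):
    """Map each child element of the root to a dotted name like 'MTData.USER'."""
    root = ET.fromstring(raw)
    fields = {}
    for child in root:
        name = f"{root.tag}.{child.tag}"
        # Empty elements have text None; keep them as empty strings.
        fields.setdefault(name, []).append((child.text or "").strip())
    return fields


# Hypothetical event; <DIP> is deliberately empty.
event = (
    "<MTData>\n"
    "  <USER>TEST05GLBC</USER>\n"
    "  <USERTYPE>Admin</USERTYPE>\n"
    "  <DIP></DIP>\n"
    "</MTData>"
)

for name, values in flatten(event).items():
    print(name, values)
```

Indentation and empty elements need no special handling in this sketch, because the parser works on the document structure rather than on the text layout.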

SplunkDash · ‎07-12-2022

Hello @yuanliu,

Thank you so much. I tried it before and it worked from the search head (SH), but it didn't work as an inline field extraction (Settings > Fields > Field extractions).

yuanliu · ‎07-12-2022

"Settings > Fields > Field extractions" only works with regex, which is unsuitable for structured data like XML. Use | spath in the search instead. (I was mistaken earlier. There's no XML base type; index-time structured extraction covers formats such as JSON and CSV, but not XML. So, inline spath is the best choice.)

SplunkDash · ‎07-12-2022

@yuanliu ,

Is there any way we can use regex? I know (as you also mentioned) that regex is not a suitable option. Thank you so much again!
