Hello,
I have XML files with multi-line field values and am having trouble extracting those values. My sample field extraction regex for the first two values and some sample data/events are given below. Any help will be highly appreciated. Thank you!
Sample code
<USER>(?P<USER>.+)<\/USER>\\n\\r<USERTYPE>(?P<USERTYPE>.+)<\/USERTYPE>
Sample Data
<MTData>
<USER>TEST05GLBC</USER>
<USERTYPE>Admin</USERTYPE>
<SUBJECT />
<SESSION>hp0vtlg001</SESSION>
<SYSTEM>DS</SYSTEM>
<EVENTTYPE>USER_Supervisor</EVENTTYPE>
<EVENTID>VIEW</EVENTID>
<SIP>10.210.345.254</SIP>
<EVENTSTATUS>120</EVENTSTATUS>
<EMSG />
<STATUS>FALSE</STATUS>
<STIME>2022-06-02 19:10:57.967</STIME>
<VADDATA>2019:00-00002; 2019:00-0000002; 2019:00-00003</VADDATA>
<TIMEPERIOD />
<CODE />
<RTYPE />
<DTFTYPE />
<DIP>10.225.35.45</DIP>
<DEVICE>Laptop</DEVICE>
</MTData>
<MTData>
<USER>TEST06HLDC</USER>
<USERTYPE>Power</USERTYPE>
<SUBJECT />
<SESSION>hp2ftlg021</SESSION>
<SYSTEM>Test</SYSTEM>
<EVENTTYPE>USER_MANAGER</EVENTTYPE>
<EVENTID>Update</EVENTID>
<SIP>10.210.345.254</SIP>
<EVENTSTATUS>122</EVENTSTATUS>
<EMSG />
<STATUS>TRUE</STATUS>
<STIME>2022-06-02 19:20:57.967</STIME>
<VADDATA>2019:00-00012; 2019:00-0000002; 2019:00-00024</VADDATA>
<TIMEPERIOD />
<CODE />
<RTYPE />
<DTFTYPE />
<DIP>10.225.35.45</DIP>
<DEVICE>Laptop</DEVICE>
</MTData>
If what you are showing in the screenshot is the entire raw event, you should use the KV_MODE setting in props.conf. It is difficult, and inefficient from a performance standpoint, to write a regular expression that properly handles XML, especially when elements can be empty.
But if you really want to go that way, you can extract user and usertype values with the following regex (tested in regex101.com, not in Splunk):
<USER>(?<user>[^<]+)<\/USER>([\r\n]*)<USERTYPE>(?<usertype>[^<]+)
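If you want to check both patterns outside Splunk first, here is a minimal Python sketch. It assumes the events use \r\n or \n line endings and no indentation; the `extract` helper and the shortened `sample` string are only for illustration. Python spells named groups `(?P<name>...)`, so the suggested pattern is rewritten in that form. The original attempt fails because `\\n\\r` matches a literal backslash followed by a letter rather than a line break.

```python
import re

# Two elements from the sample event, joined with a Windows-style line ending.
sample = "<USER>TEST05GLBC</USER>\r\n<USERTYPE>Admin</USERTYPE>\r\n<SUBJECT />"

# Original attempt: "\\n\\r" means backslash + "n" + backslash + "r" as literal
# characters, so it cannot match a real line break.
original = re.compile(
    r"<USER>(?P<USER>.+)<\/USER>\\n\\r<USERTYPE>(?P<USERTYPE>.+)<\/USERTYPE>"
)

# Suggested pattern, with (?<name>...) rewritten as (?P<name>...) for Python.
suggested = re.compile(
    r"<USER>(?P<user>[^<]+)<\/USER>([\r\n]*)<USERTYPE>(?P<usertype>[^<]+)"
)


def extract(pattern, text):
    """Return the named groups as a dict, or None when nothing matches."""
    match = pattern.search(text)
    return match.groupdict() if match else None


print(extract(original, sample))
print(extract(suggested, sample))
```

If the real files indent the elements, `[\r\n]*` would have to become `\s*` so that leading spaces or tabs are skipped as well.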
Assuming that the XML you've provided is the raw data, you can use the following props.conf configuration:
[sourcetype_name]
BREAK_ONLY_BEFORE=\s*<MTData
CHARSET=UTF-8
KV_MODE=xml
LINE_BREAKER=([\r\n]+)\s*<MTData
MAX_EVENTS=1000
NO_BINARY_CHECK=true
SHOULD_LINEMERGE=true
TIME_PREFIX=<STIME>
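Deploy it to both the indexers and the search head: the line-breaking and timestamp settings apply at index time, while KV_MODE applies at search time.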
The KV_MODE=xml setting in the above props.conf should properly extract values for you.
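To make the intent of the event-breaking and timestamp settings concrete, here is a rough Python sketch. It is not how Splunk implements them, only an illustration of the goal: start a new event before each <MTData> and read the event time from the text after <STIME>. The `split_events` and `event_time` helpers and the shortened `raw` string are hypothetical.

```python
import re
from datetime import datetime

# Shortened version of the sample data: two MTData blocks back to back.
raw = (
    "<MTData>\n<USER>TEST05GLBC</USER>\n"
    "<STIME>2022-06-02 19:10:57.967</STIME>\n</MTData>\n"
    "<MTData>\n<USER>TEST06HLDC</USER>\n"
    "<STIME>2022-06-02 19:20:57.967</STIME>\n</MTData>\n"
)


def split_events(text):
    """Start a new event wherever a line break is followed by <MTData>."""
    parts = re.split(r"[\r\n]+(?=\s*<MTData>)", text.strip())
    return [part for part in parts if part]


def event_time(event):
    """Parse the timestamp that follows the <STIME> prefix, if present."""
    match = re.search(r"<STIME>([^<]+)", event)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")


events = split_events(raw)
for event in events:
    print(event_time(event), len(event.splitlines()), "lines")
```

Each MTData block becomes one event, and its time comes from its own STIME element.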
Hello,
Thank you so much. Please see the following screenshot. I was trying to do field extraction here, but no fields were extracted. Any recommendation would be highly appreciated. Thank you again.
Alternatively, you can use rex with mode=sed to rewrite the elements as key="value" pairs, so you don't have to hard-code every field name.
| rex mode=sed "s/\s+(<[^>\/]+)>([^<]+)<.+/\n\1=\"\2\"/g"
| kv
Just note that these regex methods don't conform to XML. There is no guarantee that they will handle future events.
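A small Python sketch of why this matters, using a hypothetical event in which USER is empty (the sample data already contains empty elements such as <SUBJECT />). The two-field regex from earlier in the thread matches nothing, so USERTYPE is lost as well, while a real XML parser still returns every element. The function names are illustrative only, and the parser here is Python's standard library, not anything Splunk uses.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical event: same shape as the sample data, but USER is empty.
event = "<MTData>\n<USER />\n<USERTYPE>Admin</USERTYPE>\n<SUBJECT />\n</MTData>"

# Two-field pattern from the thread, in Python's (?P<name>...) syntax.
pattern = re.compile(
    r"<USER>(?P<user>[^<]+)<\/USER>([\r\n]*)<USERTYPE>(?P<usertype>[^<]+)"
)


def fields_by_regex(text):
    """Return the named groups, or an empty dict when the pattern fails."""
    match = pattern.search(text)
    return match.groupdict() if match else {}


def fields_by_parser(text):
    """Map each child element name to its text; empty elements become ''."""
    root = ET.fromstring(text)
    return {child.tag: (child.text or "").strip() for child in root}


print(fields_by_regex(event))
print(fields_by_parser(event))
```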
You shouldn't have to, and probably shouldn't, extract XML with regex. Use the built-in spath command. Using your data, spath gives:
Field | First value | Second value
MTData.DEVICE | Laptop | Laptop
MTData.DIP | 10.225.35.45 | 10.225.35.45
MTData.EVENTID | VIEW | Update
MTData.EVENTSTATUS | 120 | 122
MTData.EVENTTYPE | USER_Supervisor | USER_MANAGER
MTData.SESSION | hp0vtlg001 | hp2ftlg021
MTData.SIP | 10.210.345.254 | 10.210.345.254
MTData.STATUS | FALSE | TRUE
MTData.STIME | 2022-06-02 19:10:57.967 | 2022-06-02 19:20:57.967
MTData.SYSTEM | DS | Test
MTData.USER | TEST05GLBC | TEST06HLDC
MTData.USERTYPE | Admin | Power
MTData.VADDATA | 2019:00-00002; 2019:00-0000002; 2019:00-00003 | 2019:00-00012; 2019:00-0000002; 2019:00-00024
Hello,
Thank you so much for your quick response. But how would I incorporate spath into an inline field extraction?
Do you have a field containing the XML document? Or is the entire _raw in XML?
The entire _raw is XML!
| spath
assumes _raw as input. If you are just beginning to ingest this data, also consider setting up a new sourcetype that uses XML as its base type. In that case, you don't have to do anything in search. (Have you checked the existing fields? Could they already have been extracted?)
Hello @yuanliu,
Thank you so much. I tried it before and it worked from the search head (SH), but it didn't work as an inline field extraction (Settings -> Fields -> Field extractions).
"Settings -> Fields -> Field extractions" only works with regex, which is unsuitable for structured data like XML. Use | spath in the search itself. (I was mistaken. There's no XML base type; index-time structured extraction supports JSON and delimited formats such as CSV, but not XML. So, inline spath is the best choice.)
@yuanliu ,
Is there any way we can use regex? I know (and you also mentioned) that regex is not a suitable option. Thank you so much again!