Hi, I have an XML-like (but not proper XML) feed that I need to parse.
A sample is below, and I need to parse out each field.
Each field will not necessarily be in each event, so I need a method that will find it, without depending upon a previous field or the location within the event itself.
Can anyone help?
Apr 22 19:54:29 138.126.78.80 <STONEGATE_LOG><TIMESTAMP>2019-04-22 15:54:28</TIMESTAMP><LOGID>9999999</LOGID><NODEID>1.2.3.4</NODEID><FACILITY>Packet Filtering</FACILITY><TYPE>Notification</TYPE><EVENT>New connection</EVENT><ACTION>Allow</ACTION><SRC>4.5.6.7</SRC><DST>X.X.X.X</DST><SERVICE>HTTP</SERVICE><PROTOCOL>2</PROTOCOL><SPORT>12345</SPORT><DPORT>99</DPORT><RULEID>60732.1</RULEID><SRCIF>5</SRCIF><COMPID>some text here</COMPID><RECEPTIONTIME>2019-04-22 15:54:29</RECEPTIONTIME><SENDERTYPE>Firewall</SENDERTYPE><SITUATION>Connection_Allowed</SITUATION><EVENTID>99999999999</EVENTID></STONEGATE_LOG>
Hi,
To extract XML data at search time, you can use the config below on the search head.
props.conf
[yourSourcetype]
REPORT-test = xmlkv_alt
transforms.conf
[xmlkv_alt]
FORMAT = $1::$2
REGEX = <([^>]*)>([^<]*)<\/\1>
EDIT: Please find regex extraction with sample data on https://regex101.com/r/tJVD20/1
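To sanity-check the pattern outside Splunk, here is a minimal Python sketch that applies the same regex to a shortened copy of the sample event. It only approximates what the transform does (Splunk runs the regex itself and pairs the groups via FORMAT = $1::$2); the variable names and the shortened event are illustrative assumptions, not part of the Splunk config.

```python
import re

# Same pattern as the REGEX line in transforms.conf: the \1 backreference
# requires the closing tag to repeat the opening tag's name.
PATTERN = re.compile(r"<([^>]*)>([^<]*)<\/\1>")

# Shortened copy of the sample event from the question (illustrative).
event = (
    "Apr 22 19:54:29 138.126.78.80 <STONEGATE_LOG>"
    "<TIMESTAMP>2019-04-22 15:54:28</TIMESTAMP>"
    "<LOGID>9999999</LOGID>"
    "<ACTION>Allow</ACTION>"
    "<SRC>4.5.6.7</SRC>"
    "</STONEGATE_LOG>"
)

# Each match yields (tag name, value), mirroring FORMAT = $1::$2.
fields = dict(PATTERN.findall(event))

for name, value in fields.items():
    print(f"{name} = {value}")
```

Because each tag is matched on its own, neither the order of the fields nor the absence of any one field matters. The outer STONEGATE_LOG wrapper is not extracted, since its content contains other tags and [^<]* cannot span them.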
All these answers are missing this setting in transforms.conf:
MV_ADD = true
So the full stanza is:
[YourNameHere]
REGEX = <([^\/][^>]+)>(.*?)<\/[^>]+>
FORMAT = $1::$2
MV_ADD = true
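As a rough illustration of what MV_ADD changes, the Python sketch below mimics the behavior with the regex from this stanza: when a field name shows up more than once in an event, the later values are either appended (MV_ADD = true) or discarded (the default). The extract function and the event fragment with a repeated SRC tag are hypothetical, made up for this example; this is not Splunk code.

```python
import re

# Regex from the stanza above.
PATTERN = re.compile(r"<([^\/][^>]+)>(.*?)<\/[^>]+>")


def extract(event, mv_add):
    """Collect tag/value pairs; mimic MV_ADD for repeated field names."""
    fields = {}
    for name, value in PATTERN.findall(event):
        if name not in fields:
            fields[name] = [value]
        elif mv_add:
            # MV_ADD = true: append to make the field multivalued.
            fields[name].append(value)
        # MV_ADD = false: the newly found value is dropped.
    return fields


# Hypothetical fragment in which SRC appears twice.
fragment = "<SRC>4.5.6.7</SRC><SRC>8.9.10.11</SRC><ACTION>Allow</ACTION>"

with_mv_add = extract(fragment, mv_add=True)
without_mv_add = extract(fragment, mv_add=False)

print(with_mv_add)
print(without_mv_add)
```

The fragment deliberately leaves out the STONEGATE_LOG wrapper: this pattern does not tie the closing tag to the opening one, so on the full sample event the opening wrapper tag would pair with the first closing tag it finds.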
This will not work, because REPEAT_MATCH is only valid for index-time field extraction, and the solution I provided is for search-time extraction.
Quite correct; I always get MV_ADD and REPEAT_MATCH confused. I have corrected my answer.
Thanks. This works quite well. Is there any way of forcing field names to be lowercase?
You will have to stack a calculated field on top of this using lower(fieldname).
I expect that a props.conf entry for a calculated field would work with eval's lower().
Interesting, so the xml doesn't have to be well-formed, as the sample above isn't well-formed.
Amazing, because back then a similar solution for JSON was a big hit here: How can we extract a json document within an event?
We ended up with:
REPORT-extract = json_embedded
[json_embedded]
REGEX = "(\w+)"."(\S+?)"
FORMAT = $1::$2
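For a quick feel of what this JSON pattern does and does not catch, here is a small Python sketch. The two JSON strings are made-up examples, not data from the original thread, and the sketch only exercises the regex itself, not Splunk.

```python
import re

# Same pattern as the REGEX line in the json_embedded stanza.
PATTERN = re.compile(r'"(\w+)"."(\S+?)"')

# Hypothetical inputs: compact JSON with string values, and a variant
# with a space after each colon plus a numeric value.
compact = '{"user":"alice","action":"login"}'
spaced = '{"user": "alice", "count": 3}'

compact_fields = dict(PATTERN.findall(compact))
spaced_fields = dict(PATTERN.findall(spaced))

print(compact_fields)
print(spaced_fields)
```

The pattern expects exactly one character between the quoted key and the quoted value, and a quoted value without whitespace, so it suits compact JSON with string values. In the spaced variant nothing matches.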
Yes you can use regex for magic 😉
Thanks. I see them appearing on the regex site, but they don't appear as fields on the SH when I try that. Are there additional steps required?
If you modified the config file directly, you need to restart the Splunk service, or you can use the /debug/refresh web endpoint.
How will the fields appear? Will they automatically appear with the names?
Yes, they will appear automatically. I have tested this config in my lab and it is working fine.