XML field extraction with a twist

Contributor

Data example:

<Asset href="/company/rest-1.v1/Data/Story/2530981/6709286" id="Story:2530981:6709286"><Attribute name="Status.Name">Ready</Attribute><Attribute name="Number">B-107445</Attribute><Attribute name="Name">Upgrade Splunk Windows TA</Attribute><Attribute name="ChangeDate">2020-01-29T13:49:44.337</Attribute><Attribute name="CreateDate">2019-03-12T12:49:22.703</Attribute><Attribute name="Owners.Name"><Value>owner one</Value><Value>owner two</Value></Attribute></Asset>

and

<Asset href="/company/rest-1.v1/Data/Story/3644941/6720976" id="Story:3644941:6720976"><Attribute name="Status.Name">Ready</Attribute><Attribute name="Number">B-143465</Attribute><Attribute name="Name">Review/Upgrade Splunk_TA_Nix to v7</Attribute><Attribute name="ChangeDate">2020-01-30T12:54:07.103</Attribute><Attribute name="CreateDate">2020-01-15T10:40:49.307</Attribute><Attribute name="Owners.Name"><Value>owner one</Value></Attribute></Asset>

I've finally gotten my XML to separate into events, but I'm stuck on getting the fields to work. I'd like to have
Status.Name = Ready
Number = B-143465
ChangeDate = 2020-01-30T12:54:07.103
and so on

I created this regex using the field extractor and regex101:

^(?:[^>\n]*>){2}(?P<Status_Name>\w+\s+\w+|\w+)(?:[^>\n]*>){2}(?P<Number>\w+\-\d+)[^ \n]* \w+="\w+">(?P<Name>[^<]+)[^ \n]* \w+="\w+">(?P<ChangeDate>[^<]+)(?:[^"\n]*"){2}>(?<CreateDate>[^<]+)(?:[^"\n]*"){2}><\w+>(?P<Owners_Name>\w+\s+\w+)

This gets me most of the way there, but it doesn't work when there are multiple owner values.
Can someone suggest a fix? I'd also appreciate help implementing the regex in transforms.conf. I think I can call it using
PROPS

...
REPORT-V1 = v1_fields

TRANSFORMS

[v1_fields]
REGEX = ^(?:[^>\n]*>){2}(?P<Status_Name>\w+\s+\w+|\w+)(?:[^>\n]*>){2}(?P<Number>\w+\-\d+)[^ \n]* \w+="\w+">(?P<Name>[^<]+)[^ \n]* \w+="\w+">(?P<ChangeDate>[^<]+)(?:[^"\n]*"){2}>(?<CreateDate>[^<]+)(?:[^"\n]*"){2}><\w+>(?P<Owners_Name>\w+\s+\w+)

But I don't know if I need to add a FORMAT = $1::$2 line (nor do I know what that line does ... )

Any help you can provide here would be great.
I've also tried KV_MODE=xml on the search head, but that doesn't give me the field names I want, just values for
Asset.Attribute
Asset.Attribute.Value
etc

Thanks

Ultra Champion

transforms.conf

  • For example, the following are equivalent for search-time field extractions:
    • Using FORMAT:
      • REGEX = ([a-z]+)=([a-z]+)
      • FORMAT = $1::$2
    • Without using FORMAT
      • REGEX = (?<_KEY_1>[a-z]+)=(?<_VAL_1>[a-z]+)
    • When using either of the above formats, in a search-time extraction, the regular expression attempts to match against the source text, extracting as many fields as can be identified in the source text.

FORMAT version:

REGEX = \<Attribute name=\"([^\"]+)\"\>(?:\<Value\>)?(.*?)(?:\<\/Value\>)?\<\/Attribute\>
FORMAT = $1::$2

regexr.com/4vca1
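
To see what this REGEX and FORMAT = $1::$2 pair would produce, here is a minimal Python sketch that imitates the behavior outside Splunk. It assumes Python's re engine treats this pattern the same way Splunk's does; the helper extract_pairs and the trimmed sample events are made up for illustration, and field names are shown exactly as captured.

```python
import re

# Trimmed-down events modeled on the two samples in the question.
single_owner = (
    '<Asset id="Story:3644941:6720976">'
    '<Attribute name="Status.Name">Ready</Attribute>'
    '<Attribute name="Number">B-143465</Attribute>'
    '<Attribute name="Owners.Name"><Value>owner one</Value></Attribute>'
    '</Asset>'
)
two_owners = (
    '<Asset id="Story:2530981:6709286">'
    '<Attribute name="Status.Name">Ready</Attribute>'
    '<Attribute name="Number">B-107445</Attribute>'
    '<Attribute name="Owners.Name"><Value>owner one</Value>'
    '<Value>owner two</Value></Attribute>'
    '</Asset>'
)

# Same pattern as the REGEX above, minus the redundant backslashes.
ATTRIBUTE = re.compile(
    r'<Attribute name="([^"]+)">(?:<Value>)?(.*?)(?:</Value>)?</Attribute>'
)


def extract_pairs(event):
    """Imitate FORMAT = $1::$2: group 1 is the field name, group 2 the value.

    The pattern is applied repeatedly, so every Attribute element in the
    event contributes one name/value pair.
    """
    return {name: value for name, value in ATTRIBUTE.findall(event)}


print(extract_pairs(single_owner))
print(extract_pairs(two_owners)["Owners.Name"])
```

With a single owner the optional Value tags are stripped cleanly. With two owners only the outermost pair of tags is stripped, so the inner closing and opening Value tags stay inside the captured value.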

Contributor

That works in regex101, to an extent.
The Owners.Name field keeps the closed/open tags between the names, like

owner one< /value> <value>owner two

Is there any way around this, or is this the best that can be done?
Also, since these are search-time field extractions, I need to put them on the search head, not the ingest host. Thanks for that.

Also, thank you for your help, and for explaining the transforms.

Ultra Champion

[first trans]
REGEX = \<Attribute name=\"([^\"]+)\"\>(.*?)\<\/Attribute\>
FORMAT = $1::$2

[second trans]
SOURCE_KEY = "Owners.Name"
REGEX = \<value\>(.*?)\<\/value\> 
FORMAT = Owners_name::$1
MV_ADD = true
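
The two-stage idea can be sketched in Python as follows. This is only an illustration under assumptions: extract_owners is a hypothetical helper, the first stage looks the value up under the raw attribute name, and the second stage uses the capitalised Value tag as it appears in the sample data.

```python
import re

# Stage 1 pattern: capture everything between the Attribute tags, tags included.
ATTRIBUTE = re.compile(r'<Attribute name="([^"]+)">(.*?)</Attribute>')
# Stage 2 pattern: pull each owner out of the captured Owners.Name value.
VALUE = re.compile(r'<Value>(.*?)</Value>')

event = (
    '<Asset id="Story:2530981:6709286">'
    '<Attribute name="Number">B-107445</Attribute>'
    '<Attribute name="Owners.Name"><Value>owner one</Value>'
    '<Value>owner two</Value></Attribute>'
    '</Asset>'
)


def extract_owners(raw_event):
    """Run the second pattern against the field produced by the first."""
    fields = dict(ATTRIBUTE.findall(raw_event))
    raw_owners = fields.get("Owners.Name", "")
    # Collecting every match into a list plays the role of MV_ADD = true.
    return VALUE.findall(raw_owners)


owners = extract_owners(event)
print(owners)

# Matching is case-sensitive: a lowercase tag name finds nothing here.
lowercase_hits = re.findall(r'<value>(.*?)</value>', event)
print(lowercase_hits)
```

The second stage only has something to work on if it reads from the field that the first stage actually created, which is the role SOURCE_KEY plays in the transforms.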

Contributor

Thanks for your help. Unfortunately, I'm still getting
< Value>name one< /Value>< Value>name two< /Value>
minus the spaces.

transforms.conf is:
[version1_fields]
REGEX = <Attribute name=\"([^\"]+)\">(.*?)<\/Attribute>
FORMAT=$1::$2

[v1_ownername]
SOURCE_KEY = "Owners.Name"
REGEX = \<Value\>(.*?)\<\/Value\>
FORMAT = Owners.Name::$1
MV_ADD = true

I made Value uppercase in the regex and changed FORMAT from Owners_name to Owners.Name, but it didn't help. props.conf looks like:

[version1_xml]
REPORT-v1 = version1_fields
REPORT-v12 = v1_ownername

Update:
I changed props to
[version1_xml]
REPORT-v1 = version1_fields,v1_ownername
and restarted, but unfortunately nothing changed: I'm still seeing the multiple owners in a single value, surrounded by the < Value>< /Value> tags.

Ultra Champion

The second transform only extracts the owner field.
The value of Owners.Name contains < Value>, but what does Owners_name contain?
I separated the extraction into two transforms so you can check whether the field name is correct.
If Owners_name is empty, fix the field name.

Contributor

This is the final props and transforms configuration that worked. Thanks again for all your help.

Transforms:

[version1_fields]
REGEX = \<Attribute name=\"([^\"]+)\"\>(.*?)\<\/Attribute\>
FORMAT=$1::$2

[v1_ownername]
SOURCE_KEY = Owners_Name
REGEX = \<Value\>(?<Owner>.*?)\<\/Value\>
MV_ADD = true
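
A likely reason SOURCE_KEY = Owners_Name works where "Owners.Name" did not: with the default CLEAN_KEYS = true setting, Splunk replaces non-alphanumeric characters in extracted field names with underscores, so the first transform creates Owners_Name rather than Owners.Name. The Python sketch below approximates that cleaning rule as the transforms.conf documentation describes it. The function clean_key is a hypothetical stand-in, not Splunk code.

```python
import re


def clean_key(name):
    """Approximate Splunk's default key cleaning (an assumption, not Splunk code).

    Non-alphanumeric characters become underscores, then any leading
    underscores or digits are stripped from the field name.
    """
    cleaned = re.sub(r'[^A-Za-z0-9]', '_', name)
    return re.sub(r'^[_0-9]+', '', cleaned)


for raw_name in ("Status.Name", "Number", "Owners.Name"):
    print(raw_name, "->", clean_key(raw_name))
```

Under this assumption, the field available to the second transform is Owners_Name, which matches the SOURCE_KEY in the working configuration.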