
Can you help me with a problem extracting XML?

manderson7 — Tue, 29 Sep 2020 23:21:25 GMT

I've scoured Google and Answers, but my XML looks a little different than most I've seen so far:

 <Doc_OutPut XML_Version="1.0">
      <Doc_Field>
        <Field_Name>BatchName</Field_Name>
        <Field_Value>GOCLM36962920190214001_19045SCLM000018</Field_Value>
      </Doc_Field>
      <Doc_Field>
        <Field_Name>GUID</Field_Name>
        <Field_Value>
        </Field_Value>
      </Doc_Field>
      <Doc_Field>
        <Field_Name>ph_Template</Field_Name>
        <Field_Value>
        </Field_Value>
      </Doc_Field>
      <Doc_Field>
        <Field_Name>phEmp_Template</Field_Name>
        <Field_Value>-Initial – Company</Field_Value>
      </Doc_Field>
      <Doc_Field>
        <Field_Name>phPhy_Template</Field_Name>
        <Field_Value>
        </Field_Value>
      </Doc_Field>
  </Doc_OutPut>

I'd like Splunk to use each Field_Name as the name of a field and the matching Field_Value as that field's value. I've tried this in
props.conf:

DATETIME_CONFIG = CURRENT
SHOULD_LINEMERGE = false
BREAK_ONLY_BEFORE = /<Doc_Field/>

What am I doing wrong here?

Re: Can you help me with a problem extracting XML?

chrisyounger — Tue, 29 Sep 2020 23:26:25 GMT

BREAK_ONLY_BEFORE is for splitting the data into multiple events, so I don't think it's what you are trying to do.

To get the fields extracted the way you want, you can use this (put it on your search head):

props.conf

[my_sourcetype]
REPORT-my_xml_pairs = my_xml_pairs

transforms.conf

[my_xml_pairs]
REGEX = <Field_Name>\s*(?<_KEY_1>.*?)\s*<\/Field_Name>.*?<Field_Value>\s*(?<_VAL_1>.*?)\s*<\/Field_Value>.*?

Good luck
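
To sanity-check the pairing idea outside Splunk, here is a minimal Python sketch. It is not Splunk's implementation; it only mimics the _KEY_1 / _VAL_1 convention, where the text captured as the key becomes the field name and the text captured as the value becomes that field's value. The helper name extract_pairs and the trimmed SAMPLE are made up for illustration. Python spells named groups as (?P<name>...), and the gap between the two tags is matched here with \s* instead of .*? so that it tolerates the line break in the sample.

```python
import re

# Trimmed copy of the XML from the question (illustrative only).
SAMPLE = """<Doc_OutPut XML_Version="1.0">
      <Doc_Field>
        <Field_Name>BatchName</Field_Name>
        <Field_Value>GOCLM36962920190214001_19045SCLM000018</Field_Value>
      </Doc_Field>
      <Doc_Field>
        <Field_Name>GUID</Field_Name>
        <Field_Value>
        </Field_Value>
      </Doc_Field>
</Doc_OutPut>"""

# Same shape as the transforms.conf regex, in Python syntax.
# Assumption: \s* between the tags replaces the original .*? so the
# pattern still matches when name and value sit on separate lines.
PAIR = re.compile(
    r"<Field_Name>\s*(?P<_KEY_1>.*?)\s*</Field_Name>\s*"
    r"<Field_Value>\s*(?P<_VAL_1>.*?)\s*</Field_Value>"
)


def extract_pairs(text):
    """Return a dict mapping each captured key to its captured value."""
    return {m.group("_KEY_1"): m.group("_VAL_1") for m in PAIR.finditer(text)}


fields = extract_pairs(SAMPLE)
for name, value in fields.items():
    print(f"{name}={value!r}")
```

Empty Field_Value elements come back as empty strings in this sketch; how Splunk treats an empty value is not something this example tries to reproduce.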

Re: Can you help me with a problem extracting XML?

manderson7 — Tue, 26 Feb 2019 20:53:20 GMT

Thanks very much, Chris. You're right, I believe I do want all the data in the text doc to show as 1 event.
Unfortunately, this did not extract the field names from the XML, and not all of the fields ended up in the 1 event. I ingested 1 file and got a single event that was 257 lines long; each of the remaining lines became its own event.
I ingested another file of the same type, but I added a \n in between & , but this didn't help w/ the field name extraction. I again got 1 event w/ 257 lines, and the rest of the lines were in their own events.
It worked on regex101, so I'm not sure what happened.
Do you have any ideas what could be the problem?
I also tried adding LINE_BREAKER = <\/Doc_OutPut> to the props, no go there either. The events still broke after 257 lines.
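
One possible reason a pattern can pass in a regex tester and still miss on multi-line data is flag handling: by default "." does not match a newline, and testers let you switch that on or off. The Python sketch below only demonstrates that difference on a made-up two-line string; whether this is what happened in Splunk here is an assumption, not something the thread confirms.

```python
import re

# Hypothetical two-line snippet: name and value are on separate lines.
TEXT = "<Field_Name>GUID</Field_Name>\n  <Field_Value>abc</Field_Value>"

# Simplified version of the pattern from the answer, in Python syntax.
PATTERN = r"<Field_Name>(.*?)</Field_Name>.*?<Field_Value>(.*?)</Field_Value>"

# Default flags: "." stops at the newline, so the middle .*? cannot bridge the tags.
default_match = re.search(PATTERN, TEXT)

# With DOTALL (the inline form is (?s)), "." may cross the newline.
dotall_match = re.search(PATTERN, TEXT, re.DOTALL)

print(default_match)
print(dotall_match.groups())
```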

Re: Can you help me with a problem extracting XML?

chrisyounger — Tue, 29 Sep 2020 23:26:31 GMT

Using LINE_BREAKER is the best thing to do. If the split works on Regex101 then it should work in Splunk. However, there are two tricks to be aware of:
1. Make sure you put the LINE_BREAKER where the parsing happens. This usually means the indexer or the first heavy forwarder the data goes through.
2. Make sure you have a "capture group" in your regular expression, otherwise it won't work, e.g. LINE_BREAKER = \<\/Doc_OutPut\>([\r\n]*)
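
To illustrate why the capture group matters, here is a rough Python emulation of the splitting rule: an event ends where the first capture group starts, the next event begins where that group ends, and the text matched by the group itself is thrown away. The function name split_events and the tiny sample stream are hypothetical, and this is a sketch of the rule, not Splunk's code.

```python
import re


def split_events(stream, line_breaker):
    """Split text on a LINE_BREAKER-style regex.

    The text matched by capture group 1 is discarded; everything before it
    (including the rest of the match) stays with the preceding event.
    """
    events = []
    start = 0
    for m in re.finditer(line_breaker, stream):
        events.append(stream[start:m.start(1)])  # event ends where group 1 begins
        start = m.end(1)                          # next event starts after group 1
    if stream[start:]:
        events.append(stream[start:])             # trailing text, if any
    return events


# Made-up stream with two documents back to back.
STREAM = "<Doc_OutPut>a</Doc_OutPut>\n<Doc_OutPut>b</Doc_OutPut>\n"

events = split_events(STREAM, r"</Doc_OutPut>([\r\n]*)")
for event in events:
    print(repr(event))
```

In this sketch the closing tag stays attached to its event because it sits outside the capture group, and a pattern with no group at all makes m.start(1) raise IndexError, which is only this emulation's way of showing that the rule has nothing to split on.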

Re: Can you help me with a problem extracting XML?

manderson7 — Tue, 26 Feb 2019 22:01:27 GMT

LINE_BREAKER did the trick, with the capture group. Didn't know that was required.
Still not getting field names.
props.conf

[ocr_xml]
REPORT-ocr_xml_pairs = ocr_xml_pairs

transforms.conf

[ocr_xml_pairs]
REGEX = <Field_Name>\s*(?<Name>.*?)\s*<\/Field_Name>\n.*?<Field_Value>\s*(?<_Value>.*?)\s*<\/Field_Value>.*?