Solved: Need help to parse & flatten XML Attribute data in...

anirbandasdeb · ‎08-22-2018

We have data coming in XML in the following format:

Sample Event 1:

<Machine ServerID="ArCRw3ep7S1GJTEl4ZNgwJHWLjSd446mG2lyLus7Ira8" MachineID="3D3F7ECE-D01E-98A7-4F05-871905F2AB37" Prefix="3D" Name="TRAY PACKER NO.4">
    <Data Version="5.0.6392.4047">
        <Event Start="08/22/2018 11:00:00.000" End="08/22/2018 11:56:55.310" Job="4009396" Instance="7c7e630aab6b4ed2833920037203ec17" Shift="Shift 1">
            <Status Name="Running" StatusCount="28" Duration="3289562.2199" Production="973" Scrap="1" />
            <Status Name="Starved" StatusCount="28" Duration="125748" />
        </Event>        
    </Data>
</Machine>

Sample Event 2:

<Machine ServerID="ArCRw3ep7S1GJTEl4ZNgwJHWLjSd446mG2lyLus7Ira8" MachineID="3D3F7ECE-D01E-98A7-4F05-871905F2AB37" Prefix="3D" Name="TRAY PACKER NO.4">
    <Data Version="5.0.6392.4047">
        <Event Start="08/22/2018 11:00:00.000" End="08/22/2018 11:56:55.310" Job="4009396" Instance="7c7e630aab6b4ed2833920037203ec17" Shift="Shift 1">
            <Status Name="Running" StatusCount="28" Duration="3289562.2199" Production="973" Scrap="1" />
            <Status Name="Starved" StatusCount="28" Duration="125748" />
        </Event>
        <Event Start="08/22/2018 11:56:55.310" End="08/22/2018 12:56:55.310" Job="4009375" Instance="7c7e630aab6b4ed2833920037203ec17" Shift="Shift 1">
            <Status Name="Running" StatusCount="28" Duration="3289562.2199" Production="973" Scrap="1" />
            <Status Name="Starved" StatusCount="28" Duration="125748" />
        </Event>
    </Data>
</Machine>

Please note that the data is exclusively in XML attributes, and not in elements.
I am aware that we can possibly do it via Python pre-processing, but for now we need to flatten out the data using SPL.

We have tried multiple combinations of spath and mvexpand. However, since data is in attribute tags, we cannot split it into separate rows to show in a table form, when it is of the form given in the second XML event.

I am not sure we can handle this using a regex since, apart from a few, the attributes are not uniform throughout.

Can someone please help?

Thanks in advance.

Regards,
Anirban.

niketn · ‎08-23-2018

@anirbandasdeb, if possible it is better if you get Null fields inserted as NA or any other default values before data gets indexed in Splunk rather than doing this at Search Time. While using spath command the DOM structure should remain the same. Also the XML structure is complex as meaningful data is available at different hierarchy (several interesting fields at different levels).

However, to begin with you can try the following run anywhere search example based on Sample Data provided in the question.
PS:
1. I have replaced empty values for Production and Scrap with NA using eval on _raw data, which will be an expensive query considering total events you will operate upon.
2. I extracted interesting fields from two levels in the XML. i.e. Machine.Data.Event node attributes and the Event tree below it. If you need more data to be extracted, similar concept needs to be extended at each level i.e.
a. mvzip() to stitch multi value fields together
b. mvexpand to form separate events
c. split()/makemv to split the values as fields again.

| makeresults
| eval _raw="<Machine ServerID=\"ArCRw3ep7S1GJTEl4ZNgwJHWLjSd446mG2lyLus7Ira8\" MachineID=\"3D3F7ECE-D01E-98A7-4F05-871905F2AB37\" Prefix=\"3D\" Name=\"TRAY PACKER NO.4\">
     <Data Version=\"5.0.6392.4047\">
         <Event Start=\"08/22/2018 11:00:00.000\" End=\"08/22/2018 11:56:55.310\" Job=\"4009396\" Instance=\"7c7e630aab6b4ed2833920037203ec17\" Shift=\"Shift 1\">
             <Status Name=\"Running\" StatusCount=\"28\" Duration=\"3289562.2199\" Production=\"973\" Scrap=\"1\" />
             <Status Name=\"Starved\" StatusCount=\"28\" Duration=\"125748\" />
         </Event>
         <Event Start=\"08/22/2018 11:56:55.310\" End=\"08/22/2018 12:56:55.310\" Job=\"4009375\" Instance=\"7c7e630aab6b4ed2833920037203ec17\" Shift=\"Shift 1\">
             <Status Name=\"Running\" StatusCount=\"28\" Duration=\"3289562.2199\" Production=\"973\" Scrap=\"1\" />
             <Status Name=\"Starved\" StatusCount=\"28\" Duration=\"125748\" />
         </Event>
     </Data>
 </Machine>"
| append [| makeresults
| eval _raw=" <Machine ServerID=\"ArCRw3ep7S1GJTEl4ZNgwJHWLjSd446mG2lyLus7Ira8\" MachineID=\"3D3F7ECE-D01E-98A7-4F05-871905F2AB37\" Prefix=\"3D\" Name=\"TRAY PACKER NO.4\">
     <Data Version=\"5.0.6392.4047\">
         <Event Start=\"08/22/2018 11:00:00.000\" End=\"08/22/2018 11:56:55.310\" Job=\"4009396\" Instance=\"7c7e630aab6b4ed2833920037203ec17\" Shift=\"Shift 1\">
             <Status Name=\"Running\" StatusCount=\"28\" Duration=\"3289562.2199\" Production=\"973\" Scrap=\"1\" />
             <Status Name=\"Starved\" StatusCount=\"28\" Duration=\"125748\" />
         </Event>        
     </Data>
 </Machine>"]
| eval _raw=replace(_raw,"(Duration=\"[^\"]+\")\s\/\>","\1 Production=\"NA\" Scrap=\"NA\" />")
| spath path=Machine.Data.Event output=Event
| spath path=Machine.Data.Event{@Start} output=Start
| spath path=Machine.Data.Event{@End} output=End
| spath path=Machine.Data.Event{@Job} output=Job
| eval data=mvzip(Job,(mvzip(Start,mvzip(End,Event))))
| fields - _*, Start End Job Event
| mvexpand data
| makemv data delim=","
| eval Job=mvindex(data,0),Start=mvindex(data,1),End=mvindex(data,2),_raw=mvindex(data,3)
| fields - data
| spath
| rename "Status{@*}" as *
| eval data=mvzip(Name,(mvzip(Production,mvzip(Scrap,mvzip(StatusCount,Duration)))))
| fields - _*, Name Production Scrap StatusCount Duration
| mvexpand data
| makemv data delim=","
| eval Name=mvindex(data,0),Production=mvindex(data,1),Scrap=mvindex(data,2),StatusCount=mvindex(data,3),Duration=mvindex(data,4)
| fields - Data
| table Job Start End Name Production Scrap StatusCount Duration

PS: Since most of multi valued commands and eval on _raw data would be expensive to be performed at Search Time, you should consider following performance improvement methods:

Check feasibility of streamlining XML data at source system to have all attributes/field name present (with default value instead of no field name at all).
Ideally if you Have Forwarder, see if Regular Expression based replace can be used to set default values in case fields/attributes are missing in XML data i.e. through REGEX transforms or SEDCMD
If you have sufficient license, you can use collect to perform search similar to the one above to extract all required fields and index to a new index by scheduling the search query with collect command. Your actual Dashboard/Report query will run on the new index only. License cost will be based on whether you want default sourcetype=stash to be overridden with your custom sourcetype or not. Read documentation for details.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

View solution in original post

niketn · ‎08-23-2018

@anirbandasdeb, if possible it is better if you get Null fields inserted as NA or any other default values before data gets indexed in Splunk rather than doing this at Search Time. While using spath command the DOM structure should remain the same. Also the XML structure is complex as meaningful data is available at different hierarchy (several interesting fields at different levels).

However, to begin with you can try the following run anywhere search example based on Sample Data provided in the question.
PS:
1. I have replaced empty values for Production and Scrap with NA using eval on _raw data, which will be an expensive query considering total events you will operate upon.
2. I extracted interesting fields from two levels in the XML. i.e. Machine.Data.Event node attributes and the Event tree below it. If you need more data to be extracted, similar concept needs to be extended at each level i.e.
a. mvzip() to stitch multi value fields together
b. mvexpand to form separate events
c. split()/makemv to split the values as fields again.

| makeresults
| eval _raw="<Machine ServerID=\"ArCRw3ep7S1GJTEl4ZNgwJHWLjSd446mG2lyLus7Ira8\" MachineID=\"3D3F7ECE-D01E-98A7-4F05-871905F2AB37\" Prefix=\"3D\" Name=\"TRAY PACKER NO.4\">
     <Data Version=\"5.0.6392.4047\">
         <Event Start=\"08/22/2018 11:00:00.000\" End=\"08/22/2018 11:56:55.310\" Job=\"4009396\" Instance=\"7c7e630aab6b4ed2833920037203ec17\" Shift=\"Shift 1\">
             <Status Name=\"Running\" StatusCount=\"28\" Duration=\"3289562.2199\" Production=\"973\" Scrap=\"1\" />
             <Status Name=\"Starved\" StatusCount=\"28\" Duration=\"125748\" />
         </Event>
         <Event Start=\"08/22/2018 11:56:55.310\" End=\"08/22/2018 12:56:55.310\" Job=\"4009375\" Instance=\"7c7e630aab6b4ed2833920037203ec17\" Shift=\"Shift 1\">
             <Status Name=\"Running\" StatusCount=\"28\" Duration=\"3289562.2199\" Production=\"973\" Scrap=\"1\" />
             <Status Name=\"Starved\" StatusCount=\"28\" Duration=\"125748\" />
         </Event>
     </Data>
 </Machine>"
| append [| makeresults
| eval _raw=" <Machine ServerID=\"ArCRw3ep7S1GJTEl4ZNgwJHWLjSd446mG2lyLus7Ira8\" MachineID=\"3D3F7ECE-D01E-98A7-4F05-871905F2AB37\" Prefix=\"3D\" Name=\"TRAY PACKER NO.4\">
     <Data Version=\"5.0.6392.4047\">
         <Event Start=\"08/22/2018 11:00:00.000\" End=\"08/22/2018 11:56:55.310\" Job=\"4009396\" Instance=\"7c7e630aab6b4ed2833920037203ec17\" Shift=\"Shift 1\">
             <Status Name=\"Running\" StatusCount=\"28\" Duration=\"3289562.2199\" Production=\"973\" Scrap=\"1\" />
             <Status Name=\"Starved\" StatusCount=\"28\" Duration=\"125748\" />
         </Event>        
     </Data>
 </Machine>"]
| eval _raw=replace(_raw,"(Duration=\"[^\"]+\")\s\/\>","\1 Production=\"NA\" Scrap=\"NA\" />")
| spath path=Machine.Data.Event output=Event
| spath path=Machine.Data.Event{@Start} output=Start
| spath path=Machine.Data.Event{@End} output=End
| spath path=Machine.Data.Event{@Job} output=Job
| eval data=mvzip(Job,(mvzip(Start,mvzip(End,Event))))
| fields - _*, Start End Job Event
| mvexpand data
| makemv data delim=","
| eval Job=mvindex(data,0),Start=mvindex(data,1),End=mvindex(data,2),_raw=mvindex(data,3)
| fields - data
| spath
| rename "Status{@*}" as *
| eval data=mvzip(Name,(mvzip(Production,mvzip(Scrap,mvzip(StatusCount,Duration)))))
| fields - _*, Name Production Scrap StatusCount Duration
| mvexpand data
| makemv data delim=","
| eval Name=mvindex(data,0),Production=mvindex(data,1),Scrap=mvindex(data,2),StatusCount=mvindex(data,3),Duration=mvindex(data,4)
| fields - Data
| table Job Start End Name Production Scrap StatusCount Duration

PS: Since most of multi valued commands and eval on _raw data would be expensive to be performed at Search Time, you should consider following performance improvement methods:

Check feasibility of streamlining XML data at source system to have all attributes/field name present (with default value instead of no field name at all).
Ideally if you Have Forwarder, see if Regular Expression based replace can be used to set default values in case fields/attributes are missing in XML data i.e. through REGEX transforms or SEDCMD
If you have sufficient license, you can use collect to perform search similar to the one above to extract all required fields and index to a new index by scheduling the search query with collect command. Your actual Dashboard/Report query will run on the new index only. License cost will be based on whether you want default sourcetype=stash to be overridden with your custom sourcetype or not. Read documentation for details.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

anirbandasdeb · ‎08-23-2018

@niketnilay this is a lot!!
Thanks for looking into this at depth.

Regarding the source system, its a third party tool called Shoplogix, and data is pulled via web queries.
Ideally I would have preferred JSON over XML, but for this result set, they have only XML response to the web query and we cannot streamline the response any further without Python processing. 😞

The summary index idea sounds promising as well.

I will test these out and come back with more updates.

Thanks again!!

anirbandasdeb · ‎08-23-2018

@niketnilay how do you add new python modules to the Splunk Python?

My idea is to convert the XML to JSON using python, in the same python input script that we have scheduled to import the data.

The inbuilt Splunk python might not have XML-JSON conversion modules and we might need to import/install those specific modules to do this in the script.

niketn · ‎08-23-2018

@anirbandasdeb I am not sure at which stage processing input XML data you want to integrate your custom Script. Before indexing to Splunk (Scripted Input) or During Search Time (when it does not matter whether the data is XML or JSON as spath command remains the same for both).

If you are configuring your own Scripted input (You should eventually consider Splunk Add On Builder once the script is fully tested). You can refer to the following Splunk Answers with some leads: https://answers.splunk.com/answers/612626/integrating-splunk-dashboard-with-python.html

Do try out and confirm. Also Accept Answer/ Up Vote comments that help once this works for you.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

anirbandasdeb · ‎08-26-2018

@niketnilay

How do I import libraries into the script??

Ex.: https://docs.splunk.com/Documentation/Splunk/7.1.2/AdvancedDev/ScriptExample#Script_example.2C_poll_...

Here they are importing the _mssql library. If this library is not native to the Splunk Python, how do I install it? Can I just put the library in the same folder as the script?

I have not used the Add-On Builder yet, hence ignorant of the options it provides to deploy scripts.

niketn · ‎08-27-2018

Yes you can try that.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

anirbandasdeb · ‎08-22-2018

@ppablo_splunk , @mstjohn_splunk Thank You for updating and correcting my English!! I was absolutely exhausted after a long & grueling day and didn't stop to check my grammar.

mstjohn_splunk · ‎08-23-2018

@anirbandasdeb Just tweaked a few things! No problem! It is a good question 🙂

Need help to parse & flatten XML Attribute data in nested format.

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Why Splunk Customers Should Attend Cisco Live 2026 Las Vegas

What Is the Name of the USB Key Inserted by Bob Smith? (BOTS Hint, Not the Answer)

Automating Threat Operations and Threat Hunting with Recorded Future

Join the Conversation

Need help to parse & flatten XML Attribute data in nested format.

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Why Splunk Customers Should Attend Cisco Live 2026 Las Vegas

What Is the Name of the USB Key Inserted by Bob Smith? (BOTS Hint, Not the Answer)

Automating Threat Operations and Threat Hunting with Recorded Future