Splunk Search

How to extract XML data from mixed content into one field for later use with spath?

roshannon
New Member

I have a mixed-output log that contains both XML and non-XML data. I am looking to extract the XML into a field that I can later run spath on to get individual fields. My sample data is below. I want to capture the entire <root>*</root> block in a single field so that I can later use spath to pull out the individual fields I might want to search on. I have seen other recommendations to put XML into a single field for later spath use, but they did not show how to do that.

2015 May 22 15:23:44:024 GMT -0700 BW.DomainDMSEvents-DomainDMSEvents-P01 User [BW-User] - Job-10003 [UtilityProcesses/CreateAuditTrail.process/Log]: AuditTrail: 10003|Projects/DomainDMSEvents/ProcDefs/Starters/PublishDMSScanEvents.process||file|||2015-05-22T15:23:44.022-07:00|DomainDMSEvents-DomainDMSEvents-P01||||false||
|<root>
    <messageIn>
        <channel>file</channel>
        <msgID>1432333424013</msgID>
        <corlID>1432333424013</corlID>
        <raw><?xml version="1.0" encoding="UTF-8"?>
   <ns0:EventSourceOuputNoContentClass xmlns:ns0="http://www.tibco.com/namespaces/tnt/plugins/file"><action>remove</action><timeOccurred>1432333424013</timeOccurred><fileInfo><fullName>/nfs/appdata/CTSE/OMS/DMS/DMSEvents.txt</fullName><fileName>DMSEvents.txt</fileName><location>/nfs/appdata/CTSE/OMS/DMS</location><configuredFileName>/nfs/appdata/CTSE/OMS/DMS/DMSEvents.txt</configuredFileName><type>file</type><readProtected>true</readProtected><writeProtected>true</writeProtected><size>5651</size><lastModified>2015-05-20T12:07:28-07:00</lastModified></fileInfo></ns0:EventSourceOuputNoContentClass></raw>
            <EMSHeaderProperties>
                <header>
                    <name>fileNewName</name>
                    <value>/nfs/appdata/CTSE/OMS/DMS/processed/DMSEvents.txt</value>
                </header>
                <header>
                    <name>fileName</name>
                    <value>/nfs/appdata/CTSE/OMS/DMS/DMSEvents.txt</value>
                </header>
                <header>
                    <name>timestamp</name>
                    <value>1432333424017</value>
                </header>
            </EMSHeaderProperties>
            <parsed>
                <type>filePoller</type>
                <other/>
            </parsed>
        </messageIn>
        <messageOut>
            <name>DocImageEvent</name>
            <TXInfo>
                <tranType>DocImageEvent</tranType>
                <evtType>DocImageEvent</evtType>
                <topicOverride>Domain.CTS.CTSE.Canonical.S2C.DomainDMSEvents.DocImageEvent</topicOverride>
            </TXInfo>
        </messageOut>
        <psDef>
            <funcArea>S2C</funcArea>
            <appSource>DomainDMSEvents</appSource>
            <txIdentifier>DocImageEvent</txIdentifier>
            <startTS>1432333424017</startTS>
        </psDef>
    </root>|
1 Solution

maciep
Champion

Not sure how consistent that log format is, but something like this seems to work for me in a limited test environment. I'm just using rex to grab the <root>...</root> portion of the event and throw it into a field called xml_field.

... | rex "(?<xml_field>\<root\>[\s\S]+\<\/root\>)"

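Once the XML is isolated in xml_field, spath can read from that field with input=xml_field. A minimal sketch, assuming the element paths from the sample event above (the output field names channel, tranType, and funcArea are just illustrative choices, not anything from the original post):

... | rex "(?<xml_field>\<root\>[\s\S]+\<\/root\>)"
    | spath input=xml_field path=root.messageIn.channel output=channel
    | spath input=xml_field path=root.messageOut.TXInfo.tranType output=tranType
    | spath input=xml_field path=root.psDef.funcArea output=funcArea
    | table _time channel tranType funcArea

Whether the path needs the leading root. prefix depends on how spath handles the XML root element, so it is worth a quick test against the real events to confirm.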