Transform XML log data before indexing - How can I get this to work?

sc0tt
Builder

I have a semi-formatted XML log that Splunk currently indexes without problems. However, much of what gets indexed is not needed. To save space, I would like to transform this log before indexing and keep only the fields I need. I'm following the "Anonymize data" approach as an example. Below are a sample of the log data, the desired indexed output, and my current configuration.

log data

2013-10-23 12:22:17,286 INFO  ==== Outgoing ==== <?xml version="1.0" encoding="UTF-8"?> <INTERFACE_API><UserId>55555555555</UserId><MsgType>Response</MsgType><Key>001</Key><SessionID>1000</SessionID></INTERFACE_API><EOM>
2013-10-23 12:22:17,274 INFO  ==== Incoming ==== <INTERFACE_API><UserId>55555555555</UserId><MsgType>Request</MsgType><Item>5</Item><Internal>INTERNAL_VALUE</Internal><SessionID>999999999999</SessionID></INTERFACE_API>

desired indexed data

2013-10-23 12:22:17,286 UserId=55555555555 Key=001
2013-10-23 12:22:17,274 UserId=55555555555 Item=5 SessionID=999999999999

props.conf

[test_io]
NO_BINARY_CHECK = 1
pulldown_type = 1
TRANSFORMS-xml = test-io-xml

transforms.conf

[test-io-xml]
REGEX = <(.*?)(?:\s[^>]*)?>([^<]*)</\\1>
FORMAT = $1=$2
DEST_KEY = _raw

I see no difference in my indexed data. I used the regex from the xmlkv.py script because it works at search time, so I assumed it extracts the XML values correctly.

How can I get this to work?
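
One possible contributor, offered as a guess rather than a confirmed diagnosis: the doubled backslash in `</\\1>` looks like Python string escaping carried over from xmlkv.py. Inside a non-raw Python string, `\\1` collapses to the backreference `\1`, but a regex engine that receives `\\1` verbatim reads it as a literal backslash followed by `1`. The sketch below uses Python's `re` module as a stand-in for Splunk's regex engine (an assumption) and a shortened copy of the sample line.

```python
import re

# Shortened copy of the "Incoming" sample line from the post.
line = ("<INTERFACE_API><UserId>55555555555</UserId>"
        "<MsgType>Request</MsgType><Item>5</Item></INTERFACE_API>")

# Non-raw Python string: "\\1" in the source becomes the backreference \1
# by the time the regex engine sees it.
python_style = "<(.*?)(?:\\s[^>]*)?>([^<]*)</\\1>"

# Raw string, i.e. the characters exactly as typed in transforms.conf:
# the engine receives "\\1", a literal backslash followed by "1".
conf_style = r"<(.*?)(?:\s[^>]*)?>([^<]*)</\\1>"

pairs = re.findall(python_style, line)
literal = re.findall(conf_style, line)

print(pairs)    # tag/value pairs found with the backreference
print(literal)  # nothing matches, because the line contains no backslash
```

If this is the cause, writing the closing tag as `</\1>` in transforms.conf would be the corresponding change. Whether a single transform writing to `_raw` can emit every pair is a separate question that this sketch does not address.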

update:
I've also tried modifying my props.conf to use SEDCMD. I added a sed expression that should remove all tags, just to check whether I can alter the indexed data at all, but it still has no effect.

[test_io]
NO_BINARY_CHECK = 1
pulldown_type = 1
SEDCMD-testio = s/<[^<>]\{1,\}>//g
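
A likely reason the expression above has no effect, assuming SEDCMD interprets its pattern with Perl-style syntax rather than POSIX basic syntax: `\{1,\}` is the basic-regex spelling of "one or more", whereas in a Perl-style engine `\{` and `\}` are literal braces. The sketch below uses Python's `re` module as a stand-in for Splunk's engine (an assumption) and a shortened copy of the sample line.

```python
import re

# Shortened copy of the "Incoming" sample line from the post.
line = ("2013-10-23 12:22:17,274 INFO  ==== Incoming ==== "
        "<INTERFACE_API><UserId>55555555555</UserId><Item>5</Item></INTERFACE_API>")

# POSIX-basic spelling: a Perl-style engine treats "\{1,\}" as the literal
# text "{1,}", so the pattern only matches strings such as "<a{1,}>".
bre_style = re.sub(r"<[^<>]\{1,\}>", "", line)

# Perl-style spelling of "one or more characters that are not < or >".
pcre_style = re.sub(r"<[^<>]+>", "", line)

print(bre_style == line)  # the line is left untouched
print(pcre_style)         # all tags stripped, values run together
```

Under that assumption, the equivalent setting would use `+` in place of `\{1,\}`.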

update 2:
The sed script above doesn't appear to work, but I was able to get another test script to work. Now that I know props.conf is being applied correctly, is there a way to remove the XML tags and create key=value pairs with sed?

1 Solution

sc0tt
Builder

I was able to use a sed script like the one below to extract and format the fields I wanted.

SEDCMD-testio = s/(.*) INFO.*<UserId>(.*)<\/UserId>.*<Item>(.*)<\/Item>.*<SessionID>(.*)<\/SessionID>.*/\1 UserId=\2 Item=\3 SessionID=\4/

The indexed data is then formatted as

2013-10-23 12:22:17,274 UserId=55555555555 Item=5 SessionID=999999999999
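
The expression above requires an `<Item>` element, so it only rewrites the "Incoming" sample line; the "Outgoing" line, which carries `<Key>` instead, would pass through unchanged. The sketch below replays the substitution in Python's `re` module (a stand-in for Splunk's engine, which is an assumption) and adds a second, hypothetical expression for the outgoing shape. The second pattern is not from the original answer and presumably would go in its own SEDCMD-<class> setting.

```python
import re

# The two sample lines from the question.
incoming = ("2013-10-23 12:22:17,274 INFO  ==== Incoming ==== <INTERFACE_API>"
            "<UserId>55555555555</UserId><MsgType>Request</MsgType><Item>5</Item>"
            "<Internal>INTERNAL_VALUE</Internal><SessionID>999999999999</SessionID>"
            "</INTERFACE_API>")
outgoing = ('2013-10-23 12:22:17,286 INFO  ==== Outgoing ==== '
            '<?xml version="1.0" encoding="UTF-8"?> <INTERFACE_API>'
            '<UserId>55555555555</UserId><MsgType>Response</MsgType><Key>001</Key>'
            '<SessionID>1000</SessionID></INTERFACE_API><EOM>')

# Pattern and replacement taken from the accepted sed script.
ITEM_RULE = (
    r"(.*) INFO.*<UserId>(.*)</UserId>.*<Item>(.*)</Item>"
    r".*<SessionID>(.*)</SessionID>.*",
    r"\1 UserId=\2 Item=\3 SessionID=\4",
)

# Hypothetical companion rule for lines that carry <Key> instead of <Item>.
KEY_RULE = (
    r"(.*) INFO.*<UserId>(.*)</UserId>.*<Key>(.*)</Key>.*",
    r"\1 UserId=\2 Key=\3",
)

def reformat(line: str) -> str:
    """Apply both substitutions in order; a rule that does not match is a no-op."""
    for pattern, replacement in (ITEM_RULE, KEY_RULE):
        line = re.sub(pattern, replacement, line)
    return line

print(reformat(incoming))
print(reformat(outgoing))
```

With these two rules, both sample lines come out in the shape listed under "desired indexed data". Lines that match neither pattern are left as they are, so any other message types in the log would still be indexed in full.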
