Splunk Search

XML input line-breaking and field extraction - how?

Justin_Grant
Contributor

I am trying to index an XML file which looks like this:

 <?xml version="1.0" encoding="utf-8" ?> 
 <Posts2Votes>
  <row>
   <Id>1</Id> 
   <PostId>7</PostId> 
   <UserId>2</UserId> 
   <VoteTypeId>2</VoteTypeId> 
   <CreationDate>2009-11-06T02:22:37.063</CreationDate> 
   <TargetUserId>7</TargetUserId> 
   <TargetRepChange>10</TargetRepChange> 
   <IPAddress>64.127.105.60</IPAddress> 
  </row>
  <row>
   <Id>2</Id> 
   <PostId>6</PostId> 
   <UserId>2</UserId> 
   <VoteTypeId>2</VoteTypeId> 
   <CreationDate>2009-11-06T02:22:38.25</CreationDate> 
   <TargetUserId>31</TargetUserId> 
   <TargetRepChange>10</TargetRepChange> 
   <IPAddress>64.127.105.60</IPAddress> 
  </row>
  <!-- more "row" elements go here -->
 </Posts2Votes>

Splunk's default parser recognizes the timestamps correctly but does not split the events on each <row> element, and no fields are extracted by default. How can I extract these fields and break the lines correctly? Any ideas?

1 Solution

gkanapathy
Splunk Employee

props.conf

TIME_PREFIX = \<CreationDate\>
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N
SHOULD_LINEMERGE = false
LINE_BREAKER = \>\s*(?=\<row\>)
REPORT-xmlext = xml-extr

transforms.conf

[xml-extr]
REGEX = \<(\w+)\>([^\>]*)\<\1\>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true

should do it.
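
To see where the events are meant to break, the intent of the LINE_BREAKER above can be approximated outside Splunk. This is only a rough Python sketch, not Splunk's indexing pipeline: it uses `re.split` with a lookbehind as a stand-in for Splunk's line-breaking logic, the sample is trimmed to two fields per row, and `SAMPLE`, `BOUNDARY`, and `split_rows` are names made up for illustration.

```python
import re

# Trimmed copy of the sample file from the question (two rows, two fields each).
SAMPLE = """<?xml version="1.0" encoding="utf-8" ?>
<Posts2Votes>
  <row>
    <Id>1</Id>
    <CreationDate>2009-11-06T02:22:37.063</CreationDate>
  </row>
  <row>
    <Id>2</Id>
    <CreationDate>2009-11-06T02:22:38.25</CreationDate>
  </row>
</Posts2Votes>
"""

# Break wherever a ">" is followed by optional whitespace and then "<row>".
# The lookbehind keeps the ">" in the preceding chunk, so only the
# whitespace between chunks is dropped. (Assumption: this mirrors the
# intent of the LINE_BREAKER setting, not Splunk's exact behavior.)
BOUNDARY = re.compile(r"(?<=>)\s*(?=<row>)")


def split_rows(text):
    """Return the non-empty chunks between boundaries (hypothetical helper)."""
    return [chunk for chunk in BOUNDARY.split(text) if chunk.strip()]


chunks = split_rows(SAMPLE)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)
```

In this sketch the first chunk holds the XML declaration and the opening <Posts2Votes> tag, and the last <row> chunk still carries the closing </Posts2Votes> tag, so it may be worth checking how the first and last events look after indexing.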

charlie_park2
Explorer

Thanks. This is a very helpful post. The documentation really should be a lot more newbie-friendly. Thanks.

woodcock
Esteemed Legend

This is tested and working:

REGEX = <([^>]+)>([^<]*)<\/\1>

gljiva
Path Finder

There is a small error in the regex above; the correct one is:

REGEX = \<(\w+)\>([^\<]*)\</\1\>
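
For anyone comparing the three REGEX variants in this thread, here is a small Python sketch that runs each one against a single row. It is an approximation under stated assumptions: Python's `re` module stands in for Splunk's regex engine, `re.findall` stands in for REPEAT_MATCH, the row is trimmed to four fields, and `ROW`, `PATTERNS`, and `extract` are illustrative names.

```python
import re

# One event as it might look after line breaking (trimmed to four fields).
ROW = """<row>
   <Id>1</Id>
   <PostId>7</PostId>
   <CreationDate>2009-11-06T02:22:37.063</CreationDate>
   <IPAddress>64.127.105.60</IPAddress>
  </row>"""

# The three REGEX values posted in this thread, copied verbatim.
PATTERNS = {
    "accepted solution": r"\<(\w+)\>([^\>]*)\<\1\>",
    "woodcock": r"<([^>]+)>([^<]*)<\/\1>",
    "gljiva": r"\<(\w+)\>([^\<]*)\</\1\>",
}


def extract(pattern, text):
    """Return (name, value) pairs, roughly what FORMAT = $1::$2 would emit."""
    return re.findall(pattern, text)


results = {name: extract(pattern, ROW) for name, pattern in PATTERNS.items()}
for name, pairs in results.items():
    print(f"{name}: {pairs}")
```

Under Python's `re`, the pattern from the accepted solution finds nothing, because it looks for a second opening tag (`<Id>`) rather than the closing tag (`</Id>`). The other two patterns return the same name/value pairs for this row.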

BunnyHop
Contributor

Were you able to get this to work? I tried it, but it does not break the events cleanly from one another.

I do have subdata within the top group: after the row group, I have a subrow that contains data for the row group, so that might be what's throwing me off.
