XML input line-breaking and field extraction - how?

Justin_Grant
Contributor

I am trying to index an XML file which looks like this:

 <?xml version="1.0" encoding="utf-8" ?> 
 <Posts2Votes>
  <row>
   <Id>1</Id> 
   <PostId>7</PostId> 
   <UserId>2</UserId> 
   <VoteTypeId>2</VoteTypeId> 
   <CreationDate>2009-11-06T02:22:37.063</CreationDate> 
   <TargetUserId>7</TargetUserId> 
   <TargetRepChange>10</TargetRepChange> 
   <IPAddress>64.127.105.60</IPAddress> 
  </row>
  <row>
   <Id>2</Id> 
   <PostId>6</PostId> 
   <UserId>2</UserId> 
   <VoteTypeId>2</VoteTypeId> 
   <CreationDate>2009-11-06T02:22:38.25</CreationDate> 
   <TargetUserId>31</TargetUserId> 
   <TargetRepChange>10</TargetRepChange> 
   <IPAddress>64.127.105.60</IPAddress> 
  </row>
  <!-- more "row" elements go here -->
 </Posts2Votes>

Splunk's default parser recognizes the timestamps correctly, but it does not split the events on each <row> element, and no fields are extracted by default. How do I break the events correctly and extract these fields? Any ideas?

1 Solution

gkanapathy
Splunk Employee

props.conf

TIME_PREFIX = \<CreationDate\>
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N
SHOULD_LINEMERGE = false
LINE_BREAKER = \>\s*(?=\<row\>)
REPORT-xmlext = xml-extr

transforms.conf

[xml-extr]
REGEX = \<(\w+)\>([^\>]*)\<\1\>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true

should do it.
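
For anyone who wants to see the intent of those props.conf settings outside Splunk, here is a rough Python sketch. It is only an approximation: Python's re.split and strptime stand in for Splunk's line breaking and timestamp parsing, the split pattern is a simplified variant rather than the exact LINE_BREAKER above, %f is used in place of %3N, and SAMPLE, events, and creation_time are names made up for this illustration.

```python
import re
from datetime import datetime

# Trimmed copy of the data from the question (hypothetical test input).
SAMPLE = """<?xml version="1.0" encoding="utf-8" ?>
<Posts2Votes>
 <row>
  <Id>1</Id>
  <CreationDate>2009-11-06T02:22:37.063</CreationDate>
 </row>
 <row>
  <Id>2</Id>
  <CreationDate>2009-11-06T02:22:38.25</CreationDate>
 </row>
</Posts2Votes>
"""

# Break in front of every <row>, discarding the whitespace before it.
# The lookahead keeps the <row> tag itself at the start of the next event.
chunks = re.split(r"\s+(?=<row>)", SAMPLE)

# The first chunk is the XML header; the last event still carries the
# closing </Posts2Votes> tag, since nothing breaks after the final row.
events = [c for c in chunks if c.startswith("<row>")]


def creation_time(event):
    """Read the timestamp that follows <CreationDate> in one event."""
    match = re.search(r"<CreationDate>([^<]+)", event)
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")


for event in events:
    print(creation_time(event).isoformat())
```

Each element of events corresponds to one <row>, and the timestamp is read from the text right after <CreationDate>, which is the role TIME_PREFIX plays in the config.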

charlie_park2
Explorer

Thanks. This is a very helpful post. The documentation really should be a lot more newbie-friendly. Thanks.

woodcock
Esteemed Legend

I have tested this and it works:

REGEX = <([^>]+)>([^<]*)<\/\1>

gljiva
Path Finder

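There is a small error in the regex in the accepted answer: the closing tag is missing its / and the value class should exclude < rather than >. The correct one is: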

REGEX = \<(\w+)\>([^\<]*)\</\1\>
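
A quick way to see the difference is to run both patterns over one event. The sketch below uses Python's re module as a stand-in for Splunk's regex engine, so treat it as an illustration rather than proof of Splunk's behavior; ROW, ORIGINAL, CORRECTED, and extract_fields are hypothetical names used only here.

```python
import re

# One event, shortened from the sample in the question.
ROW = """<row>
 <Id>1</Id>
 <PostId>7</PostId>
 <IPAddress>64.127.105.60</IPAddress>
</row>"""

# Pattern from the accepted answer: no slash in the closing tag.
ORIGINAL = r"\<(\w+)\>([^\>]*)\<\1\>"
# Pattern with the closing slash, and a value that stops at "<".
CORRECTED = r"\<(\w+)\>([^\<]*)\</\1\>"


def extract_fields(pattern, event):
    """Return {tag: value} for every match, similar in spirit to
    FORMAT = $1::$2 combined with REPEAT_MATCH = true."""
    return dict(re.findall(pattern, event))


print(extract_fields(ORIGINAL, ROW))   # nothing matches without the slash
print(extract_fields(CORRECTED, ROW))  # one entry per leaf element
```

Only leaf elements are picked up by the corrected pattern; the enclosing <row> is skipped because its content contains further tags.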

BunnyHop
Contributor

Were you able to get this to work? I tried it, but it does not break the events apart cleanly.

My data has sub-elements inside the top-level group: after the row group there is a subrow that contains data for that row, so that might be what is throwing me off.
