Need help indexing a simple XML file


I work with a file delivery system that relies on an XML "index" file that acts as a manifest of the files available for download in a given data set. I need to index these XML files so we can search and report on them in Splunk. While the files are fairly simple in construction, I am having trouble getting them indexed cleanly.

Here is a sample of an xml file:

<?xml version="1.0" encoding="UTF-8"?>


Here is my props.conf stanza:

KV_MODE = xml
category = Structured
description = JKCS Index file
disabled = false
pulldown_type = true
LINE_BREAKER = (\s*)</Record>(\s*)<Record>(\s*)
REPORT-jkcsxml = jkcsxml
TRANSFORMS-nullIndexHeader = nullIndexHeader

From the transforms.conf, here is the nullIndexHeader stanza to remove the header and extra tags:

[nullIndexHeader]
REGEX = (?m)^(<\?xml)|(\<DSIF\>)|(\<Heading\>)|(\<\/GenDate\>)|(\<\/Root\>)|(\<\/Heading\>)|(\<\/Record\>)|(\<\/DSIF\>)
DEST_KEY = queue
FORMAT = nullQueue

And here is the transforms.conf stanza to break out the xml tags:

[jkcsxml]
REGEX = <([^>]+)>([^<]*)</\1\>
FORMAT = $1::$2
MV_ADD = true
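
To make the behavior of this extraction regex easier to see, here is a minimal Python sketch that applies the same pattern to a single record. The sample XML did not survive in this post, so the tag names (FileName, FileSize, ModDate) and their values are hypothetical, and Python's re module is only an approximation of the regex engine Splunk uses.

```python
import re

# Hypothetical record body: the real tag names inside <Record> are not shown in this thread.
event = (
    "<FileName>report.csv</FileName>\n"
    "<FileSize>1024</FileSize>\n"
    "<ModDate>2018-03-01</ModDate>"
)

# Same pattern as the REGEX line in the transforms.conf stanza above.
pattern = re.compile(r"<([^>]+)>([^<]*)</\1\>")

fields = {}
for name, value in pattern.findall(event):
    # Keep every value in a list, loosely mimicking MV_ADD = true.
    fields.setdefault(name, []).append(value)

print(fields)
```

Run on its own, the pattern produces exactly one value per tag, which suggests the regex by itself is not where the doubling comes from.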

So my main problem is that after all of this, when I try to output a simple table, all of the results get doubled, like this:
[Screenshot: table output with each result doubled]

Why is that? Where is this duplication coming from? When I run a simple search to show the raw events, only one of each record is listed. A lot of this is cobbled together from other answers posted here, and some of it I don't entirely understand. I've been fighting regular expressions all last week just to get the fields extracted (because what works for me at regex101.com doesn't seem to apply to the LINE_BREAKER setting in the Add Data GUI), but I can't figure out this doubling of the results in the table.

Help greatly appreciated!

1 Solution

From what it looks like, it seems you only care about the data between the <Record></Record> tags, correct?

If this is the case, you could do something along these lines.

The configuration below trims the unwanted header and footer using SEDCMD (the regex may need to be altered if your data differs from the sample you provided). It includes a line breaker that breaks on each new record and also includes the date field that appears to be in the event. If this is not the correct date, you can simply set DATETIME_CONFIG = CURRENT or your desired value and remove the other time properties. I have set KV_MODE to none because the fields are being extracted manually.


SEDCMD-0_remove_header = s/(<\?xml[^\>]+\>\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+)//g
SEDCMD-1_remove_footer = s/(\<\/DSIF\>)//g
LINE_BREAKER = ([\s\n\r]+)(<Record>)
KV_MODE = none
REPORT-0_fields = your_custom_fields


[your_custom_fields]
REGEX = \s+\<([^\>]+)\>([^\<\n\r]+)\<\/
FORMAT = $1::$2

This configuration will index the data as follows


Note: the starting tags are not being indexed. Use this configuration only if that is your desired outcome.
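
For anyone who wants to experiment with these regexes outside Splunk, the following is a rough Python model of the three steps (LINE_BREAKER, the two SEDCMD substitutions, and the field REGEX). It is a sketch under assumptions, not Splunk's implementation: the input file is hypothetical (the header tags are taken from the question, the tags inside each Record are invented), break_events is a helper written for this example, and the only Splunk behavior it relies on is the documented rule that events break at the first capture group of LINE_BREAKER and that group is discarded.

```python
import re

# Hypothetical input: header tags follow the question, the tags inside <Record> are invented.
raw = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<DSIF>\n<Heading>\n<GenDate>2018-03-01</GenDate>\n<Root>/data</Root>\n</Heading>\n"
    "<Record>\n  <FileName>a.csv</FileName>\n  <FileSize>10</FileSize>\n</Record>\n"
    "<Record>\n  <FileName>b.csv</FileName>\n  <FileSize>20</FileSize>\n</Record>\n"
    "</DSIF>\n"
)

# Patterns copied from the props.conf and transforms.conf settings above.
LINE_BREAKER = r"([\s\n\r]+)(<Record>)"
SED_HEADER = r"(<\?xml[^\>]+\>\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+)"
SED_FOOTER = r"(\<\/DSIF\>)"
FIELD_REGEX = r"\s+\<([^\>]+)\>([^\<\n\r]+)\<\/"


def break_events(text, breaker):
    """Hypothetical helper: break at capture group 1 and discard that group."""
    events, start = [], 0
    for m in re.finditer(breaker, text):
        events.append(text[start:m.start(1)])
        start = m.end(1)
    events.append(text[start:])
    return events


events = []
for event in break_events(raw, LINE_BREAKER):
    event = re.sub(SED_HEADER, "", event)   # models SEDCMD-0_remove_header
    event = re.sub(SED_FOOTER, "", event)   # models SEDCMD-1_remove_footer
    event = event.strip()
    if event:                               # dropping empty events is a choice of this sketch
        events.append(event)

# One dict of extracted fields per event, as FORMAT = $1::$2 would name them.
fields = [dict(re.findall(FIELD_REGEX, e)) for e in events]

for e, f in zip(events, fields):
    print(repr(e), f)
```

Note that the header regex expects exactly five whitespace-separated tokens after the XML declaration, so a file with a different heading layout would need a different SEDCMD pattern.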


Sorry for the radio silence, got pulled away to other things.

That is exactly what I need. I will test that out right now.


That worked perfectly, thank you very much!

The timestamp in that field is just the modification date of the file in that record. The date/time I really need assigned to each 'event' in Splunk is the GenDate value from the top of the file, but I couldn't figure out a way to use it for each event. So for now, I'm just going with CURRENT. The data is only good for about 24 hours, until the process is run again.


I had a similar problem, but with JSON. Can you try setting KV_MODE to none? Something similar to this question https://answers.splunk.com/answers/626871/double-field-extraction-for-the-json-data.html
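
The idea behind this suggestion, as I understand it, is that two search-time extractions may both be writing the same value into the same field. The toy Python model below illustrates that idea only; it is not how Splunk is implemented. The functions auto_kv and report_transform are hypothetical stand-ins for an automatic extraction (such as KV_MODE = xml) and the REPORT- transform with MV_ADD = true, and the event is invented.

```python
import re

# Hypothetical single-field event.
event = "<FileName>report.csv</FileName>"


def auto_kv(text):
    """Stand-in for an automatic key/value extraction (assumption, not Splunk code)."""
    return re.findall(r"<([^>]+)>([^<]*)</\1>", text)


def report_transform(text):
    """Stand-in for the REPORT- transform from the question."""
    return re.findall(r"<([^>]+)>([^<]*)</\1\>", text)


def extract(text, passes):
    """Run each extraction pass and append every value it finds to the field."""
    fields = {}
    for extraction in passes:
        for name, value in extraction(text):
            fields.setdefault(name, []).append(value)
    return fields


both = extract(event, [auto_kv, report_transform])
only_report = extract(event, [report_transform])

print(both)
print(only_report)
```

In this model the field holds the value twice when both passes run and once when only the transform runs, which would match a table that shows doubled results while the raw events look fine.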


With KV_MODE set to none, my LINE_BREAKER regex no longer works (never understood why it worked, anyway).


Looking at another question about LINE_BREAKER on XML, I think yours should be something like (\<Record\>).
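
One thing to keep in mind with that pattern: per the props.conf documentation, whatever the first capture group of LINE_BREAKER matches is discarded, so a breaker that captures the tag itself would remove the tag from the events. A small Python sketch of that effect, using invented record contents and re.split as a rough stand-in for the line breaker:

```python
import re

# Hypothetical records: the tag inside <Record> is invented for illustration.
raw = (
    "<Record>\n<FileName>a.csv</FileName>\n</Record>\n"
    "<Record>\n<FileName>b.csv</FileName>\n</Record>\n"
)

# With LINE_BREAKER = (\<Record\>), the whole match is capture group 1,
# so splitting on it (and throwing the match away) models the documented rule.
pieces = re.split(r"(?:\<Record\>)", raw)

# Trimming whitespace and dropping empty pieces is a choice of this sketch.
events = [p.strip() for p in pieces if p.strip()]

for e in events:
    print(repr(e))
```

Each event here keeps its closing </Record> tag but loses the opening one, so this breaker is worth testing against real data before relying on it.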
