Multi-valued Index-time key extraction not working, please help!

jonaubrey
Explorer

Hi all,

I've been struggling with Splunk for weeks now (and had Developer training!) and I still can't get it to do what I want it to do, so here begins the first of many questions....

I'm attempting to build an app that does a single parse of some static data. Basically, it's designed to read in lots of files and then, using a dashboard, display the data in a meaningful way.

As such I'm attempting to do Index-time field extraction, as I want the displays to be as fast as possible for the end user. I've tried this a thousand ways and I can't get it working 😞

All of the data is in XML format, and a large chunk of it features multi-valued fields, which is where I'm getting stuck. I can extract multi-valued fields with no problem using rex, but Splunk seems to refuse to do it using the config files. I've compiled the following example to show you what I mean. I've only done it with one file, but I'm having the same problem with all the files I'm pulling in:

props.conf

[nessus]
SHOULD_LINEMERGE = False
LINE_BREAKER = (?<=</ReportHost>)([\r\n]+)
TRUNCATE = 0
TRANSFORMS-nessus_high_vulnerbility = nessus_high_vulnerbility

transforms.conf

[nessus_high_vulnerbility]
REGEX = <ReportItem.*severity=\"3\".*pluginName=\"([^"]+)\"
FORMAT = nessus_high_vulnerbility::"$1"
LOOKAHEAD = 10000000000
WRITE_META = true
REPEAT_MATCH = true

fields.conf

[nessus_high_vulnerbility]
INDEXED = true

Example data

<Report name="1.1.1.1">
<ReportHost name="1.1.1.1"><HostProperties>
<tag name="HOST_END">Tue Nov 22 12:06:01 2011</tag>
<tag name="system-type">general-purpose</tag>
<tag name="operating-system">Linux Kernel 2.6.9-101.ELsmp on Red Hat Enterprise Linux ES release 4 (Nahant Update 9)</tag>
<tag name="mac-address">00:00:00:00:00:00</tag>
<ReportItem port="1234" svc_name="snmp?" protocol="udp" severity="3" pluginID="51160" pluginName="SNMP Agent Default Community Name (public)" pluginFamily="SNMP">
</ReportItem>
<ReportItem port="0" svc_name="general" protocol="tcp" severity="3" pluginID="21157" pluginName="Unix Compliance Checks" pluginFamily="Policy Compliance">
</ReportItem>
</ReportHost>
</Report>

Now if I search for * it tells me that the "nessus_high_vulnerbility" field has one result.

But if I do the following search, the "high_vulnerbility" field has 2 results, the correct number.

* | rex "\<ReportItem.*severity=\"3\".*pluginName=\"(?<high_vulnerbility>[^\"]+)\"" max_match=100000

I've tried everything I can think of, been through the documentation a hundred times, and still can't figure it out. Please help!

(PS, apologies if the above doesn't come out right, I'm struggling with getting Markdown to play nicely with the pasted code)

ahattrell_splun
Splunk Employee

Interestingly, when I replace your transforms.conf with the following:

[getPlugin]
REGEX = severity=\"3\".*?pluginName=\"([^\"]+)
FORMAT = pluginName::$1
MV_ADD = true

I get a multi-value field pluginName with 2 values SNMP & UNIX. So it's something to do with the extended regex you're using.
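
For completeness, a transform like the one above only takes effect once a props.conf stanza refers to it. A minimal sketch, assuming the [nessus] sourcetype from the question and a search-time REPORT binding (MV_ADD applies to search-time extractions); the class label after REPORT- is an arbitrary name chosen here, not something taken from the reply:

```
# props.conf (sketch; stanza name taken from the question)
[nessus]
# REPORT- binds a transforms.conf stanza as a search-time extraction.
# The "getPlugin" label after REPORT- is an arbitrary name (assumption).
REPORT-getPlugin = getPlugin
```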

To be honest, I'd be hesitant to use the regex to filter data. Instead, I'd aim to add all the fields and then filter using Splunk's native search capabilities. You never know when you might need to search using different criteria, and by hard-coding your results you limit that flexibility.

As an aside, the XML as written is broken: the HostProperties tag doesn't seem to be closed.

Ayn
Legend

First off, I highly doubt you really want to use index-time field extractions unless you really, really know what you are doing and why. Index-time extractions will in fact most often decrease performance rather than increase it. Indexed fields do not work the same way as they do in traditional RDBMSs, so applying that kind of thinking to Splunk is a mistake. Use search-time field extractions: the performance is better, and it makes Splunk's behaviour less confusing and more flexible. So, I would advise you to change your TRANSFORMS directive in props.conf to a REPORT directive instead.

That said, I think the issue here is that Splunk will match your regex only once unless you specify MV_ADD = true, which makes Splunk continue looking for matches in the event even after it's found the first one. MV_ADD is only valid for search-time extractions, so you should consider using that kind instead...did I make myself clear enough on what kind of extraction you should be using? 😉

As a sidenote, I'm assuming you've seen that there's a Nessus app for Splunk? Don't know if it supports the XML report format though. http://splunk-base.splunk.com/apps/52460/nessus-in-splunk

gkanapathy
Splunk Employee

Would you mind changing the .* in your regex to the non-greedy .*? and seeing if that makes a difference?
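
To illustrate what the greedy versus non-greedy distinction means in general, here is a small sketch using Python's `re` module rather than Splunk itself. The event string is abbreviated from the sample data and deliberately puts both ReportItem elements on a single line, which is an assumption made so that `.` can span from one element into the next. Whether the same thing happens inside Splunk depends on how its regex engine treats the newlines in the real event, which this sketch does not model.

```python
import re

# Abbreviated from the sample event, with both ReportItem elements
# placed on ONE line (assumption) so "." can reach the second element.
event = (
    '<ReportItem port="1234" severity="3" '
    'pluginName="SNMP Agent Default Community Name (public)"></ReportItem>'
    '<ReportItem port="0" severity="3" '
    'pluginName="Unix Compliance Checks"></ReportItem>'
)

# Greedy: each ".*" grabs as much as it can, so one match swallows both
# elements and only the last pluginName is captured.
greedy = re.compile(r'<ReportItem.*severity="3".*pluginName="([^"]+)"')

# Non-greedy: each ".*?" stops at the first opportunity, so every
# ReportItem produces its own match.
lazy = re.compile(r'<ReportItem.*?severity="3".*?pluginName="([^"]+)"')

greedy_values = greedy.findall(event)
lazy_values = lazy.findall(event)

print(greedy_values)  # ['Unix Compliance Checks']
print(lazy_values)    # both plugin names, in document order
```

In this sketch the greedy pattern yields a single value and the non-greedy pattern yields two, which is the kind of difference the suggestion is probing for.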

jonaubrey
Explorer

I've changed the extraction back to an index-time extraction and run "walklex" against the index. This shows only a single value in the index instead of multiple values, so something definitely isn't getting pulled out right.

Interestingly, the TRANSFORMS extraction pulls out the value "SNMP Agent Default Community Name (public)" and the REPORT extraction pulls out the value "Unix Compliance Checks", even though it's the same regex. I guess Splunk is discarding all but one entry, keeping either the first or the last depending on whether it's an index-time or a search-time extraction.

jonaubrey
Explorer

Great idea!

Unfortunately, it didn't change anything. I'm still only extracting a single value.

Ayn
Legend

Ah! One idea: since the information is represented as key=value pairs, you might be hitting some issues with Splunk's default key=value extraction mechanism. Basically, Splunk tries to be smart about generating fields and corresponding values automatically when it sees stuff delimited by = signs, using the left-hand side as the field name and the right-hand side as the value. I believe this automatic extraction does not have MV_ADD = true.

Try setting KV_MODE = none in your props.conf settings. This will ensure that automatic key/value extraction is not performed for that stanza.
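
As a rough sketch of what that would look like, assuming the rest of the [nessus] stanza stays as posted in the question (the only new line is KV_MODE):

```
# props.conf (sketch; other settings copied from the question)
[nessus]
SHOULD_LINEMERGE = False
LINE_BREAKER = (?<=</ReportHost>)([\r\n]+)
TRUNCATE = 0
# Turn off automatic key=value extraction at search time for this stanza
KV_MODE = none
REPORT-nessus_high_vulnerbility = nessus_high_vulnerbility
```

KV_MODE is a search-time setting, so it should not require re-indexing the data to take effect.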

jonaubrey
Explorer

Same as before, the REX extraction pulls out two values, but the transforms extraction only pulls out a single value, "Unix Compliance Checks".

Ayn
Legend

That looks pretty OK. What are the current results?

btw, WRITE_META is only valid for index-time extractions. As such it should just be ignored in your current config anyway, but just to simplify things you might as well remove it.

jonaubrey
Explorer

props.conf

[nessus]
SHOULD_LINEMERGE = False
LINE_BREAKER = (?<=</ReportHost>)([\r\n]+)
TRUNCATE = 0
REPORT-nessus_high_vulnerbility = nessus_high_vulnerbility

transforms.conf

[nessus_high_vulnerbility]
REGEX = <ReportItem.*severity=\"3\".*pluginName=\"([^"]+)\"
FORMAT = nessus_high_vulnerbility::"$1"
LOOKAHEAD = 10000000000
WRITE_META = true
MV_ADD = true

Any other ideas?

jonaubrey
Explorer

Many thanks for your reply.

Yes I've seen the "Nessus In Splunk" app, but it relies on non-XML format, and I'm attempting to standardise all of the outputs from various tools to a single format, XML being the most common.

Unfortunately, I was attempting to do search-time extractions previously and they failed, which is why I swapped over to index-time extractions.

Just to confirm, I've changed my config files back to how I had them originally, but with the same result. I've posted them below:

Ayn
Legend

I fixed your formatting a bit - please check that it came out as you originally intended. Code blocks should be indented with 4 spaces at the beginning of the line in order to be correctly interpreted.
