Hi all,
I've been struggling with Splunk for weeks now (and had Developer training!) and I still can't get it to do what I want it to do, so here begins the first of many questions....
I'm attempting to build an app that does a single parse of some static data. Basically it's designed to read in lots of files and then using a dashboard, display the data in a meaningful way.
As such I'm attempting to do Index-time field extraction, as I want the displays to be as fast as possible for the end user. I've tried this a thousand ways and I can't get it working 😞
All of the data is in XML format, and a large chunk of it features multiple field values, which is where I'm getting stuck. I can extract multi-valued fields with no problem using REX, but it seems to refuse to do it using the config files. I've compiled the following example to show you what I mean. It uses just one file, but I'm having the same problem with all the files I'm pulling in:
props.conf
[nessus]
SHOULD_LINEMERGE = False
LINE_BREAKER = (?<=</ReportHost>)([\r\n]+)
TRUNCATE = 0
TRANSFORMS-nessus_high_vulnerbility = nessus_high_vulnerbility
transforms.conf
[nessus_high_vulnerbility]
REGEX = <ReportItem.*severity=\"3\".*pluginName=\"([^"]+)\"
FORMAT = nessus_high_vulnerbility::"$1"
LOOKAHEAD = 10000000000
WRITE_META = true
REPEAT_MATCH = true
fields.conf
[nessus_high_vulnerbility]
INDEXED = true
Example data
<Report name="1.1.1.1">
<ReportHost name="1.1.1.1"><HostProperties>
<tag name="HOST_END">Tue Nov 22 12:06:01 2011</tag>
<tag name="system-type">general-purpose</tag>
<tag name="operating-system">Linux Kernel 2.6.9-101.ELsmp on Red Hat Enterprise Linux ES release 4 (Nahant Update 9)</tag>
<tag name="mac-address">00:00:00:00:00:00</tag>
<ReportItem port="1234" svc_name="snmp?" protocol="udp" severity="3" pluginID="51160" pluginName="SNMP Agent Default Community Name (public)" pluginFamily="SNMP">
</ReportItem>
<ReportItem port="0" svc_name="general" protocol="tcp" severity="3" pluginID="21157" pluginName="Unix Compliance Checks" pluginFamily="Policy Compliance">
</ReportItem>
</ReportHost>
</Report>
Now if I search for * it tells me that the "nessus_high_vulnerbility" field has one result.
But if I do the following search, the "high_vulnerbility" field has 2 results, the correct number.
* | rex "\<ReportItem.*severity=\"3\".*pluginName=\"(?<high_vulnerbility>[^\"]+)\"" max_match=100000
I've tried everything I can think of, been through the documentation a hundred times, and still can't figure it out. Please help!
(PS, apologies if the above doesn't come out right, I'm struggling with getting Markdown to play nicely with the pasted code)
Interestingly, when I replace your transforms.conf with the following:
[getPlugin]
REGEX = severity=\"3\".*?pluginName=\"([^\"]+)
FORMAT = pluginName::$1
MV_ADD = true
I get a multi-value field pluginName with 2 values (the SNMP and Unix entries), so it's something to do with the extended regex you're using.
To be honest, I'd be hesitant to use the regex to filter data. Instead I'd aim to add all the fields and then filter using Splunk's native search capabilities. You never know when you might need to search using different criteria, and by hard-coding your results you limit that flexibility.
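To illustrate the extract-everything-then-filter idea outside Splunk, here is a minimal Python sketch. It uses Python's re module rather than Splunk's extraction engine, and the helper names extract_items and filter_items are made up for the example; the sample lines are the two ReportItem elements from the question.

```python
import re

# The two ReportItem elements from the example data in the question.
SAMPLE = '''<ReportItem port="1234" svc_name="snmp?" protocol="udp" severity="3" pluginID="51160" pluginName="SNMP Agent Default Community Name (public)" pluginFamily="SNMP">
</ReportItem>
<ReportItem port="0" svc_name="general" protocol="tcp" severity="3" pluginID="21157" pluginName="Unix Compliance Checks" pluginFamily="Policy Compliance">
</ReportItem>'''

ITEM_RE = re.compile(r'<ReportItem\b([^>]*)>')   # one opening tag per item
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')      # name="value" pairs inside it


def extract_items(text):
    """Hypothetical helper: one dict of attribute -> value per ReportItem."""
    return [dict(ATTR_RE.findall(m.group(1))) for m in ITEM_RE.finditer(text)]


def filter_items(items, field, value):
    """Hypothetical helper: keep the items whose `field` equals `value`."""
    return [item for item in items if item.get(field) == value]


items = extract_items(SAMPLE)

# Filter after extraction, on whichever attribute is needed at the time.
high = [i["pluginName"] for i in filter_items(items, "severity", "3")]
udp = [i["pluginName"] for i in filter_items(items, "protocol", "udp")]
print(high)  # ['SNMP Agent Default Community Name (public)', 'Unix Compliance Checks']
print(udp)   # ['SNMP Agent Default Community Name (public)']
```

Because every attribute is kept, the severity filter is just one query among many rather than something baked into the extraction.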
As an aside - the XML as written is broken. The HostProperties tag doesn't seem to be closed.
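The well-formedness problem is easy to confirm with any XML parser. Below is a minimal Python sketch using the standard library's xml.etree.ElementTree on a shortened copy of the example data. check_well_formed is a hypothetical helper, and placing the closing HostProperties tag just before the first ReportItem is an assumption about where it belongs.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example data; <HostProperties> is never closed.
SAMPLE = '''<Report name="1.1.1.1">
<ReportHost name="1.1.1.1"><HostProperties>
<tag name="HOST_END">Tue Nov 22 12:06:01 2011</tag>
<tag name="system-type">general-purpose</tag>
<ReportItem port="0" svc_name="general" protocol="tcp" severity="3" pluginID="21157" pluginName="Unix Compliance Checks" pluginFamily="Policy Compliance">
</ReportItem>
</ReportHost>
</Report>'''


def check_well_formed(xml_text):
    """Hypothetical helper: None if the text parses, else the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


error = check_well_formed(SAMPLE)
print("original:", error)

# Assumption: the missing close tag belongs right before the first ReportItem.
FIXED = SAMPLE.replace("<ReportItem", "</HostProperties>\n<ReportItem", 1)
fixed_error = check_well_formed(FIXED)
print("fixed:", fixed_error)
```

The original sample should report a parse error and the patched copy should parse cleanly, which is worth checking against the real files before blaming the extraction config.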
First off, I highly doubt you really want to use index-time field extractions unless you really, really know what you are doing and why. Index-time extractions will in fact most often decrease performance rather than increase it. Indexed fields do not work the same way as they do in a traditional RDBMS, so applying that kind of thinking to Splunk will lead you astray. Use search-time field extractions: the performance is better, and it makes Splunk's behaviour less confusing and more flexible. So, I would advise you to change your TRANSFORMS directive in props.conf to a REPORT directive instead.
That said, I think the issue here is that Splunk will match your regex only once unless you specify MV_ADD = true, which makes Splunk continue looking for matches in the event even after it's found the first one. MV_ADD is only valid for search-time extractions, so you should consider using that kind instead... did I make myself clear enough on what kind of extraction you should be using? 😉
As a sidenote, I'm assuming you've seen that there's a Nessus app for Splunk? Don't know if it supports the XML report format though. http://splunk-base.splunk.com/apps/52460/nessus-in-splunk
Would you mind changing the .* in your regex to a non-greedy .*? and seeing if that makes a difference?
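Why greedy versus non-greedy could matter can be sketched in Python. This uses Python's re module, not Splunk's regex engine, so it only shows general regex behaviour; whether it explains the Splunk result in this thread is not established. The sample is the two ReportItem elements from the question, and re.findall stands in for repeated matching.

```python
import re

# The two ReportItem elements from the example data in the question.
SAMPLE = '''<ReportItem port="1234" svc_name="snmp?" protocol="udp" severity="3" pluginID="51160" pluginName="SNMP Agent Default Community Name (public)" pluginFamily="SNMP">
</ReportItem>
<ReportItem port="0" svc_name="general" protocol="tcp" severity="3" pluginID="21157" pluginName="Unix Compliance Checks" pluginFamily="Policy Compliance">
</ReportItem>'''

GREEDY = r'<ReportItem.*severity="3".*pluginName="([^"]+)"'
LAZY = r'<ReportItem.*?severity="3".*?pluginName="([^"]+)"'

# Default flags: "." stops at a newline, so every match stays on one line
# and greedy and lazy quantifiers capture the same values.
greedy_line = re.findall(GREEDY, SAMPLE)
lazy_line = re.findall(LAZY, SAMPLE)

# DOTALL: "." also matches newlines, so a greedy ".*" can run across
# elements and swallow everything up to the last pluginName.
greedy_dotall = re.findall(GREEDY, SAMPLE, flags=re.DOTALL)
lazy_dotall = re.findall(LAZY, SAMPLE, flags=re.DOTALL)

print(greedy_line)    # ['SNMP Agent Default Community Name (public)', 'Unix Compliance Checks']
print(lazy_line)      # ['SNMP Agent Default Community Name (public)', 'Unix Compliance Checks']
print(greedy_dotall)  # ['Unix Compliance Checks']
print(lazy_dotall)    # ['SNMP Agent Default Community Name (public)', 'Unix Compliance Checks']
```

So the non-greedy form is the safer choice whenever the dot might cross line boundaries, and it costs nothing when it does not.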
I've changed the extraction back to an index-time extraction and run "walklex" against the index. This is showing only a single value instead of multiple values within the index, so something definitely isn't getting pulled out right.
Interestingly, the TRANSFORMS extraction pulls out the value "SNMP Agent Default Community Name (public)" and the REPORT extraction pulls out the value "Unix Compliance Checks", even though it's the same REGEX. I guess Splunk is discarding all but one entry, and depending on whether it's a search-time or index-time extraction, it keeps either the first or the last entry.
Great idea!
Unfortunately, it didn't change anything. I'm still only extracting a single value.
Ah! One idea - since the information is represented as key=value pairs, you might be hitting some issues with Splunk's default key=value extraction mechanism. Basically Splunk tries to be smart about generating fields and corresponding values automatically when it sees stuff delimited by = signs, putting the left-hand side as the field name and the right-hand side as the value. This extraction does not have MV_ADD = true, I believe.
Try setting KV_MODE = none in your props.conf settings. This will ensure that automatic key/value extraction is not performed for that stanza.
Same as before, the REX extraction pulls out two values, but the transforms extraction only pulls out a single value, "Unix Compliance Checks".
That looks pretty OK. What are the current results?
btw, WRITE_META is only valid for index-time extractions. As such it should just be ignored in your current config anyway, but just to simplify things you might as well remove it.
props.conf
[nessus]
SHOULD_LINEMERGE = False
LINE_BREAKER = (?<=</ReportHost>)([\r\n]+)
TRUNCATE = 0
REPORT-nessus_high_vulnerbility = nessus_high_vulnerbility
transforms.conf
[nessus_high_vulnerbility]
REGEX = <ReportItem.*severity=\"3\".*pluginName=\"([^"]+)\"
FORMAT = nessus_high_vulnerbility::"$1"
LOOKAHEAD = 10000000000
WRITE_META = true
MV_ADD = true
Any other ideas?
Many thanks for your reply.
Yes I've seen the "Nessus In Splunk" app, but it relies on non-XML format, and I'm attempting to standardise all of the outputs from various tools to a single format, XML being the most common.
Unfortunately I was attempting to do search-time extractions previously and it failed, which is why I swapped over to Index-time extractions.
Just to confirm, I've changed my config files back to how I had them originally, but with the same result. I've posted them below:
I fixed your formatting a bit - please check that it came out as you originally intended. Code blocks should be indented with 4 spaces at the beginning of the line in order to be correctly interpreted.