Splunk Search

fields.conf TOKENIZER breaks my event completely

dsmith
Path Finder

I'm trying to get a new sourcetype (NetApp user-level audit logs, exported as XML) to work, and I think my fields.conf tokenizer is breaking things. But I'm not really sure how, or why, or what to do about it.

The raw data is XML, but I'm not using KV_MODE=xml because that doesn't properly handle all the attributes. So, I've got a bunch of custom regular expressions, the true backbone of all enterprise software. 🙂 Here's a single sample event (but you can probably disregard most of it, it's just here for completeness):

<Event><System><Provider Name="NetApp-Security-Auditing" Guid="{guid-edited}"/><EventID>4656</EventID><EventName>Open Object</EventName><Version>101.3</Version><Source>CIFS</Source><Level>0</Level><Opcode>0</Opcode><Keywords>0x8020000000000000</Keywords><Result>Audit Success</Result><TimeCreated SystemTime="2022-01-12T15:42:41.096809000Z"/><Correlation/><Channel>Security</Channel><Computer>server-name-edited</Computer><ComputerUUID>guid-edited</ComputerUUID><Security/></System><EventData><Data Name="SubjectIP" IPVersion="4">1.2.3.4</Data><Data Name="SubjectUnix" Uid="1234" Gid="1234" Local="false"></Data><Data Name="SubjectUserSid">S-1-5-21-3579272529-1234567890-2280984729-123456</Data><Data Name="SubjectUserIsLocal">false</Data><Data Name="SubjectDomainName">ACCOUNTS</Data><Data Name="SubjectUserName">davidsmith</Data><Data Name="ObjectServer">Security</Data><Data Name="ObjectType">Directory</Data><Data Name="HandleID">00000000000444;00;002a62a7;0d3d88a4</Data><Data Name="ObjectName">(Shares);/LogTestActivity/dsmith/wordpress-shared/plugins-shared</Data><Data Name="AccessList">%%4416 %%4423 </Data><Data Name="AccessMask">81</Data><Data Name="DesiredAccess">Read Data; List Directory; Read Attributes; </Data><Data Name="Attributes">Open a directory; </Data></EventData></Event>

My custom app's props.conf has a couple dozen lines like this, for each element I want to be able to search on:

EXTRACT-DesiredAccess = <Data Name="DesiredAccess">(?<DesiredAccess>.*?)<\/Data>
EXTRACT-HandleID = <Data Name="HandleID">(?<HandleID>.*?)<\/Data>
EXTRACT-InformationRequested = <Data Name="InformationRequested">(?<InformationRequested>.*?)<\/Data>

This works as you'd expect, except for a couple of composite fields. This is most noticeable in the DesiredAccess element, which in our example looks like:

<Data Name="DesiredAccess">Read Data; List Directory; Read Attributes; </Data>

Thus you get a single field with "Read Data; List Directory; Read Attributes; " and if you only need to look for, say, "List Directory," you have to get clever with your searches.

So, I added a fields.conf file with this in it:

[DesiredAccess]
TOKENIZER = \s?(.*?);

When I paste the 'raw' contents of that field, and that regex, into a tool like regex101.com, it works and returns the expected results. Similarly, it also works if I remove it from fields.conf, and put it in as a makemv command:

index=nonprod_pe | makemv tokenizer="\s?(.*?);" DesiredAccess
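For anyone who wants to repeat the regex101 check offline, here is a minimal sketch. It uses Python's standard re module as a stand-in for Splunk's regex engine (an assumption: the two agree on a pattern this simple), and it only shows what the pattern captures, not how fields.conf applies it.

```python
import re

# Field value exactly as it appears between the <Data> tags in the sample event.
desired_access = "Read Data; List Directory; Read Attributes; "

# The pattern from the post: optional whitespace, lazy capture, literal semicolon.
pattern = re.compile(r"\s?(.*?);")

# findall returns the contents of the single capture group for every match.
tokens = pattern.findall(desired_access)
print(tokens)
```

The capture group yields the three access names without semicolons or leading spaces, which matches what makemv produces in the search above.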

With the TOKENIZER element in fields.conf, the DesiredAccess attribute just doesn't populate, period. So I assume it's the problem.

(Since this is in an app, the app's metadata does contain explicit "export = system" lines for both [props] and [fields]. And the app is on indexers and search heads. Probably doesn't need to be in both places, but hey I'm still learning...)

So, what am I doing wrong with my fields.conf tokenizer, that's caused it to fail completely to identify any elements?


diogofgm
SplunkTrust

Have you tried using transforms? You might want to give this a try:

transforms.conf

 

[extract_xml_data_atribute_as_field]
REGEX=<Data Name="([^"]*)"[^>]*>([^<]*)
FORMAT=$1::$2

[extract_xml_data_values_list_as_mv]
SOURCE_KEY = DesiredAccess
REGEX = (?<DesiredAccessList>[^;]*);
MV_ADD=true

 

 

props.conf

 

[<your_sourcetype>]
REPORT-xml_data_to_field = extract_xml_data_atribute_as_field, extract_xml_data_values_list_as_mv
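To make the two transforms easier to reason about, here is a rough Python sketch of what each regex captures on a shortened copy of the sample event. It is an illustration only: Python's re module stands in for Splunk's regex engine, the dict stands in for FORMAT = $1::$2, and the list stands in for MV_ADD = true. None of this is Splunk's actual implementation.

```python
import re

# Shortened copy of the <EventData> block from the sample event.
event_data = (
    '<Data Name="SubjectIP" IPVersion="4">1.2.3.4</Data>'
    '<Data Name="SubjectUnix" Uid="1234" Gid="1234" Local="false"></Data>'
    '<Data Name="SubjectUserName">davidsmith</Data>'
    '<Data Name="DesiredAccess">Read Data; List Directory; Read Attributes; </Data>'
)

# First transform: group 1 is the field name, group 2 the value (FORMAT = $1::$2).
pair_re = re.compile(r'<Data Name="([^"]*)"[^>]*>([^<]*)')
fields = dict(pair_re.findall(event_data))

# Second transform: every match contributes one value (the role of MV_ADD = true).
# Python spells named groups (?P<name>...) where the posted pattern uses (?<name>...).
list_re = re.compile(r"(?P<DesiredAccessList>[^;]*);")
desired_access_list = list_re.findall(fields["DesiredAccess"])

print(fields["SubjectUserName"])
print(desired_access_list)
```

In this sketch the second and third values keep their leading space, because [^;]* also matches whitespace, so some trimming may still be needed downstream.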

 



------------
Hope I was able to help you. If so, some karma would be appreciated.

dsmith
Path Finder

What benefits would there be to a transforms.conf approach over fields.conf? I'm still fairly new to Splunk, and definitely new to this sort of data massaging, so I don't deeply understand the pros and cons of each.


diogofgm
SplunkTrust

In your last reply you said that the other solution you ended up with was close enough. 🙂 This solution is better than close enough. It also avoids the extra makemv command in your search, because with these transforms the field is already extracted as a multivalue field.


yuanliu
SplunkTrust

It is inadvisable to handle structured data with custom regexes because that approach is fraught with pitfalls. It is better to focus on why KV_MODE=xml "doesn't properly handle all the attributes." Generally speaking, there is no reason why a vendor's tested built-in function cannot handle conformant data.

Can you illustrate, with cleansed data, what the indexer/spath isn't handling correctly?


dsmith
Path Finder

KV_MODE=xml doesn't handle most of the <Data Name="fieldname">value</Data> elements the way I would hope/expect. You'll get an attribute named literally "Name" but not something named "fieldname" with a value of "value". The most egregious example in terms of practicality is probably:

<Data Name="SubjectUserName">davidsmith</Data>

In the above, I would like an event attribute named "SubjectUserName" with a value of "davidsmith". (Yes, I want user names in my audit logs...) But neither KV_MODE=xml, nor |xmlkv in a search, handle this case properly. (Or at least "the way I want them to," which may or may not be "properly.")
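To pin down the mapping being asked for, here is a small sketch in Python using the standard xml.etree.ElementTree module on a shortened copy of the sample event. It only illustrates the desired result (Name attribute becomes the field name, element text becomes the value). It is not how KV_MODE=xml or xmlkv work internally.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample event, keeping three <Data> elements.
event = (
    '<Event><EventData>'
    '<Data Name="SubjectIP" IPVersion="4">1.2.3.4</Data>'
    '<Data Name="SubjectUnix" Uid="1234" Gid="1234" Local="false"></Data>'
    '<Data Name="SubjectUserName">davidsmith</Data>'
    '</EventData></Event>'
)

root = ET.fromstring(event)

# Use each element's Name attribute as the field name and its text as the value.
# Empty elements have text None, so fall back to an empty string.
fields = {d.get("Name"): (d.text or "") for d in root.iter("Data")}
print(fields)
```

Note that the other attributes (IPVersion, Uid, Gid, Local) are simply dropped here; a fuller mapping would have to decide where they go.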

NetApp's particular flavor of XML has been an issue for years:

https://community.splunk.com/t5/Getting-Data-In/How-to-configure-Splunk-to-index-NetApp-CIFS-logs-in...

https://community.splunk.com/t5/Dashboards-Visualizations/Parsing-oddly-formatted-XML-NetApp-log/m-p...

Not that this is relevant, because the specific elements I'm asking about in this topic, such as DesiredAccess, aren't parsed properly either. 🙂

I'm primarily interested in understanding why my fields.conf tokenizers aren't working, not so much in debugging Splunk's internal XML parser.


yuanliu
SplunkTrust

KV_MODE=xml is perhaps the wrong option for this problem. On the other hand, the spath command can put attributes into field names with the {@attrib} notation, so you don't get a field name like "Name"; instead, you get a scalar facsimile of the vectorial attribute space, like Event.EventData.Data{@Name}, Event.System.Provider{@Name}, and so on. Like any reduction of dimensions, spath ends up losing some information. (Another problem, which I consider a bug, is that spath does not handle empty values correctly.) But because XML follows an application-specific DTD, you can usually compensate with application-specific handling, like the following:

 

| rex mode=sed "s/><\/Data/>()<\//g" ``` compensate for spath's inability to handle empty values ```
| spath
| rename Event.EventData.Data{@*} as EventData*, Event.EventData.Data as EventDataData ``` most eval functions cannot handle {} notation ```
| eval EventDataName=mvmap(EventDataName, case(EventDataName == "SubjectUnix", "SubjectUnix <Uid:" . EventDataUid . ", Gid:" . EventDataGid . ", Local:" . EventDataLocal . ">", EventDataName == "SubjectIP", "SubjectIP<" . EventDataIPVersion . ">", true(), EventDataName)) ``` application-specific mapping ```
| eval Combo = mvzip(EventDataName, EventDataData, "=")

 

(See inline comments)  Output from your sample data is

Combo:
    SubjectIP<4>=1.2.3.4
    SubjectUnix <Uid:1234, Gid:1234, Local:false>=()
    SubjectUserSid=S-1-5-21-3579272529-1234567890-2280984729-123456
    SubjectUserIsLocal=false
    SubjectDomainName=ACCOUNTS
    SubjectUserName=davidsmith
    ObjectServer=Security
    ObjectType=Directory
    HandleID=00000000000444;00;002a62a7;0d3d88a4
    ObjectName=(Shares);/LogTestActivity/dsmith/wordpress-shared/plugins-shared
    AccessList=%%4416 %%4423
    AccessMask=81
    DesiredAccess=Read Data; List Directory; Read Attributes;
    Attributes=Open a directory;
Event.System.Channel: Security
Event.System.Computer: server-name-edited
Event.System.ComputerUUID: guid-edited
Event.System.EventID: 4656
Event.System.EventName: Open Object
Event.System.Keywords: 0x8020000000000000
Event.System.Level: 0
Event.System.Opcode: 0
Event.System.Provider{@Guid}: {guid-edited}
Event.System.Provider{@Name}: NetApp-Security-Auditing
Event.System.Result: Audit Success
Event.System.Source: CIFS
Event.System.TimeCreated{@SystemTime}: 2022-01-12T15:42:41.096809000Z
Event.System.Version: 101.3
EventDataData:
    1.2.3.4
    ()
    S-1-5-21-3579272529-1234567890-2280984729-123456
    false
    ACCOUNTS
    davidsmith
    Security
    Directory
    00000000000444;00;002a62a7;0d3d88a4
    (Shares);/LogTestActivity/dsmith/wordpress-shared/plugins-shared
    %%4416 %%4423
    81
    Read Data; List Directory; Read Attributes;
    Open a directory;
EventDataGid: 1234
EventDataIPVersion: 4
EventDataLocal: false
EventDataName:
    SubjectIP<4>
    SubjectUnix <Uid:1234, Gid:1234, Local:false>
    SubjectUserSid
    SubjectUserIsLocal
    SubjectDomainName
    SubjectUserName
    ObjectServer
    ObjectType
    HandleID
    ObjectName
    AccessList
    AccessMask
    DesiredAccess
    Attributes
EventDataUid: 1234
_raw: <Event><System><Provider Name="NetApp-Security-Auditing" Guid="{guid-edited}"/><EventID>4656</EventID><EventName>Open Object</EventName><Version>101.3</Version><Source>CIFS</Source><Level>0</Level><Opcode>0</Opcode><Keywords>0x8020000000000000</Keywords><Result>Audit Success</Result><TimeCreated SystemTime="2022-01-12T15:42:41.096809000Z"/><Correlation/><Channel>Security</Channel><Computer>server-name-edited</Computer><ComputerUUID>guid-edited</ComputerUUID><Security/></System><EventData><Data Name="SubjectIP" IPVersion="4">1.2.3.4</Data><Data Name="SubjectUnix" Uid="1234" Gid="1234" Local="false">()</><Data Name="SubjectUserSid">S-1-5-21-3579272529-1234567890-2280984729-123456</Data><Data Name="SubjectUserIsLocal">false</Data><Data Name="SubjectDomainName">ACCOUNTS</Data><Data Name="SubjectUserName">davidsmith</Data><Data Name="ObjectServer">Security</Data><Data Name="ObjectType">Directory</Data><Data Name="HandleID">00000000000444;00;002a62a7;0d3d88a4</Data><Data Name="ObjectName">(Shares);/LogTestActivity/dsmith/wordpress-shared/plugins-shared</Data><Data Name="AccessList">%%4416 %%4423 </Data><Data Name="AccessMask">81</Data><Data Name="DesiredAccess">Read Data; List Directory; Read Attributes; </Data><Data Name="Attributes">Open a directory; </Data></EventData></Event>
_time: 2022-01-12T15:42:41

In the above, the Combo field is a scalar representation of the <Event><EventData><Data> entities, using Event.EventData.Data{@Name} as the primary attribute. As you can see, SubjectUserName=davidsmith is one of the values in Combo.

yuanliu
SplunkTrust
@yuanliu wrote:

KV_MODE=xml is perhaps the wrong option for this problem.  On the other hand, spath command


I didn't look deeply enough.  In fact, KV_MODE=XML performs spath just like in explicit SPL.  It could have worked if not for the want of a placeholder value when Event.EventData.Data contains null values.  In explicit spath, I try to fix this bug with "s/><\/Data/>()<\//g" before running spath.  But  there is no way to fix implicit output.
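A small sketch of why the placeholder matters, written in Python rather than SPL. It rests on an assumption that is not confirmed in this thread: that the empty SubjectUnix value is simply missing from the multivalue list of values, so names and values no longer line up by position. The mvzip helper below is a hypothetical stand-in that pairs values by position and stops at the shorter list.

```python
def mvzip(left, right, delim=","):
    """Pair two value lists by position, stopping at the shorter one."""
    return [f"{a}{delim}{b}" for a, b in zip(left, right)]

names = ["SubjectIP", "SubjectUnix", "SubjectUserSid"]
sid = "S-1-5-21-3579272529-1234567890-2280984729-123456"

# Assumed failure mode: the empty SubjectUnix value is dropped from the list.
values_without_placeholder = ["1.2.3.4", sid]

# After the sed step, the empty element carries "()" and keeps its position.
values_with_placeholder = ["1.2.3.4", "()", sid]

misaligned = mvzip(names, values_without_placeholder, "=")
aligned = mvzip(names, values_with_placeholder, "=")

print(misaligned)
print(aligned)
```

Under that assumption, the list without the placeholder pairs SubjectUnix with the SID and loses SubjectUserSid entirely, while the list with "()" keeps every name next to its own value.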


dsmith
Path Finder

Which is great but doesn't address the question I'm asking. Note that the DesiredAccess attribute still is shown as a single text item, and isn't being tokenized into its individual components.

Assume for the sake of this discussion that I'm going to stick with regexes for now. I have working regular expressions for the fields I care about, and as long as I don't also have a tokenizer for those fields, the field extraction works. But when I add fields.conf the fields named therein aren't extracted, period. Any suggestions on what I'm doing wrong there?


yuanliu
SplunkTrust

The spath code is just to illustrate how to clean up. Key-value pairs in Combo can be extracted using the extract command (aka kv).

| spath
| rename Event.EventData.Data{@*} as EventData*, Event.EventData.Data as EventDataData ``` most eval functions cannot handle {} notation ```
| eval EventDataName=mvmap(EventDataName, case(EventDataName == "SubjectUnix", "SubjectUnix <Uid:" . EventDataUid . ", Gid:" . EventDataGid . ", Local:" . EventDataLocal . ">", EventDataName == "SubjectIP", "SubjectIP<" . EventDataIPVersion . ">", true(), EventDataName)) ``` application-specific mapping ```
| eval Combo = mvzip(EventDataName, EventDataData, "=\"")
| rename Combo as _raw
| rex mode=sed "s/$/\"/"
| kv kvdelim="=" ``` extract key-value pairs from Combo ```
| fields - Event*, _raw
| makemv delim=";" DesiredAccess
| makemv delim=";" Attributes
| makemv delim=";" HandleID
| makemv delim=";" ObjectName

 

 

Sample output is like

AccessList: %%4416 %%4423
AccessMask: 81
Attributes: Open a directory
DesiredAccess:
    Read Data
    List Directory
    Read Attributes
HandleID:
    00000000000444
    00
    002a62a7
    0d3d88a4
    00;002a62a7;0d3d88a4
ObjectName:
    (Shares)
    /LogTestActivity/dsmith/wordpress-shared/plugins-shared
ObjectServer: Security
ObjectType: Directory
SubjectDomainName: ACCOUNTS
SubjectUserIsLocal: false
SubjectUserName: davidsmith
SubjectUserSid: S-1-5-21-3579272529-1234567890-2280984729-123456
_time: 2022-01-12T15:42:41

The main point is that structured data is best handled with conformant, tested code. In addition, complex, custom index-time extraction makes maintenance difficult. Search-time prowess is Splunk's very strength. Why not use it?

Meanwhile, the error in fields.conf is that the TOKENIZER regex must not contain extra characters outside the capturing group that defines the token itself. This should work:

[DesiredAccess]
TOKENIZER = (\b[^;]+)
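As a quick offline check of what this pattern captures, here is a minimal Python sketch. Python's re module is used as a stand-in for Splunk's regex engine, which is an assumption; the two field values are copied from the sample event.

```python
import re

# The suggested tokenizer: start at a word boundary, capture up to the next semicolon.
tokenizer = re.compile(r"(\b[^;]+)")

desired_access = "Read Data; List Directory; Read Attributes; "
handle_id = "00000000000444;00;002a62a7;0d3d88a4"

access_tokens = tokenizer.findall(desired_access)
handle_tokens = tokenizer.findall(handle_id)

print(access_tokens)
print(handle_tokens)
```

The word boundary is what skips the space after each semicolon and the trailing space at the end of the value, so every match consists of the token and nothing else.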

 

 

 

 


dsmith
Path Finder

Well, at least that updated tokenizer breaks things in a different way... 

I edited the fields.conf I'm pushing out to my search heads thusly:

[DesiredAccess]
# TOKENIZER = \s?(.*?);
TOKENIZER = (\b[^;]+)

The contents of the fields so tokenized (is that a word?) at least show up when I expand a given search result now. They're a single line, with the semicolons removed. (I highlighted multiple lines because there are actually about a half-dozen such fields that I'm extracting; I limited it to a single instance for this thread because the solution for one should be identical to all the others.)

[Screenshot: a single event, with the improperly extracted fields highlighted.]

 

Your regex works correctly in online tools like regex101.com, but then again so did mine. (Yours is cleaner and faster, though, so thank you for that.) I wish Splunk had more and better examples of how to use TOKENIZER in the docs.

 

Dumb Newbie Question of the day: The fields are split correctly if I remove the tokenizer, and add | makemv delim=";" FieldNameHere to a search. Is there a way to add that to a config file? (i.e. "every time you search this sourcetype, do this" or similar) Part of my goal here is to make life easier for users that aren't deeply familiar with Splunk field commands, and asking these users to add a half-dozen makemv commands to every search isn't exactly convenient for anyone involved.


yuanliu
SplunkTrust

@dsmith wrote:

The contents of the fields so tokenized (is that a word?) at least show up when I expand a given search result now. They're a single line, with the semicolons removed. (I highlighted multiple lines because there are actually about a half-dozen such fields that I'm extracting; I limited it to a single instance for this thread because the solution for one should be identical to all the others.)

 

Maybe you can elaborate on "breaks things in a different way..." You are correct that the values look like they are on a single line IF you just expand the event view. But that appearance by itself doesn't mean much. Based on your original question, your intention is to break DesiredAccess, etc., into a multivalue field instead of a single semicolon-separated string. The proposed TOKENIZER does exactly that. How is this broken? You can count the number of values in DesiredAccess like this:

 

| eval AccessCount=mvcount(DesiredAccess)

 

You'll see that the count is > 1. I ingested your sample data, then used the following props.conf

 

[xml-too_small]
EXTRACT-DesiredAccess = <Data Name="DesiredAccess">(?<DesiredAccess>.*?)<\/Data>
EXTRACT-HandleID = <Data Name="HandleID">(?<HandleID>.*?)<\/Data>
EXTRACT-InformationRequested = <Data Name="InformationRequested">(?<InformationRequested>.*?)<\/Data>
EXTRACT-Attributes = <Data Name="Attributes">(?<Attributes>[^<]*)<\/Data>

 

and fields.conf

 

[DesiredAccess]
TOKENIZER = (\b[^;]+)
[ObjectName]
TOKENIZER = (\b[^;]+)
[InformationRequested]
TOKENIZER = (\b[^;]+)
[Attributes]
TOKENIZER = (\b[^;]+)
[HandleID]
TOKENIZER = (\b[^;]+)

 

When I perform this search

 

index="tests" source="netapptest.xml"
| table DesiredAccess Attributes HandleID ObjectName
| eval AccessCount=mvcount(DesiredAccess)
| eval ObjectCount=mvcount(ObjectName)
| eval HandleCount=mvcount(HandleID)

 

it gives

DesiredAccess:
    Read Data
    List Directory
    Read Attributes
Attributes: Open a directory
HandleID:
    00000000000444
    00
    002a62a7
    0d3d88a4
ObjectName:
    Shares)
    LogTestActivity/dsmith/wordpress-shared/plugins-shared
AccessCount: 3
HandleCount: 4
ObjectCount: 2

So, even though the expanded event view displays these fields in a single line, they are really multivalue fields now; DesiredAccess, for example, is made of 3 distinct values. (Do not test this in verbose mode.  That mode can interact strangely.)  This is exactly what TOKENIZER does, and I believe that this is what you originally wanted.

The "clipping" of the opening parenthesis in ObjectName highlights the reason why I strongly recommend using vendor-provided commands like spath. You can fine-tune that TOKENIZER to get around this one problem, but there may be other data values that break it.
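The clipping can be reproduced outside Splunk with a short Python sketch (Python's re module as a stand-in for Splunk's regex engine, which is an assumption). The second pattern is a hypothetical fine-tuned variant that has not been tested as a TOKENIZER; it is only meant to show the kind of adjustment that gets around this particular value.

```python
import re

object_name = "(Shares);/LogTestActivity/dsmith/wordpress-shared/plugins-shared"
desired_access = "Read Data; List Directory; Read Attributes; "

# Posted tokenizer: \b needs a word character on one side, so a leading "(" or "/"
# is skipped and the match starts at the first letter after it.
posted = re.compile(r"(\b[^;]+)")
clipped = posted.findall(object_name)

# Hypothetical variant: start at any character that is neither ";" nor whitespace.
refined = re.compile(r"([^;\s][^;]*)")
unclipped = refined.findall(object_name)
access_tokens = refined.findall(desired_access)

print(clipped)
print(unclipped)
print(access_tokens)
```

With the variant, both the parenthesis and the leading slash survive, and DesiredAccess still splits into three values. Other values (for example, ones with whitespace just before a semicolon) could still need further tuning.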

So, I refined the spath method to eliminate glitches when there are multiple attributes in one property:

 

| rex mode=sed "s/><\/Data/>()<\//g" ``` compensate for spath's inability to handle empty values ```
| spath
| rename Event.EventData.Data{@*} as EventData*, Event.EventData.Data as EventDataData ``` most eval functions cannot handle {} notation ```
| eval EventDataName=mvmap(EventDataName, case(EventDataName == "SubjectUnix", "SubjectUnix <Uid:" . EventDataUid . ", Gid:" . EventDataGid . ", Local:" . EventDataLocal . ">", EventDataName == "SubjectIP", "SubjectIP<" . EventDataIPVersion . ">", true(), EventDataName)) ``` application-specific mapping ```
| eval Combo = mvzip(EventDataName, EventDataData, "=\"")
| eval Combo = mvmap(Combo, replace(Combo, "<(.+)>=\"", "=\"<\1>")) ``` handle multi-attribute properties ```
| rename Combo as _raw
| rex mode=sed "s/$/\"/"
| kv kvdelim="=" ``` extract key-value pairs from Combo ```
| fields - Event*, _raw
| makemv delim=";" DesiredAccess
| makemv delim=";" Attributes
| makemv delim=";" HandleID
| makemv delim=";" ObjectName

 

(Note: The above will not work correctly when that custom TOKENIZER exists.)  This is a lot more generic in terms of which parts of XML turn into fields.  The output of the above for your sample data is

 
_time: 2022-01-21 19:38:25
AccessList: %%4416 %%4423
AccessMask: 81
Attributes: Open a directory
DesiredAccess:
    Read Data
    List Directory
    Read Attributes
HandleID:
    00000000000444
    00
    002a62a7
    0d3d88a4
ObjectName:
    (Shares)
    /LogTestActivity/dsmith/wordpress-shared/plugins-shared
ObjectServer: Security
ObjectType: Directory
SubjectDomainName: ACCOUNTS
SubjectIP: <4>1.2.3.4
SubjectUnix: <Uid:1234, Gid:1234, Local:false>()
SubjectUserIsLocal: false
SubjectUserName: davidsmith
SubjectUserSid: S-1-5-21-3579272529-1234567890-2280984729-123456

Now DesiredAccess, ObjectName, etc. are multivalued, the first value of ObjectName is no longer missing its opening parenthesis, and SubjectIP shows its version embedded in the value, as does SubjectUnix with its attributes.

dsmith
Path Finder

I replaced the tokenizer for my desired fields with

TOKENIZER = (\s?(.*?);)

It's close enough for my case. The tokenized values still include the semicolon, but I can live with that for now. (I tried (\s?(.*?)); but then all the values were empty strings.)
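One way to see why the semicolons remain: assuming the token is taken from the first capture group (consistent with the behaviour described in this thread), wrapping the whole pattern in an outer group puts the separator inside the token. The sketch below uses Python's re module as a stand-in for Splunk's regex engine. It does not reproduce the empty-string result reported for the other variant, which suggests Splunk's tokenizer differs from a plain regex scan in ways this sketch does not capture.

```python
import re

desired_access = "Read Data; List Directory; Read Attributes; "

# Outer group 1 wraps everything, including the optional space and the semicolon.
final = re.compile(r"(\s?(.*?);)")

# Group 1 is the whole match; group 2 is just the text before the semicolon.
outer_tokens = [m.group(1) for m in final.finditer(desired_access)]
inner_tokens = [m.group(2) for m in final.finditer(desired_access)]

print(outer_tokens)
print(inner_tokens)
```

In this sketch the outer group also keeps the leading space on the second and third values, while the inner group holds the clean names.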
