
Need a generic way to handle comma-separated values as multi-value fields during field extraction

asees
Engager

I’m working with CEF logs in Splunk where some fields contain comma-separated values.

Goal

Find a generic solution so that any field containing comma-separated values is automatically treated as a true multi-value field during field extraction — without needing to define each field name individually in props.conf.

Example event:
CEF:0|vendor|product|1.0||||dst_ip=172.18.20.16,172.18.20.12,172.18.20.13,172.18.20.10|src_ip=10.1.1.1,10.1.1.2|user_list=alice,bob,charlie|error_codes=ERR101,ERR102|app_names=Splunk,ServiceNow,Elastic|location=datacenter-1|priority=high|status=open


Current config

1. props.conf:


[my:sourcetype]
DATETIME_CONFIG = CURRENT
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
REPORT-generic_field_extraction = generic_key_value_extraction
EVAL-dst_ip = split(dst_ip, ",")
EVAL-src_ip = split(src_ip, ",")
EVAL-user_list = split(user_list, ",")
EVAL-error_codes = split(error_codes, ",")
EVAL-app_names = split(app_names, ",")

 

2. transforms.conf:


[generic_key_value_extraction]
REGEX = (?<_KEY_1>[^=|]+)=(".*?"|[^|]+)
FORMAT = $1::$2
MV_ADD = true
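Outside Splunk, the combined effect of the REPORT regex and the per-field split() calls can be sanity-checked with a quick Python sketch (plain capture groups stand in for Splunk's _KEY_1/$1::$2 convention; this is an illustration, not Splunk's actual extraction engine):

```python
import re

event = ("CEF:0|vendor|product|1.0||||"
         "dst_ip=172.18.20.16,172.18.20.12,172.18.20.13,172.18.20.10|"
         "src_ip=10.1.1.1,10.1.1.2|user_list=alice,bob,charlie|"
         "error_codes=ERR101,ERR102|app_names=Splunk,ServiceNow,Elastic|"
         "location=datacenter-1|priority=high|status=open")

# Same idea as the transforms.conf REGEX: a key is anything up to '=',
# a value is either quoted or runs until the next '|'
pairs = re.findall(r'([^=|]+)=("[^"]*"|[^|]+)', event)

# Mimic the per-field EVAL-<field> = split(<field>, ",") statements
fields = {key: value.split(',') for key, value in pairs}

print(fields['dst_ip'])    # ['172.18.20.16', '172.18.20.12', '172.18.20.13', '172.18.20.10']
print(fields['location'])  # ['datacenter-1']
```

This also shows why the EVAL approach does not scale: the split() has to be repeated for every field name, which is exactly what the question is trying to avoid.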

 


asees
Engager

Below is the complete log:

CEF:0|Honeywell|CyberPredict|1.0||||dst_ip=172.18.30.21,172.18.30.22,172.18.30.23|src_ip=10.10.10.1,10.10.10.2|user_list=alice,bob,charlie|error_codes=ERR201,ERR202|app_names=Splunk,ServiceNow,Elastic|location=datacenter-east|priority=critical|status=active|a.b.1.id=B1|a.b.2.id=B2|a.b.2.type=network|a.b.1.status.online=yes|a.b.3.id=B3|a.b.3.status.online=no


The dot (.) notation in field names represents hierarchical or nested data structures, as shown below in the JSON format:

{
  "a": {
    "b": [
      {
        "id": "B1",
        "type": "network",
        "status": {
          "online": "yes"
        }
      },
      {
        "id": "B2",
        "type": "application",
        "status": {
          "online": "yes"
        }
      },
      {
        "id": "B3",
        "type": "endpoint",
        "status": {
          "online": "no"
        }
      }
    ]
  }
}
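To see how such dotted keys map onto the nested structure, here is a small, hypothetical Python helper (the nest() name and the 1-based index handling are assumptions for illustration — Splunk does not rebuild this nesting for you) that folds flattened key=value pairs back into dicts and lists:

```python
def nest(flat):
    """Fold dotted keys like 'a.b.1.id' into nested dicts/lists.
    Numeric path components are treated as 1-based list indices (assumption)."""
    root = {}
    for dotted, value in flat.items():
        parts = dotted.split('.')
        node = root
        for i, part in enumerate(parts):
            key = int(part) - 1 if part.isdigit() else part
            if isinstance(node, list):
                while len(node) <= key:      # grow the list as needed
                    node.append(None)
            if i == len(parts) - 1:
                node[key] = value            # leaf: assign the value
            else:
                child = node[key] if isinstance(node, list) else node.get(key)
                if child is None:
                    # the next path component decides list vs dict
                    child = [] if parts[i + 1].isdigit() else {}
                    node[key] = child
                node = child
    return root

flat = {"a.b.1.id": "B1", "a.b.2.id": "B2", "a.b.2.type": "network",
        "a.b.1.status.online": "yes", "a.b.3.id": "B3",
        "a.b.3.status.online": "no"}
print(nest(flat))
```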

PickleRick
SplunkTrust

I see CEF I cry 😉

But seriously: instead of extracting or calculating the values (for which you would have to provide the names one by one), you can use the TOKENIZER functionality in fields.conf.

The pro is that fields.conf entries accept wildcards.

The con is that wildcards do their job and match all fields: if you define a tokenizer for _all_ fields in your sourcetype, it will split every field, and there's no way to exclude specific ones.

See https://docs.splunk.com/Documentation/Splunk/latest/Admin/Fieldsconf#.5B.26lt.3Bfield_name.26gt.3B.7... for more info
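Splunk's TOKENIZER applies the regex repeatedly to the field's raw value, and each match of the capture group becomes one value of the multivalue field. Outside Splunk, that repeated-match behavior is roughly what re.findall does (a sketch, not the actual implementation):

```python
import re

value = "172.18.20.16,172.18.20.12,172.18.20.13,172.18.20.10"

# TOKENIZER = ([^,]+): every run of non-comma characters is one value
tokens = re.findall(r'[^,]+', value)
print(tokens)  # ['172.18.20.16', '172.18.20.12', '172.18.20.13', '172.18.20.10']

# A field with no commas simply stays a single value
print(re.findall(r'[^,]+', "datacenter-1"))  # ['datacenter-1']
```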

asees
Engager

@PickleRick
Splunk is not parsing the multi-value items; all the comma-separated values end up as a single value.
I have used these new configs:

1. props.conf:

[my:sourcetype1]
DATETIME_CONFIG = CURRENT
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
REPORT-generic_field_extraction = generic_key_value_extraction

[my:sourcetype2]
DATETIME_CONFIG = CURRENT
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
REPORT-generic_field_extraction = generic_key_value_extraction

2. transforms.conf:

[generic_key_value_extraction]
# Dynamically capture ANY key=value pair, allowing commas inside the value
# The value ends ONLY when a pipe "|" or end of line is reached
REGEX = (?<_KEY_1>[^=|]+)=(".*?"|[^|]+)
FORMAT = $1::$2
# Allow multiple matches for same key
MV_ADD = true

3. fields.conf:

# Apply to ALL fields for both sourcetypes
[*,sourcetype::my:sourcetype1,sourcetype::mysourcetype2]
TOKENIZER = ([^,]+)


PickleRick
SplunkTrust
[*,sourcetype::my:sourcetype1,sourcetype::mysourcetype2]

Wrong syntax

[sourcetype::my_sourcetype1::*]

if you want all fields for my_sourcetype1 (you can't wildcard the sourcetype itself).


asees
Engager

@PickleRick I have tried using this syntax,

[sourcetype::my_sourcetype1::*]

 

But it is still not working for me.

Below is my fields.conf

[sourcetype::my_sourcetype1::*]
TOKENIZER = ([^,]+)




PickleRick
SplunkTrust

You might want to reach out to the support team. The general functionality is there, but it seems to be sensitive to some undocumented stuff.

1. The sourcetype-based definitions are supposed to work (Splunk by default ships definitions for [sourcetype::splunk_resource_usage::data*]), so either it's a long-standing non-working example they've been shipping for a loooong time without noticing, or something changed as the handling of indexed fields improved over the years.

2. Even if I define my TOKENIZER for a field specified by its general name, not sourcetype-bound, it sometimes seems to work and sometimes not.

Example: my data contains job logs from Bareos. Events are multiline and contain a line like

Volume name(s): vchanger-1_1_0002|vchanger-1_1_0004|vchanger-1_1_0005|vchanger-1_1_0007|vchanger-1_1_0006|vchanger-1_1_0008|vchanger-1_1_0009|vchanger-1_1_0010|vchanger-1_1_0011|vchanger-1_1_0012|vchanger-1_1_0013|vchanger-1_1_0014|vchanger-1_1_0015|vchanger-1_1_0016|vchanger-1_1_0017

Since I parse the data in a general way, similarly to your CEF method:

[bareos-content-fields]
SOURCE_KEY = message
REGEX = ^\s+([^:]+):\s*([^\r\n]+?)[\r\n]
FORMAT = $1::$2

I'm getting it parsed out as a field called Volume_name_s_ (after spaces and symbols are automatically replaced with underscores).

Without a tokenizer of course I get a single value with multiple pipe-joined labels.
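The extraction and the field-name cleanup can be reproduced outside Splunk; here Python's re stands in for the REGEX, and a \W-to-underscore substitution approximates Splunk's field-name sanitization (an approximation, not the exact rule Splunk applies):

```python
import re

line = "   Volume name(s): vchanger-1_1_0002|vchanger-1_1_0004|vchanger-1_1_0005\n"

# Same pattern as the [bareos-content-fields] REGEX
m = re.search(r'^\s+([^:]+):\s*([^\r\n]+?)[\r\n]', line)
key, value = m.group(1), m.group(2)

# Approximation of Splunk's field-name cleanup: non-word chars -> '_'
field_name = re.sub(r'\W', '_', key)
print(field_name)  # Volume_name_s_

# Without a tokenizer this stays one pipe-joined value
print(value)
```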

If I define

[Volume_name_s_]
TOKENIZER = (\w+-\d+_\d+_\d+)

in my fields.conf, the tokenizer works properly and splits my list of volumes into a multivalued field.

But when I initially tried an approach similar to yours 

TOKENIZER = ([^|]+)

it wouldn't work.

And I have no idea why. 


PickleRick
SplunkTrust

OK. Scratch that.

It's some quirk of the UI.

Regardless of which version of the TOKENIZER I use, if I do a search

<my_base_search>
| eval mvcount=mvcount(Volume_name_s_)
| table Volume_name_s_ mvcount

I get (whenever applicable) a proper multivalued field in my table and a count of a dozen or so values.

But.

The UI displays the values differently depending on which form I use.

If I use the 

TOKENIZER = (\w+-\d+_\d+_\d+) 

version, when I expand the event contents to see the extracted values, I see each value on a separate line.

If I use the 

TOKENIZER = ([^|]+)

form, all values are crammed into a single line (but they no longer have pipes between them, just spaces).

Strange.
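As far as the value extraction itself goes, both tokenizer forms should produce the identical set of values, which supports the UI-quirk theory. A quick check outside Splunk, assuming re.findall mirrors the repeated-match behavior:

```python
import re

value = "vchanger-1_1_0002|vchanger-1_1_0004|vchanger-1_1_0005"

specific = re.findall(r'\w+-\d+_\d+_\d+', value)  # TOKENIZER = (\w+-\d+_\d+_\d+)
generic = re.findall(r'[^|]+', value)             # TOKENIZER = ([^|]+)

print(specific == generic)  # True: the difference is purely in the UI rendering
```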
