I’m working with CEF logs in Splunk where some fields contain comma-separated values.
I'm looking for a generic solution so that any field containing comma-separated values is automatically treated as a true multi-value field at extraction time, without having to define each field name individually in props.conf.
Example event:
CEF:0|vendor|product|1.0||||dst_ip=172.18.20.16,172.18.20.12,172.18.20.13,172.18.20.10|src_ip=10.1.1.1,10.1.1.2|user_list=alice,bob,charlie|error_codes=ERR101,ERR102|app_names=Splunk,ServiceNow,Elastic|location=datacenter-1|priority=high|status=open
Current config
1. props.conf:
[my:sourcetype]
DATETIME_CONFIG = CURRENT
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
REPORT-generic_field_extraction = generic_key_value_extraction
EVAL-dst_ip = split(dst_ip, ",")
EVAL-src_ip = split(src_ip, ",")
EVAL-user_list = split(user_list, ",")
EVAL-error_codes = split(error_codes, ",")
EVAL-app_names = split(app_names, ",")
2. transforms.conf:
[generic_key_value_extraction]
REGEX = (?<_KEY_1>[^=|]+)=(".*?"|[^|]+)
FORMAT = $1::$2
MV_ADD = true
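To illustrate the limitation of this approach: only the fields I explicitly list with an EVAL- get split, and every new comma-separated field needs its own line. A quick check (the base search is a placeholder):

<my_base_search>
| eval n=mvcount(dst_ip)
| table dst_ip n

dst_ip comes out as a proper multivalued field with four values for the sample event, but any comma-separated field I forgot to list (say, a new host_list field in tomorrow's events) would remain a single comma-joined string.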
Below is the complete log:
CEF:0|Honeywell|CyberPredict|1.0||||dst_ip=172.18.30.21,172.18.30.22,172.18.30.23|src_ip=10.10.10.1,10.10.10.2|user_list=alice,bob,charlie|error_codes=ERR201,ERR202|app_names=Splunk,ServiceNow,Elastic|location=datacenter-east|priority=critical|status=active|a.b.1.id=B1|a.b.2.id=B2|a.b.2.type=network|a.b.1.status.online=yes|a.b.3.id=B3|a.b.3.status.online=no
The dot (.) notation in field names represents hierarchical or nested data structures, as shown below in the JSON format:
{
"a": {
"b": [
{
"id": "B1",
"type": "network",
"status": {
"online": "yes"
}
},
{
"id": "B2",
"type": "application",
"status": {
"online": "yes"
}
},
{
"id": "B3",
"type": "endpoint",
"status": {
"online": "no"
}
}
]
}
}
I see CEF I cry 😉
But seriously - instead of extracting or calculating the values (for which you have to provide the field names), you can use the TOKENIZER functionality in fields.conf.
The pro is that fields.conf entries accept wildcards.
The con is that wildcards do their job and match every field they can, so if you define a tokenizer for _all_ fields in your sourcetype, it will split all of them; there's no way to exclude specific fields.
See https://docs.splunk.com/Documentation/Splunk/latest/Admin/Fieldsconf#.5B.26lt.3Bfield_name.26gt.3B.7... for more info
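For example (the wildcarded field name here is just an illustration), a fields.conf stanza like this would split every field whose name ends in _list on commas:

[*_list]
TOKENIZER = ([^,]+)

Each match of the TOKENIZER regex becomes one value of the resulting multivalued field.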
@PickleRick
Splunk is still not parsing the multi-value items; all the comma-separated values are being extracted as a single value.
I have used this new configs:
1. props.conf:
[my:sourcetype1]
DATETIME_CONFIG = CURRENT
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
REPORT-generic_field_extraction = generic_key_value_extraction
[my:sourcetype2]
DATETIME_CONFIG = CURRENT
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
REPORT-generic_field_extraction = generic_key_value_extraction
2. transforms.conf:
[generic_key_value_extraction]
# Dynamically capture ANY key=value pair, allowing commas inside the value
# The value ends ONLY when a pipe "|" or end of line is reached
REGEX = (?<_KEY_1>[^=|]+)=(".*?"|[^|]+)
FORMAT = $1::$2
# Allow multiple matches for same key
MV_ADD = true
3. fields.conf:
# Apply to ALL fields for both sourcetypes
[*,sourcetype::my:sourcetype1,sourcetype::mysourcetype2]
TOKENIZER = ([^,]+)
[*,sourcetype::my:sourcetype1,sourcetype::mysourcetype2]
That's the wrong syntax. It should be
[sourcetype::my_sourcetype1::*]
if you want all fields for my_sourcetype1 (you can't wildcard the sourcetype itself).
@PickleRick I have tried using this syntax,
[sourcetype::my_sourcetype1::*]
But still it is not working for me.
Below is my fields.conf
[sourcetype::my_sourcetype1::*]
TOKENIZER = ([^,]+)
You might want to reach out to the support team. The general functionality is there, but it seems to be sensitive to some undocumented details.
1. The sourcetype-based definitions are supposed to work (Splunk by default ships definitions like [sourcetype::splunk_resource_usage::data*]), so either it's a long-standing non-working example that they've been shipping for a looong time without anyone noticing, or the handling of indexed fields has simply improved over the years.
2. Even if I define my TOKENIZER for a field specified by its plain name, not bound to a sourcetype, it sometimes works and sometimes doesn't.
Example - my data contains job logs from Bareos. The events are multiline and contain a line like:
Volume name(s): vchanger-1_1_0002|vchanger-1_1_0004|vchanger-1_1_0005|vchanger-1_1_0007|vchanger-1_1_0006|vchanger-1_1_0008|vchanger-1_1_0009|vchanger-1_1_0010|vchanger-1_1_0011|vchanger-1_1_0012|vchanger-1_1_0013|vchanger-1_1_0014|vchanger-1_1_0015|vchanger-1_1_0016|vchanger-1_1_0017
Since I parse the data in a general way, similarly to your CEF method:
[bareos-content-fields]
SOURCE_KEY = message
REGEX = ^\s+([^:]+):\s*([^\r\n]+?)[\r\n]
FORMAT = $1::$2
I'm getting it parsed out as a field called Volume_name_s_ (after all spaces and symbols are automatically replaced with underscores).
Without a tokenizer of course I get a single value with multiple pipe-joined labels.
If I define
[Volume_name_s_]
TOKENIZER = (\w+-\d+_\d+_\d+)
in my fields.conf, the tokenizer works properly and splits my list of volumes into a multivalued field.
But when I initially tried an approach similar to yours
TOKENIZER = ([^|]+)
it wouldn't work.
And I have no idea why.
OK. Scratch that.
It's some quirk of the UI.
Regardless of which version of the TOKENIZER I use, if I do a search
<my_base_search>
| eval mvcount=mvcount(Volume_name_s_)
| table Volume_name_s_ mvcount
I get (whenever applicable) a proper multivalued field in my table and a count of a dozen or so values.
But.
The UI displays the values differently depending on which form I use.
If I use the
TOKENIZER = (\w+-\d+_\d+_\d+)
version, when I expand the event contents to see the extracted values, I see each value on a separate line.
If I use the
TOKENIZER = ([^|]+)
form, all values are crammed into a single line (but they no longer have pipes between them, just spaces).
Strange.