Extracting "_" delimited fields from source file name (regex101.com)
([^\/]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_application_([^_]+)_([^_]+)_([^_]+)\.[c][s][v]
/data/input/account_network_system_interface_host_application_special-source-type_timestamp_seqnum.csv
Group 1. 12-19 account
Group 2. 20-26 network
Group 3. 27-33 system
Group 4. 34-43 interface
Group 5. 44-48 host
Group 6. 53-64 special-source-type
Group 7. 65-74 timestamp
Group 8. 75-81 seqnum
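With the forum-stripped underscores restored, the extraction can be sanity-checked outside Splunk. A minimal Python sketch against the sample path above (`\.csv` here is equivalent to the post's `\.[c][s][v]`):

```python
import re

# Field-extraction pattern from the post, with the stripped underscores
# restored. [^\/]+ lets group 1 match anything without a slash; [^_]+
# keeps each remaining field to a single "_"-delimited token.
PATTERN = re.compile(
    r"([^\/]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)"
    r"_application_([^_]+)_([^_]+)_([^_]+)\.csv"
)

path = ("/data/input/account_network_system_interface_host"
        "_application_special-source-type_timestamp_seqnum.csv")

fields = dict(zip(
    ["account", "network", "system", "interface", "host",
     "type", "timestamp", "seqnum"],
    PATTERN.search(path).groups(),
))
print(fields["type"])  # special-source-type
```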
Splunk ingest:
#
# inputs.conf
#
[batch:///data/input/*_*_*_*_*_application_*_*_*.csv]
sourcetype = application
disabled = 0
move_policy = sinkhole
crcSalt = <SOURCE>
#
# props.conf
#
[application]
...
TRANSFORMS-application-auto-type = application-auto-type
...
#
# transforms.conf
#
...
[application-auto-type]
SOURCE_KEY = MetaData:Source
DEST_KEY = MetaData:Sourcetype
REGEX = ([^\/]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_application_([^_]+)_([^_]+)_([^_]+)\.[c][s][v]
FORMAT = sourcetype::application_$6
WRITE_META = true
...
Result:
sourcetype = application_special-source-type (sourcetype field may have 0, 1, 2, or more "-"s)
Question:
How does one replace the "-" with "_" at index-time so sourcetype = application_special_source_type?
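For reference, the transformation being asked for is just a global substitution on the captured group. In Python terms:

```python
# The desired index-time rewrite, expressed outside Splunk: take the
# captured sourcetype suffix ($6) and globally swap dashes for underscores.
suffix = "special-source-type"          # what $6 captures from the file name
sourcetype = "application_" + suffix.replace("-", "_")
print(sourcetype)  # application_special_source_type
```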
I figured out a way to do what I was trying to do. I used a REGEX to grab the analyst-specified sourcetype field from the source file name. Since underscores separate the fields in the source file name, we had to use dashes instead of underscores as separators inside the sourcetype field. To replace the dashes with underscores in the sourcetype at index time, I used props and transforms to iterate over the captured field and replace the dashes one at a time. There may be a better way; if anyone has a suggestion, please chime in. This method currently supports sourcetypes specified with up to eight dashes. I would love to see something in transforms like "REPLACE = s/-/_/g".
inputs.conf - Ingest any CSV file generated by an analyst with proper naming convention
[batch:///opt/splunk_input/input/*_*_*_*_*_analyst_*_*_*.csv]
sourcetype = analyst
move_policy = sinkhole
crcSalt = <SOURCE>
disabled = 0
props.conf - Parse the analyst-generated file using the required timestamp field, extract the sourcetype from the source field (the token following "analyst"), change up to eight (8) dashes to underscores in the sourcetype, and add the prefix "analyst_". The replacement pass always runs eight (8) times; it just works out that when a match is not found, the keys I needed are not overwritten.
[analyst]
TRUNCATE = 0
SHOULD_LINEMERGE = false
DATETIME_CONFIG =
MAX_TIMESTAMP_LOOKAHEAD = 4096
INDEXED_EXTRACTIONS = CSV
TIMESTAMP_FIELDS = ts, _time, time
NO_BINARY_CHECK = false
category = Structured
pulldown_type = 1
TRANSFORMS-auto_analyst_set_fields = set_analyst_fields
TRANSFORMS-auto_analyst_set_host = set_analyst_host_to_sensor
TRANSFORMS-auto_analyst_set_index = set_index_for_analyst_sensor
TRANSFORMS-auto_analyst_set_sourcetype = set_var01_to_type, \
var01_dash_to_var02_underscore, \
var02_to_var01, \
var01_dash_to_var02_underscore, \
var02_to_var01, \
var01_dash_to_var02_underscore, \
var02_to_var01, \
var01_dash_to_var02_underscore, \
var02_to_var01, \
var01_dash_to_var02_underscore, \
var02_to_var01, \
var01_dash_to_var02_underscore, \
var02_to_var01, \
var01_dash_to_var02_underscore, \
var02_to_var01, \
var01_dash_to_var02_underscore, \
var02_to_var01, \
var01_to_sourcetype
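The eight-pass chain above can be modeled outside Splunk to see why it terminates safely. A Python sketch of the mechanics (not the author's code):

```python
import re

# Model of one var01_dash_to_var02_underscore pass: [^-]+ stops at the
# first dash and [^.]+ swallows the rest, so each pass converts exactly
# one dash. When no dash remains the regex fails to match, the transform
# writes nothing, and the previous value survives -- the "keys were not
# overwritten" behavior the props.conf note relies on.
DASH_PASS = re.compile(r"_([^-]+)-([^.]+)")

def one_pass(var01: str) -> str:
    m = DASH_PASS.match(var01)
    return "_{}_{}".format(m.group(1), m.group(2)) if m else var01

var01 = "_special-source-type"   # set_var01_to_type output (FORMAT = _$6)
for _ in range(8):               # the props.conf chain repeats eight times
    var01 = one_pass(var01)

sourcetype = "analyst" + var01   # var01_to_sourcetype: sourcetype::analyst_$1
print(sourcetype)                # analyst_special_source_type
```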
transforms.conf
#
# Analyst
#
# File Name Fields: client_collection_system_tag_sensor_analyst_type_timestamp_seqnum.csv
#
# REGEX: ([^\/]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_analyst_([^_]+)_([^_]+)_([^_]+)\.csv
#
# Match Groups: <$1>_<$2>_<$3>_<$4>_<$5>_analyst_<$6>_<$7>_<$8>.csv
#
#
[accepted_keys]
var01_key = _var01
var02_key = _var02
#
# Referenced in props.conf [analyst]
#
[set_analyst_fields]
SOURCE_KEY = MetaData:Source
REGEX = ([^\/]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_analyst_([^_]+)_([^_]+)_([^_]+)\.[c][s][v]
FORMAT = analyst_client::$1 analyst_collection::$2 analyst_system::$3 analyst_tag::$4
WRITE_META = true
[set_analyst_host_to_sensor]
SOURCE_KEY = MetaData:Source
DEST_KEY = MetaData:Host
REGEX = ([^\/]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_analyst_([^_]+)_([^_]+)_([^_]+)\.[c][s][v]
FORMAT = host::$5
DEFAULT_VALUE = unknown_analyst_host
[set_index_for_analyst_sensor]
SOURCE_KEY = MetaData:Source
DEST_KEY = _MetaData:Index
REGEX = ([^\/]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_analyst_([^_]+)_([^_]+)_([^_]+)\.[c][s][v]
FORMAT = idx_$5
DEFAULT_VALUE = unknown_analyst_index
[set_var01_to_type]
SOURCE_KEY = MetaData:Source
DEST_KEY = _var01
REGEX = ([^\/]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_analyst_([^_]+)_([^_]+)_([^_]+)\.[c][s][v]
FORMAT = _$6
[var01_dash_to_var02_underscore]
SOURCE_KEY = _var01
DEST_KEY = _var02
REGEX = _([^-]+)-([^.]+)
FORMAT = _$1_$2
[var02_to_var01]
SOURCE_KEY = _var02
DEST_KEY = _var01
REGEX = ([^.]+)
FORMAT = $1
[var01_to_sourcetype]
SOURCE_KEY = _var01
DEST_KEY = MetaData:Sourcetype
REGEX = _([^.]+)
FORMAT = sourcetype::analyst_$1
DEFAULT_VALUE = unknown_analyst_sourcetype
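Putting the stanzas together, the metadata they assign can be sketched end to end. The source path below is hypothetical, chosen only to fit the naming convention:

```python
import re

# End-to-end sketch of the metadata the transforms above would assign.
NAME = re.compile(
    r"([^\/]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)"
    r"_analyst_([^_]+)_([^_]+)_([^_]+)\.csv"
)

source = ("/opt/splunk_input/input/"
          "acme_netflow_ids_blue_sensor01_analyst_top-talkers_1700000000_0001.csv")
client, collection, system, tag, sensor, stype, ts, seq = NAME.search(source).groups()

meta = {
    "host": sensor,                                      # set_analyst_host_to_sensor
    "index": "idx_" + sensor,                            # set_index_for_analyst_sensor
    "sourcetype": "analyst_" + stype.replace("-", "_"),  # the full var01/var02 chain
}
print(meta["sourcetype"])  # analyst_top_talkers
```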
fields.conf
[analyst_client]
INDEXED = false
[analyst_collection]
INDEXED = false
[analyst_system]
INDEXED = false
[analyst_tag]
INDEXED = true
I realized after posting this question that the post text formatting removed the underscores "_" in my REGEX examples. Just know they are there and the REGEX works. I just cannot figure out how to modify the fields I have captured in REGEX groups to change dashes "-" to underscores "_". I feel I need a "replace" function that works at index time in combination with the REGEX, or a way to have an additional REGEX capture the variable number of groups within an extracted field and then let me concatenate the dash-separated groups with underscores.
Should be:
...
FORMAT = sourcetype::application_$6
...
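For what it's worth, newer Splunk releases come close to the wished-for "REPLACE = s/-/_/g": transforms.conf supports INGEST_EVAL (Splunk 7.2+), and the eval replace() function does a global regex substitution. A hedged sketch, not tested here (the stanza name is made up; verify that your version allows rewriting sourcetype this way):

```ini
# transforms.conf -- hypothetical stanza name; requires INGEST_EVAL (Splunk 7.2+)
[analyst_dashes_to_underscores]
INGEST_EVAL = sourcetype=replace(sourcetype, "-", "_")
```

If that works in your environment, it replaces the entire eight-pass var01/var02 chain with a single transform.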