I have a dataset that uses a character that is not a segmentation breaker (~) to separate meaningful, commonly used search terms.
Sample events
123,SVCA,ABC123,DEF~AP~SOME_SVC123~1.0,10.0.1.2 ,67e15429-e44c-4c27-bc9a-f3462ae67125,,2023-02-10-12:00:28.578,14,ER40011,"Unauthorized"
123,SVCB,DEF456,DEF~LG~Login~1.0,10.0.1.2,cd63b821-a96c-11ed-8a7c-00000a070dc2,cd63b820-a96c-11ed-8a7c-00000a070dc2,2023-02-10-12:00:28.578,10,0,"OK"
123,SVCC,ZHY789,123~XD-ABC~OtherSvc~2.0,10.0.1.2 ,67e15429-e44c-4c27-bc9a-f3462ae67125,,2023-02-10-12:00:28.566,321,ER00000,"Success"
456,ABC1,,DEFAULT~ENTL~ASvc~1.0,10.0.1.2 ,b70a2c11-286f-44da-9013-854acb1599cd,,2023-02-10-11:59:44.830,14,ER00000,"Success"
456,DEF2,,456~LG~Login~v1.0.0,10.0.0.1,27bee310-a843-11ed-a629-db0c7ca6c807,,2023-02-10-11:59:44.666,300,1,"FAIL"
456,ZHY3,ZHY45678,DEF~AB~ANOTHER_SVC121~1.0,10.0.0.1 ,19b79e9b-e2e2-4ba2-a7cf-e65ba8da5e7b,,2023-02-10-11:58:58.813,,27,ER40011,"Unauthorized"
Users will often search for individual items separated by the ~ character. E.g.,
index=myindex sourcetype=the_above_sourcetype *LG*
Because this is a high-volume dataset, my goal is to reduce the need for leading wildcards in most searches by adding '~' as a minor breaker at index time.
I've tried the props.conf and segmenters.conf settings below without success. Could anyone provide any insight?
<indexer>
SPLUNK_HOME/etc/apps/myapp/local/props.conf
[the_above_sourcetype]
SHOULD_LINEMERGE=false
LINE_BREAKER=([\r\n]+)
TIME_PREFIX = ^([^,]*,){7}
TIME_FORMAT = %Y-%m-%d-%H:%M:%S.%3Q
TRUNCATE = 10000
MAX_TIMESTAMP_LOOKAHEAD=50
SEGMENTATION = my-custom-segmenter
SPLUNK_HOME/etc/apps/myapp/local/segmenters.conf
[my-custom-segmenter]
MINOR = / : = @ . - $ # % \\ _ ~ %7E
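As a quick check that TIME_PREFIX skips to the timestamp field in the sample events, here is a small Python sketch. It only mimics the settings and is not how Splunk parses timestamps internally; the function name extract_timestamp is made up for illustration, and I am assuming Python's %f is a fair stand-in for Splunk's %3Q (milliseconds).

```python
import re
from datetime import datetime

# Same regex as TIME_PREFIX in props.conf: skip the first seven comma-separated fields.
TIME_PREFIX = re.compile(r"^([^,]*,){7}")
# Python approximation of TIME_FORMAT; %f stands in for Splunk's %3Q (assumption).
PY_FORMAT = "%Y-%m-%d-%H:%M:%S.%f"


def extract_timestamp(event):
    """Return the timestamp found right after the prefix, or None if the prefix does not match."""
    match = TIME_PREFIX.match(event)
    if match is None:
        return None
    raw = event[match.end():].split(",", 1)[0]
    return datetime.strptime(raw, PY_FORMAT)


sample = ('123,SVCA,ABC123,DEF~AP~SOME_SVC123~1.0,10.0.1.2 ,'
          '67e15429-e44c-4c27-bc9a-f3462ae67125,,2023-02-10-12:00:28.578,14,ER40011,"Unauthorized"')
print(extract_timestamp(sample))
```

If this lines up with the _time values on the indexed events, the timestamp part of the stanza is at least consistent with the data.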
I added these settings and restarted my test instance, but this search still does not return events like the one below:
index=myindex sourcetype=the_above_sourcetype LG
Searching for *LG* as a term, however, does return it:
456,DEF2,,456~LG~Login~v1.0.0,10.0.0.1,27bee310-a843-11ed-a629-db0c7ca6c807,,2023-02-10-11:59:44.666,300,1,"FAIL"
Some of the relevant documentation and rationale for what I've tried:
https://docs.splunk.com/Documentation/Splunk/9.0.3/Data/Setthesegmentationforeventdata
Index-time segmentation
The SEGMENTATION attribute determines the segmentation type used at index time. Here's the syntax:
[<spec>]
SEGMENTATION = <seg_rule>
[<spec>] can be:
<sourcetype>: A source type in your event data.
host::<host>: A host value in your event data.
source::<source>: A source of your event data.
SEGMENTATION = <seg_rule>
This specifies the type of segmentation to use at index time for [<spec>] events.
<seg_rule>
A segmentation type, or "rule", defined in segmenters.conf
Common settings are inner, outer, none, and full, but the default file contains other predefined segmentation rules as well.
Create your own custom rule by editing $SPLUNK_HOME/etc/system/local/segmenters.conf, as described in "Configure segmentation types".
https://docs.splunk.com/Documentation/Splunk/9.0.3/Admin/Segmentersconf
* Name your stanza.
* Follow this stanza name with any number of the following setting/value pairs.
* If you don't specify a setting/value pair, Splunk will use the default.
MAJOR = <space separated list of breaking characters>
* Set major breakers.
* Major breakers are words, phrases, or terms in your data that are surrounded by set breaking characters.
* By default, major breakers are set to most characters and blank spaces.
* Typically, major breakers are single characters.
* Note: \s represents a space; \n, a newline; \r, a carriage return; and \t, a tab.
* Default is [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
MINOR = <space separated list of strings>
* Specifies minor breakers.
* In addition to the segments specified by the major breakers, for each minor breaker found, Splunk indexes the token from the last major breaker to the current minor breaker and from the last minor breaker to the current minor breaker.
* Default: / : = @ . - $ # % \\ _
I wrote the custom segmenters.conf stanza to inherit the default values for everything except the MINOR attribute, to which I simply appended ~ and its URL-encoded form, %7E.
However, this did not segment my data properly at index time.
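To make explicit what I expect the MINOR rule quoted above to produce, here is a rough Python model of my reading of it. This is a sketch, not Splunk's actual tokenizer: the MAJOR set is a simplified single-character subset of the documented default, the function names are made up, and I am assuming indexed terms are lowercased.

```python
# Simplified, single-character subset of the documented default MAJOR breakers (assumption).
MAJOR = set(" \t\r\n[]<>(){}|!;,'\"*&?+")
# Documented default MINOR breakers, and the same set with '~' appended.
DEFAULT_MINOR = set("/:=@.-$#%\\_")
CUSTOM_MINOR = DEFAULT_MINOR | {"~"}


def split_on(text, breakers):
    """Split text on any breaker character, dropping empty pieces."""
    pieces, current = [], []
    for ch in text:
        if ch in breakers:
            if current:
                pieces.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))
    return pieces


def index_tokens(event, minor):
    """Model of the documented rule: major segments, prefixes up to each minor breaker, minor pieces."""
    tokens = set()
    for segment in split_on(event, MAJOR):
        tokens.add(segment)  # the whole major segment
        for i, ch in enumerate(segment):
            if ch in minor and i > 0:
                tokens.add(segment[:i])  # last major breaker up to this minor breaker
        tokens.update(split_on(segment, minor))  # pieces between minor breakers
    return {t.lower() for t in tokens}  # assumption: terms are stored lowercased


event = ('456,DEF2,,456~LG~Login~v1.0.0,10.0.0.1,'
         '27bee310-a843-11ed-a629-db0c7ca6c807,,2023-02-10-11:59:44.666,300,1,"FAIL"')
default_tokens = index_tokens(event, DEFAULT_MINOR)
custom_tokens = index_tokens(event, CUSTOM_MINOR)
print("lg" in default_tokens)  # False
print("lg" in custom_tokens)   # True
```

Under this model, the default breakers never produce lg as a standalone term, which would explain why only *LG* matches today; with ~ added as a minor breaker I would expect a plain LG search to match, and that is the behavior I am not seeing.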