Any ideas on TERM and PREFIX limitations with double dashes?
cat /tmp/test.txt
abc//xyz
abc::xyz
abc==xyz
abc@@xyz
abc..xyz
abc--xyz
abc$$xyz
abc##xyz
abc%%xyz
abc\\xyz
abc__xyz
search abc--xyz # works
TERM(abc--xyz) # doesn't work
TERM(abc*) # works
| tstats count by PREFIX(abc) # doesn't work for abc--xyz
Both TERM and PREFIX work with other minor segmenters like dots or underscores.
Hi @PavelP,
This isn't an issue with TERM or PREFIX but with how Splunk indexes abc--xyz.
We can use walklex to list terms in our index:
| walklex index=main type=term
| table term
We'll find the following:
abc
abc##xyz
abc$$xyz
abc%%xyz
abc..xyz
abc//xyz
abc==xyz
abc@@xyz
abc\\xyz
abc__xyz
xyz
Note that abc--xyz is missing. Let's look at segmenters.conf. The default segmenter stanza is [indexing]:
[indexing]
INTERMEDIATE_MAJORS = false
MAJOR = [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
MINOR = / : = @ . - $ # % \\ _
Note that -- is a major breaker. If we index abc-xyz with a single hyphen, we should find abc-xyz in the list of terms:
abc
abc##xyz
abc$$xyz
abc%%xyz
abc-xyz
abc..xyz
abc//xyz
abc==xyz
abc@@xyz
abc\\xyz
abc__xyz
xyz
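To make the effect of the -- major breaker concrete, here is a rough Python sketch of two-pass segmentation. It is a simplified model, not Splunk's actual indexer: it assumes the lexicon holds each major segment plus the pieces left after splitting that segment on minor breakers, which matches the walklex output above. The names parse_breakers, build_splitter, and terms are made up for illustration.

```python
import re

# Breaker lists copied from the default [indexing] stanza quoted above.
DEFAULT_MAJOR = r"""[ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29"""
DEFAULT_MINOR = r"""/ : = @ . - $ # % \\ _"""

# segmenters.conf escape sequences mapped to the characters they stand for.
ESCAPES = {r"\n": "\n", r"\r": "\r", r"\s": " ", r"\t": "\t", r"\\": "\\"}


def parse_breakers(spec):
    """Turn a space-separated breaker list into literal strings."""
    return [ESCAPES.get(tok, tok) for tok in spec.split()]


def build_splitter(breakers):
    """Compile a regex that matches any breaker, longest first."""
    ordered = sorted(set(breakers), key=len, reverse=True)
    return re.compile("|".join(re.escape(b) for b in ordered))


def terms(text, major_spec=DEFAULT_MAJOR, minor_spec=DEFAULT_MINOR):
    """Approximate the terms indexed for `text` (assumed model, see lead-in)."""
    major_re = build_splitter(parse_breakers(major_spec))
    minor_re = build_splitter(parse_breakers(minor_spec))
    found = set()
    for segment in major_re.split(text.lower()):
        if not segment:
            continue
        found.add(segment)  # the whole major segment
        found.update(p for p in minor_re.split(segment) if p)  # minor pieces
    return sorted(found)


print(terms("abc--xyz"))  # ['abc', 'xyz']
print(terms("abc-xyz"))   # ['abc', 'abc-xyz', 'xyz']
```

Under this model, abc--xyz never survives as a single term because the string is cut at -- before minor breakers are considered, whereas abc-xyz stays intact as a major segment.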
If walklex returns a missing merged_lexicon.lex message, we can force optimization of the bucket(s) to generate the data, e.g.:
$SPLUNK_HOME/bin/splunk-optimize-lex -d $SPLUNK_HOME/var/lib/splunk/main/db/hot_v1_0
We can override major breakers in a custom segmenters.conf stanza and reference the stanza in props.conf. Ensure the segmenter name is unique and remove -- from the MAJOR setting:
# segmenters.conf
[tmp_test_txt]
INTERMEDIATE_MAJORS = false
MAJOR = [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D %2520 %5D %5B %3A %0A %2C %28 %29
MINOR = / : = @ . - $ # % \\ _
# props.conf
[source::///tmp/test.txt]
SEGMENTATION = tmp_test_txt
Deploy props.conf and segmenters.conf to both search heads and search peers (indexers).
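Hand-editing the long MAJOR line is error-prone, so a small script can derive it from the default instead. The following is only a sketch: drop_breaker and render_stanza are hypothetical helper names, and the default lists are copied from the [indexing] excerpt above rather than read from a live installation.

```python
# Default breaker lists, copied from the [indexing] stanza quoted earlier.
DEFAULT_MAJOR = r"""[ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29"""
DEFAULT_MINOR = r"""/ : = @ . - $ # % \\ _"""


def drop_breaker(major_spec, breaker):
    """Return the breaker list with exactly one token removed."""
    return " ".join(tok for tok in major_spec.split() if tok != breaker)


def render_stanza(name, major_spec, minor_spec):
    """Render a segmenters.conf stanza as text (stanza name is the caller's choice)."""
    return "\n".join([
        f"[{name}]",
        "INTERMEDIATE_MAJORS = false",
        f"MAJOR = {major_spec}",
        f"MINOR = {minor_spec}",
    ])


custom_major = drop_breaker(DEFAULT_MAJOR, "--")
print(render_stanza("tmp_test_txt", custom_major, DEFAULT_MINOR))
```

Comparing the printed MAJOR line against the default is a quick check that -- is the only breaker that changed.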
With the new configuration in place, walklex should return abc--xyz in the list of terms:
abc
abc##xyz
abc$$xyz
abc%%xyz
abc--xyz
abc..xyz
abc//xyz
abc==xyz
abc@@xyz
abc\\xyz
abc__xyz
xyz
We can now use TERM and PREFIX as expected:
| tstats values(PREFIX(abc--)) as vals where index=main TERM(abc--*) by PREFIX(abc--)
abc-- | vals |
xyz | xyz |
As always, we should ask ourselves if changing the default behavior is both required and desired. Isolating the segmentation settings by source or sourcetype will help mitigate risk.
Hey folks, breaking news for the TERM/PREFIX enthusiasts! Brace yourselves – our TERM searches cannot find punycode encoded domains!
🙂
xn--bcher-kva.de
https://en.m.wikipedia.org/wiki/Punycode
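For anyone wondering where the double hyphen comes from: the ASCII form of an internationalized domain label always starts with the xn-- prefix, so every such label contains the default major breaker. A minimal Python sketch using the standard library's idna codec (which implements the older IDNA 2003 rules) shows this; to_ascii and labels_with_double_hyphen are illustrative names only.

```python
def to_ascii(domain):
    """Encode a Unicode domain to its ASCII (punycode) form."""
    return domain.encode("idna").decode("ascii")


def labels_with_double_hyphen(domain):
    """List the labels of the ASCII form that contain '--'."""
    return [label for label in to_ascii(domain).split(".") if "--" in label]


print(to_ascii("b\u00fccher.de"))                   # xn--bcher-kva.de
print(labels_with_double_hyphen("b\u00fccher.de"))  # ['xn--bcher-kva']
print(labels_with_double_hyphen("example.com"))     # []
```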
Now that is a valid use case for modifying segmentation; however, the impact is wide-reaching. You may also want to look at setting INTERMEDIATE_MAJORS = true, although that could result in a significant indexing performance impact.
Which access log formats and source types do you most commonly use?
+1 on that. The impact is limited to where you use the custom segmenter (you set it for a specific props.conf stanza).
Affected are tstats/TERM/PREFIX and accelerated DM searches. This isn't limited to punycode domains; any value with consecutive hyphens may be affected. Consider usernames, user-agents, URL paths and queries, file names, and file paths – the range of affected fields is extensive.
The implications extend to premium apps like Enterprise Security, heavily reliant on accelerated DMs. Virtually every source and sourcetype could be impacted, including commonly used ones like firewall, endpoint, windows, proxy, etc.
Here are a couple of examples to illustrate the issue:
Please consider upvoting a new Splunk idea so it gets more attention: https://ideas.splunk.com/ideas/EID-I-2226
A custom segmenter has merit in this case, but globally, folks will recommend tagging events with an appropriate add-on (or custom configuration) and using an accelerated Web or other data model to find matching URLs, hostnames, etc.
I missed your comment re: accelerated data models earlier. The field values should be available at search time, either from _raw or tsidx, and then stored in the summary index. Off the top of my head, I don't know if the segmenters impact INDEXED_EXTRACTIONS = w3c, but they shouldn't impact transforms-based indexed extractions or search-time field extractions from other source types.
Affected are tstats/TERM/PREFIX searches and accelerated DM searches. I haven't conducted a thorough check yet, but it seems that searches on accelerated DMs may overlook fields with double dashes. This isn't limited to punycode domains; any field value with consecutive hyphens may be affected. Consider usernames, user-agents, URL paths and queries, file names, and file paths – the range of affected fields is extensive.
The implications extend to premium apps like Enterprise Security, heavily reliant on accelerated DMs. Virtually every source and sourcetype could be impacted, including commonly used ones like firewall, endpoint, windows, proxy, etc.
Here are a couple of examples to illustrate the issue:
Nice one. I even checked the spec for segmenters.conf, and while I noticed the single dash as a minor segmenter, I completely missed the double dash. (Though it is "hidden" relatively far into the default declaration and surrounded by all those other entities.)
Somewhere in Splunk history, there's a developer who did the lexicographically correct thing knowing it would stymie future Splunkers. Let's raise a glass to the double oblique hyphen (thanks, Wikipedia)!
Double oblique hyphen is U+2E17 and looks like this: ⸗
It was just a Wikipedia joke: "In Latin script, the double hyphen ⹀ is a punctuation mark that consists of two parallel hyphens. It was a development of the earlier double oblique hyphen ...." I'm assuming an early developer analyzed a suitable corpus of log content and determined a double hyphen or long dash should be considered a major breaker.
Well, a double hyphen is really a poor man's approximation of an em-dash or en-dash, and I don't recall seeing them outside of TeX sources, so I was pretty surprised to find it in segmenters.
Anyway, punctuation is not a part of the script. Many languages using Latin script use (slightly) different punctuation systems, and languages using different scripts (like Cyrillic) use very similar punctuation 😛
But we're drifting heavily off-topic.
That is interesting. I haven't had the opportunity to test it, but if that's the case, it looks like support case material.
See my other reply.