TERM and PREFIX cannot find string with two dashes

PavelP
Motivator

Any ideas on TERM and PREFIX limitations with double dashes?

cat /tmp/test.txt
abc//xyz
abc::xyz
abc==xyz
abc@@xyz
abc..xyz
abc--xyz
abc$$xyz
abc##xyz
abc%%xyz
abc\\xyz
abc__xyz

search abc--xyz # works
TERM(abc--xyz) # doesn't work
TERM(abc*) # works
| tstats count by PREFIX(abc) # doesn't work for abc--xyz

 

 

Both TERM and PREFIX work with other minor segmenters like dots or underscores.

1 Solution

tscroggins
Influencer

Hi @PavelP,

This isn't an issue with TERM or PREFIX but with how Splunk indexes abc--xyz.

We can use walklex to list terms in our index:

| walklex index=main type=term
| table term

We'll find the following:

abc
abc##xyz
abc$$xyz
abc%%xyz
abc..xyz
abc//xyz
abc==xyz
abc@@xyz
abc\\xyz
abc__xyz
xyz

Note that abc--xyz is missing. Let's look at segmenters.conf. The default segmenter stanza is [indexing]:

[indexing]
INTERMEDIATE_MAJORS = false
MAJOR = [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
MINOR = / : = @ . - $ # % \\ _

Note that -- is a major breaker. If we index abc-xyz with a single hyphen, we should find abc-xyz in the list of terms:

abc
abc##xyz
abc$$xyz
abc%%xyz
abc-xyz
abc..xyz
abc//xyz
abc==xyz
abc@@xyz
abc\\xyz
abc__xyz
xyz
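
To make the effect of the two breaker lists concrete, here is a rough Python sketch of two-pass segmentation. It is a toy model, not Splunk's actual implementation: the function name index_terms and the trimmed-down MAJOR list are assumptions for illustration, and real segmentation handles more cases (the URL-encoded breakers, length limits, and so on).

```python
import re

# Simplified subset of the default [indexing] breakers quoted above (assumption:
# the URL-encoded breakers such as %20 are left out to keep the sketch short).
MAJOR = ["--", " ", "\t", "\n", "\r", "[", "]", "<", ">", "(", ")", "{", "}",
         "|", "!", ";", ",", "'", '"', "*", "&", "?", "+"]
MINOR = ["/", ":", "=", "@", ".", "-", "$", "#", "%", "\\", "_"]


def _splitter(breakers):
    # Try longer breakers first so "--" wins over any single character.
    ordered = sorted(breakers, key=len, reverse=True)
    return re.compile("|".join(re.escape(b) for b in ordered))


def index_terms(raw, major=MAJOR, minor=MINOR):
    """Return the set of terms this toy indexer would store for one event."""
    terms = set()
    for segment in _splitter(major).split(raw):
        if not segment:
            continue
        terms.add(segment)  # the whole major segment becomes a term
        # each piece between minor breakers becomes a term as well
        terms.update(t for t in _splitter(minor).split(segment) if t)
    return terms


print(sorted(index_terms("abc..xyz")))  # ['abc', 'abc..xyz', 'xyz']
print(sorted(index_terms("abc--xyz")))  # ['abc', 'xyz']

# Same event, but with "--" removed from the major breakers
no_double_dash = [b for b in MAJOR if b != "--"]
print(sorted(index_terms("abc--xyz", major=no_double_dash)))  # ['abc', 'abc--xyz', 'xyz']
```

With "--" in the major list, the value is cut in two before the minor pass ever sees it, so there is no abc--xyz term for TERM or PREFIX to match.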

If walklex returns a missing merged_lexicon.lex message, we can force optimization of the bucket(s) to generate the data, e.g.:

$SPLUNK_HOME/bin/splunk-optimize-lex -d $SPLUNK_HOME/var/lib/splunk/main/db/hot_v1_0

We can override major breakers in a custom segmenters.conf stanza and reference the stanza in props.conf. Ensure the segmenter name is unique and remove -- from the MAJOR setting:

# segmenters.conf

[tmp_test_txt]
INTERMEDIATE_MAJORS = false
MAJOR = [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D %2520 %5D %5B %3A %0A %2C %28 %29
MINOR = / : = @ . - $ # % \\ _

# props.conf

[source::///tmp/test.txt]
SEGMENTATION = tmp_test_txt

Deploy props.conf and segmenters.conf to both search heads and search peers (indexers).

With the new configuration in place, walklex should return abc--xyz in the list of terms:

abc
abc##xyz
abc$$xyz
abc%%xyz
abc--xyz
abc..xyz
abc//xyz
abc==xyz
abc@@xyz
abc\\xyz
abc__xyz
xyz

We can now use TERM and PREFIX as expected:

| tstats values(PREFIX(abc--)) as vals where index=main TERM(abc--*) by PREFIX(abc--)

abc--    vals
xyz      xyz

As always, we should ask ourselves if changing the default behavior is both required and desired. Isolating the segmentation settings by source or sourcetype will help mitigate risk.

PavelP
Motivator

Hey folks, breaking news for the TERM/PREFIX enthusiasts! Brace yourselves – our TERM searches cannot find Punycode-encoded domains!

🙂

xn--bcher-kva.de

https://en.m.wikipedia.org/wiki/Punycode
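
For anyone wondering why Punycode is hit so reliably: every encoded label starts with the ACE prefix xn--, so the double hyphen is always there. A quick sketch using Python's built-in idna codec (treating "--" as the only breaker in play is a simplification of what the indexer does):

```python
# Encode an internationalized domain name with Python's built-in idna codec.
encoded = "bücher.de".encode("idna").decode("ascii")
print(encoded)  # xn--bcher-kva.de

# With "--" acting as a major breaker, the value is cut in two before indexing,
# so the full domain never exists as a single term.
print(encoded.split("--"))  # ['xn', 'bcher-kva.de']
```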

 

tscroggins
Influencer

Now that is a valid use case for modifying segmentation; however, the impact is wide-reaching. You may also want to look at setting INTERMEDIATE_MAJORS = true, although that could result in a significant indexing performance impact.

Which access log formats and source types do you most commonly use?

PickleRick
SplunkTrust

+1 on that. The impact is limited to where you use the custom segmenter (you set it for a specific props.conf stanza).

PavelP
Motivator

Affected are tstats/TERM/PREFIX and accelerated DM searches. This isn't limited to punycode domains; any value with continuous hyphens may be affected. Consider usernames, user-agents, URL paths and queries, file names, and file paths – the range of affected fields is extensive.

The implications extend to premium apps like Enterprise Security, heavily reliant on accelerated DMs. Virtually every source and sourcetype could be impacted, including commonly used ones like firewall, endpoint, windows, proxy, etc.

Here are a couple of examples to illustrate the issue:

  1. Working URL: hp--community.force.com
  2. Path: /tmp/folder--xyz/test-----123.txt, c:\Windows\Temp\test---abc\abc--123.dat
  3. Username: admin--haha
  4. User-Agent: Mozilla/5.0--findme
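
A rough way to gauge exposure is to look for runs of two or more hyphens in field values. Here is a minimal Python sketch, assuming the values have already been exported somewhere; the helper name has_major_hyphen_run and the control value are made up for illustration:

```python
import re

# Two or more consecutive hyphens always contain "--", a default major breaker.
HYPHEN_RUN = re.compile(r"-{2,}")


def has_major_hyphen_run(value):
    """True if the value would be split at a "--" under default segmentation."""
    return HYPHEN_RUN.search(value) is not None


samples = [
    "hp--community.force.com",
    "/tmp/folder--xyz/test-----123.txt",
    "admin--haha",
    "Mozilla/5.0--findme",
    "single-hyphen.example.com",  # control value, not from the thread
]

affected = [v for v in samples if has_major_hyphen_run(v)]
print(affected)
```

Only the control value with single hyphens is left out of the affected list.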

PavelP
Motivator

Please consider upvoting a new Splunk idea to get it more attention: https://ideas.splunk.com/ideas/EID-I-2226

tscroggins
Influencer

A custom segmenter has merit in this case, but globally, folks will recommend tagging events with an appropriate add-on (or custom configuration) and using an accelerated Web or other data model to find matching URLs, hostnames, etc.

tscroggins
Influencer

I missed your comment re: accelerated data models earlier. The field values should be available at search time, either from _raw or tsidx, and then stored in the summary index. Off the top of my head, I don't know if the segmenters impact INDEXED_EXTRACTIONS = w3c, but they shouldn't impact transforms-based indexed extractions or search-time field extractions from other source types.

PavelP
Motivator

Affected are tstats/TERM/PREFIX searches and accelerated DM searches. I haven't conducted a thorough check yet, but it seems that searches on accelerated DM may overlook fields with double dashes. This isn't limited to punycode domains; any field value with continuous hyphens may be affected. Consider usernames, user-agents, URL paths and queries, file names, and file paths – the range of affected fields is extensive.

The implications extend to premium apps like Enterprise Security, heavily reliant on accelerated DMs. Virtually every source and sourcetype could be impacted, including commonly used ones like firewall, endpoint, windows, proxy, etc.

Here are a couple of examples to illustrate the issue:

  1. Working URL: https://hp--community.force.com
  2. Path: /tmp/back--door/test-----backdoor.txt, c:\Windows\Temp\back--door\test---backdoor.exe
  3. Username: admin--backdoor
  4. User-Agent: Mozilla/5.0--backdoor

PickleRick
SplunkTrust

Nice one. I even checked the spec for segmenters.conf and, while I noticed the single dash as a minor segmenter, I completely missed the double dash. (Though it is "hidden" relatively far into the default declaration and surrounded by all those other entities.)

tscroggins
Influencer

Somewhere in Splunk history, there's a developer who did the lexicographically correct thing knowing it would stymie future Splunkers. Let's raise a glass to the double oblique hyphen (thanks, Wikipedia)!

PickleRick
SplunkTrust

The double oblique hyphen is U+2E17 and looks like this: ⸗

tscroggins
Influencer

It was just a Wikipedia joke: "In Latin script, the double hyphen is a punctuation mark that consists of two parallel hyphens. It was a development of the earlier double oblique hyphen ...." I'm assuming an early developer analyzed a suitable corpus of log content and determined a double hyphen or long dash should be considered a major breaker.

PickleRick
SplunkTrust

Well, the double hyphen is really a poor man's approximation of an em-dash or en-dash, and I don't recall seeing them outside of TeX sources, so I was pretty surprised to find it in segmenters.

Anyway, punctuation is not a part of the script. Many languages using Latin script use (slightly) different punctuation systems, and languages using different scripts (like Cyrillic) use very similar punctuation 😛

But we're drifting heavily off-topic.

isoutamo
SplunkTrust

Nice explanation and nice way to get values to work with tstats!

PickleRick
SplunkTrust

That is interesting. I haven't had an opportunity to test it, but if that is so, it looks like support case material.

See my other reply.
