Splunk Search

Why am I getting inconsistent event counts when using wildcard characters to match event field values?

splunkIT
Splunk Employee

For example, I have indexed the following six events, and Splunk has extracted the fields successfully:

    2015-10-30T15:40:25.142770+01:00  INFO os_pid="21352"  project="g02lox14d1x77mvvacm9p1lkcpv1au9j" status="FINISHED"   severity=notice  app="uat_staging-mgmr"
    2015-10-30T15:40:24.818732+01:00  INFO os_pid="21352" project="g02lox14d1x77mvvacm9p1lkcpv1au9j"  status="START"   severity=notice  app=uat_staging-mgmr
    2015-10-30T15:40:23.873253+01:00  INFO os_pid="645" project_name="aktp4wwo6oct7ia51r9ifv6trqarrujv" status="START" severity=notice app="uat_staging-mgmr"
    2015-10-30T15:40:23.798752+01:00  INFO os_pid="645" project="aktp4wwo6oct7ia51r9ifv6trqarrujv" status="START"   severity=notice  app=uat_staging-mgmr
    2015-10-30T17:41:51.877989+01:00  INFO os_pid="645" project="aktp4wwo6oct7ia51r9ifv6trqarrujv"  status="FINISHED"   severity=notice  app=uat_staging-mgmr
    2015-10-30T17:41:51.852251+01:00  INFO os_pid="645" project="aktp4wwo6oct7ia51r9ifv6trqarrujv"  status="FINISHED"   severity=notice  app="uat_staging-mgmr"

However, when I search using a wildcard for the app field, I get inconsistent results:

Test-1

index="blah" sourcetype="gooddata" app="uat*" 

Result count: 6

Test-2

index="blah" sourcetype="gooddata" app="uat*staging-mgmr" 

Result count: 0

Test-3

 index="blah" sourcetype="gooddata" app="uat_staging*mgmr" 

Result count: 0

Test-4

 index="blah" sourcetype="gooddata" app="uat*staging*mgmr" 

Result count: 3

So are the dash (-), the underscore (_), or other special characters treated as reserved characters in the search string?

Looking at search.log for the individual test cases, I noticed that the base lispy excludes the dash and underscore characters:

Test-2

INFO  UnifiedSearch - base lispy: [ AND mgmr sourcetype::gooddata uat*staging [ EQ index 286075 ] ]

Test-3

INFO  UnifiedSearch - base lispy: [ AND sourcetype::gooddata staging*mgmr uat [ EQ index 286075 ] ]

Test-4

INFO  UnifiedSearch - base lispy: [ AND sourcetype::gooddata uat*staging*mgmr [ EQ index 286075 ] ]

I would like to know if this is a known limitation or bug. Hopefully someone more familiar with this area can help explain this behavior.

1 Solution

woodcock
Esteemed Legend

You are probably running into this well-known problem:

http://blogs.splunk.com/2011/10/07/cannot-search-based-on-an-extracted-field/

The solution is to put this into fields.conf in the same directory that you have your field extractions (where props.conf is):

[app]
INDEXED_VALUE = false


cpride_splunk
Splunk Employee

Here is a more in-depth explanation, although it doesn't offer much in the way of a solution.

First we need to cover how we actually index the strings in raw, which we then use for searching the index.

For simplicity I'm going to reduce the examples above to just the app piece of the events and searches. So you have:

app="uat_staging-mgmr"
app=uat_staging-mgmr
app="uat_staging-mgmr"
app=uat_staging-mgmr
app=uat_staging-mgmr
app="uat_staging-mgmr"

So you have two families of values there: quoted and unquoted. How we break these up at index time depends on the configuration in segmenters.conf. There are a bunch of settings in there, but the main thing we care about is that there are MAJOR and MINOR segmenters at the top level.

When indexing, we first break things on the MAJOR segmenters and index the resulting tokens, then additionally break on the MINOR segmenters and index those tokens. For the example values, these are the segmenters that matter (there are more than just these; look at the file if you want to see the rest):

MAJOR = " %20 # This is using url encoding for characters that needed it.
MINOR = _ - =

There are really only two types of events, and as a result of the breakers we get these tokens:

 a. app="uat_staging-mgmr" 
   Tokens(with MAJOR): app= | uat_staging-mgmr 
   Tokens(with MINOR): app | uat | staging | mgmr
 b. app=uat_staging-mgmr 
   Tokens(with MAJOR): app=uat_staging-mgmr
   Tokens(with MINOR): app | uat | staging | mgmr

Note that the '"' characters affect how we break up the major tokens, because '"' is a major breaker. Also note that '=' is not a major breaker.
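
The two-pass breaking described above can be sketched in Python. This is only an illustration of the simplified description in this post, not Splunk's actual segmenter: the breaker lists are limited to the characters quoted above, and the names `split_on` and `index_tokens` are hypothetical.

```python
# Simplified model of index-time segmentation as described in this post.
# Assumption: only the breakers quoted above are modeled; the real
# segmenters.conf lists contain many more characters.
MAJOR = '" '   # double quote and space (%20)
MINOR = '_-='


def split_on(text, breakers):
    """Split text on any of the given breaker characters, dropping empties."""
    for b in breakers:
        text = text.replace(b, "\0")
    return [t for t in text.split("\0") if t]


def index_tokens(raw):
    """Return the set of tokens indexed for one raw string."""
    major = split_on(raw, MAJOR)          # first pass: MAJOR breakers only
    tokens = set(major)
    for tok in major:                     # second pass: MINOR breakers too
        tokens.update(split_on(tok, MINOR))
    return tokens


quoted = index_tokens('app="uat_staging-mgmr"')
unquoted = index_tokens('app=uat_staging-mgmr')
```

In this model the quoted event yields the major token `uat_staging-mgmr`, while the unquoted event yields only the longer major token `app=uat_staging-mgmr`. Both share the minor tokens `app`, `uat`, `staging` and `mgmr`.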

So on to how we treat our searches.

I'm going to skip over a lot of complications (transforms, lookups, calculated fields, indexed fields, eventtypes, etc.). As an abbreviated primer, search is done in a number of phases.
1. Search against index
- This has to use the tokens discussed above, so we need to adjust the search for that. This is also where LISPY applies.
2. ...
3. Field Extractions
4. ...
5. Post Filter
- This is where we actually assert the field value equivalence.
6. ...

So for these searches:

 1. app="uat*" 
 2. app="uat*staging-mgmr" 
 3. app="uat_staging*mgmr" 
 4. app="uat*staging*mgmr"

For the first phase we need to prepare the LISPY. For this purpose, every breaker that was MAJOR or MINOR at indexing time is treated as MAJOR at search time. Wildcards, by necessity, do not count as breakers (so that 'foo' is matched by 'f*o'). That results in the following LISPY strings:

 1. [AND uat* ]
 2. [AND mgmr uat*staging ] 
 3. [AND staging*mgmr uat ] 
 4. [AND uat*staging*mgmr ]
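
As a rough sketch of that rule, the following Python splits a search value on every breaker while leaving `*` alone. It assumes the same reduced breaker set as in this post, and `lispy_terms` is a hypothetical helper name, not a Splunk function.

```python
# Simplified model of how a field value becomes LISPY terms.
# Assumption: MAJOR and MINOR breakers from this post are all treated as
# breakers at search time; '*' is deliberately not in the list.
SEARCH_BREAKERS = '" _-='


def lispy_terms(value):
    """Split a search value on every breaker, keeping wildcards intact."""
    for b in SEARCH_BREAKERS:
        value = value.replace(b, "\0")
    return sorted(t for t in value.split("\0") if t)


searches = ["uat*", "uat*staging-mgmr", "uat_staging*mgmr", "uat*staging*mgmr"]
terms = {s: lispy_terms(s) for s in searches}
```

For the four searches this gives `['uat*']`, `['mgmr', 'uat*staging']`, `['staging*mgmr', 'uat']` and `['uat*staging*mgmr']`, which lines up with the terms in the LISPY strings above.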

Now comparing these to the events and the tokens above:

[AND uat* ]

This will match against all of the events because all of them have a 'uat' token.

[AND uat*staging mgmr] 

Here 'mgmr' matches the token from the minor breakers, but 'uat*staging' does not match any token, as no token starts with 'uat' and ends with 'staging'.

[AND uat staging*mgmr]

Here 'uat' matches the token from the minor breakers, but 'staging*mgmr' does not match any token, as no token starts with 'staging' and ends with 'mgmr'.

[AND uat*staging*mgmr]

Here 'uat*staging*mgmr' matches no minor tokens. Among the major tokens, it matches 'uat_staging-mgmr', which the major breakers produced for event (a) above. That is the quoted case, where a major breaker follows the '='. In the events with an unquoted value there is no major breaker after the '=', so the result is one large major token that begins with 'app' rather than 'uat', and it does not match.
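
Putting the pieces together, the sketch below models the index-level match for the six reduced events. It is a toy model under the assumptions of this post (reduced breaker lists, case-sensitive glob matching via `fnmatchcase`, hypothetical helper names), not a description of Splunk internals.

```python
from fnmatch import fnmatchcase

# Assumption: only the breakers quoted in this post are modeled.
MAJOR = '" '
MINOR = '_-='

# The six events, reduced to the app piece, in the order listed earlier.
EVENTS = [
    'app="uat_staging-mgmr"',
    'app=uat_staging-mgmr',
    'app="uat_staging-mgmr"',
    'app=uat_staging-mgmr',
    'app=uat_staging-mgmr',
    'app="uat_staging-mgmr"',
]


def split_on(text, breakers):
    """Split text on any of the given breaker characters, dropping empties."""
    for b in breakers:
        text = text.replace(b, "\0")
    return [t for t in text.split("\0") if t]


def index_tokens(raw):
    """Tokens indexed for an event: MAJOR pass, then MINOR pass."""
    major = split_on(raw, MAJOR)
    tokens = set(major)
    for tok in major:
        tokens.update(split_on(tok, MINOR))
    return tokens


def lispy_terms(value):
    """At search time every breaker splits the value; '*' does not."""
    return split_on(value, MAJOR + MINOR)


def index_match(raw, value):
    """An event matches if every LISPY term matches some indexed token."""
    tokens = index_tokens(raw)
    return all(
        any(fnmatchcase(tok, term) for tok in tokens)
        for term in lispy_terms(value)
    )


def count(value):
    """Number of events that survive the index-level match."""
    return sum(1 for e in EVENTS if index_match(e, value))


counts = [count(v) for v in
          ("uat*", "uat*staging-mgmr", "uat_staging*mgmr", "uat*staging*mgmr")]
```

With these assumptions `counts` comes out as 6, 0, 0 and 3, the same result counts reported for Test-1 through Test-4 in the question.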

So in summary:
- Wildcards across punctuation, particularly midfix wildcards, make matching searches much more complicated.
- As mentioned by @splunkIT, there are known defects tracking this. So far, however, we don't have a good solution that doesn't have an undesirable performance impact.
- The work-around suggested by @woodcock basically tells Splunk not to use this field's value in the LISPY and to rely on post filtering exclusively.
- One question that comes up in this case is why '=' is a minor breaker, since that is what causes the odd result in query (4). Some features benefit from having '=' as a minor breaker. In addition, the breakers have not changed in a long time, because changing breakers on an existing instance can cause significant correctness errors.
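
To illustrate what relying on post filtering exclusively means, here is a small Python sketch that skips the token match entirely and compares the wildcard pattern with the extracted field value. The extraction regex and the helper names are assumptions made for this example; they are not how Splunk's automatic key-value extraction is implemented.

```python
import re
from fnmatch import fnmatchcase

# The six events, reduced to the app piece.
EVENTS = [
    'app="uat_staging-mgmr"',
    'app=uat_staging-mgmr',
    'app="uat_staging-mgmr"',
    'app=uat_staging-mgmr',
    'app=uat_staging-mgmr',
    'app="uat_staging-mgmr"',
]

# Hypothetical stand-in for search-time key=value extraction of "app".
APP_RE = re.compile(r'\bapp="?([^"\s]+)"?')


def extract_app(raw):
    """Return the app value without surrounding quotes, or None."""
    m = APP_RE.search(raw)
    return m.group(1) if m else None


def post_filter_count(pattern):
    """Count events whose extracted app value matches the wildcard pattern."""
    total = 0
    for raw in EVENTS:
        value = extract_app(raw)
        if value is not None and fnmatchcase(value, pattern):
            total += 1
    return total


post_counts = [post_filter_count(p) for p in
               ("uat*", "uat*staging-mgmr", "uat_staging*mgmr", "uat*staging*mgmr")]
```

In this model all four patterns match all six events, because the comparison happens against the whole extracted value rather than against index tokens. The trade-off is that every candidate event has to be read and extracted before it can be filtered.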

Suggested rules of thumb for writing searches:
- Avoid midfix wildcards where possible.
- Prefix wildcards don't have this problem, but they are a performance problem.
- If you are trying to match punctuation, spell out all of the punctuation where possible rather than using wildcards to match it.

splunkIT
Splunk Employee

Thank you @cpride, for the eloquent explanation.


splunkIT
Splunk Employee

Thanks @woodcock. Setting "INDEXED_VALUE = false" appears to work, and I didn't have to do any regex extraction, since the raw data is already in kv pairs. By default "KV_MODE = auto" (props.conf), so that works out well.

Keep in mind that "INDEXED_VALUE = false" might negatively impact search performance in some cases, since the field value is now matched only after search-time field extraction.

FYI: There are two known defects pertaining to this problem, affecting Splunk versions 5.x and 6.x:
SPL-76801
SPL-109309
