While searching for log messages that contain the string "URIBL_", I got far fewer hits in Splunk than by grepping the same log. I used the same daily log for both grep and Splunk.
Using grep:
grep -c URIBL_ /logs/an_input.log
18686
This was the search in Splunk:
index=mail source="/logs/an_input.log" uribl_
Instead of the 18,686 events it only returned 15.
I got the desired results by using the following search:
index=mail source="/logs/an_input.log" uribl_*
It correctly returns 18,686 events.
The search
index=mail source="/logs/an_input.log" uribl
returns the wanted results plus a few more, since the underscore no longer needs to be matched.
Among the fields in the log message are: URIBL_BLACK=1.5, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.5
I tried the search with the Field discovery turned on and off. The results were the same.
It looks like there is a difference in how Splunk reacts to the underscore in searches.
I had treated it as a regular letter or number and got the wrong results.
What does the underscore actually mean when it is used in a search, and how does it affect the search process?
The reason you get hits for "uribl" but not "uribl_" is that "_" is one of the characters Splunk considers to be a delimiter when dividing incoming data into individual segments to index. Basically, if you have an event that contains, say, the string "my_string_with_underscores", Splunk will create 5 segments out of this: "my", "string", "with", "underscores" and finally the whole string as well, "my_string_with_underscores". This way, if you search for "with", Splunk won't first have to retrieve ALL events and then do an equivalent to grep to see which ones have the string "with" in them. Instead it can just check which events have the segment "with" in them. This is way better explained in the docs: http://docs.splunk.com/Documentation/Splunk/5.0/Data/Abouteventsegmentation
Also, the documentation for segmenters.conf shows you which default values are used: http://docs.splunk.com/Documentation/Splunk/5.0/Admin/Segmentersconf
So if you search for "URIBL_BLACK" you will get results, because that is a major segment. If you search for "URIBL" you will get results as well, because it's a minor segment in that string. If you search for "URIBL_" you will not get results, because it's neither a major nor a minor segment: the delimiter is not included in the segment. I hope that clears things up at least a bit rather than add more to the confusion 🙂
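If it helps, here is a rough Python sketch of the idea described above. It is only a toy model, not Splunk's actual indexer: the breaker sets (`MAJOR_BREAKERS`, `MINOR_BREAKERS`) and the helper names (`index_terms`, `matches`) are hypothetical and were picked just to mirror the example in this answer. The real default breakers are listed in segmenters.conf.

```python
import re

# Hypothetical, simplified breaker sets chosen to mirror the example above.
# They are NOT Splunk's real defaults (see segmenters.conf for those).
MAJOR_BREAKERS = r"[\s,=]+"   # assumed: whitespace, comma, equals sign
MINOR_BREAKERS = r"_+"        # assumed: underscore only


def index_terms(event):
    """Return the set of lowercased segments this toy model would index."""
    terms = set()
    for major in re.split(MAJOR_BREAKERS, event):
        if not major:
            continue
        terms.add(major.lower())              # the whole token
        for minor in re.split(MINOR_BREAKERS, major):
            if minor:
                terms.add(minor.lower())      # pieces between delimiters
    return terms


def matches(search_term, event):
    """Check a single search term against the indexed segments."""
    terms = index_terms(event)
    search_term = search_term.lower()
    if search_term.endswith("*"):
        # A trailing wildcard is treated as a prefix match on segments.
        prefix = search_term[:-1]
        return any(t.startswith(prefix) for t in terms)
    return search_term in terms


event = "URIBL_BLACK=1.5, URIBL_DBL_SPAM=1.7"
print(sorted(index_terms("my_string_with_underscores")))
for term in ("URIBL_BLACK", "uribl", "uribl_", "uribl_*"):
    print(term, matches(term, event))
```

In this model "uribl_" finds nothing because no stored segment ends with the delimiter, while "uribl_*" works because the wildcard turns the lookup into a prefix match against whole tokens such as "uribl_black".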
Your answer filled in the missing pieces of the puzzle. No confusion added 🙂