I have JSON data that is indexed and searchable. Here is an example of the data:
Product: { [-]
BottleSizeMls: 750mls
BottleSizeName: Bottle
Id: 0
Notes: null
Title: MOSS WOOD Ribbon Vale Merlot, Margaret River 2013
Winery: null
}
I have four searches. The first three work and the last one does not:
Product.Title="MOSS WOOD Ribbon Vale Merlot, Margaret River*"
Product.Title="MOSS WOOD Ribbon Vale Merlot, Margaret River 2013"
Product.Title="MOSS WOOD Ribbon Vale Merlot, Margaret River*" Product.Title="*2013"
Product.Title="MOSS WOOD Ribbon Vale Merlot, Margaret River*2013"
I need the wildcard because the third-party data is inconsistent and sometimes has extra words before the year rather than just a single space.
The failure ONLY happens for one or two wines; the same kind of search works in 99.9% of cases.
I have checked the original JSON and there is only a single space in the source data.
Any thoughts on how to diagnose?
It has to do with the way Splunk stores the data in segments. By default, major segmenters include spaces.
A search like field=fu*bar would match events with fubar, fuBar, fubbbbbar, fu1234bar, etc., but does not match "fun at the bar". This is because the latter contains several major segmenters. To match "fun at the bar" with wildcards you'd need something like this:
field=fun* field=*at* field=*the* field=*bar
I seem to remember you can also circumvent this with the CASE() directive, like field=CASE("fu*bar"), but then you have to keep in mind that your search becomes case sensitive.
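To make the explanation above concrete, here is a toy model in Python. It is only a sketch of the idea as described in this answer, not how Splunk actually implements its index: it assumes values are split on spaces only, that each wildcard term must match one whole token, and that matching is case insensitive. The function names are made up for illustration.

```python
from fnmatch import fnmatchcase


def tokens(value):
    """Split a value on spaces, standing in for Splunk's major segmenters."""
    return value.lower().split(" ")


def term_matches(pattern, value):
    """True if the wildcard term matches at least one single token of the value."""
    pattern = pattern.lower()
    return any(fnmatchcase(tok, pattern) for tok in tokens(value))


def all_terms_match(patterns, value):
    """True if every wildcard term matches some token (terms are ANDed)."""
    return all(term_matches(p, value) for p in patterns)


if __name__ == "__main__":
    for candidate in ["fubar", "fuBar", "fu1234bar", "fun at the bar"]:
        print(candidate, term_matches("fu*bar", candidate))
    print(all_terms_match(["fun*", "*at*", "*the*", "*bar"], "fun at the bar"))
```

In this model the wildcard can never span a space, which is why the single term fails on "fun at the bar" while the four separate terms succeed.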
Really interesting, that would explain it. What is really strange is that, this morning, search #4 now works. Does that mean that when the data is first indexed and sits in a hot bucket, it can be stored differently from when it has rolled to another bucket?
I have seen this in a very few cases and, now that I think about it, it could be that the search failed when I had JUST indexed the data. Can the index or the segmentation change in any way a day later?
Believe me, I have been re-running search # 4 and it works every time... Every time I use Splunk the jigsaw gets bigger and another piece of the jigsaw needs to be fit 🙂
Both CASE() and TERM() work, but oddly enough, TERM with variant #2 above does not find it.
Yes absolutely. The data in a hot bucket is in a different format. When it rolls to warm, all sorts of wizardry occurs 🙂
Check this out for a more in depth exploration of the segmenters topic:
https://conf.splunk.com/files/2016/slides/fields-indexed-tokens-and-you.pdf
Do a stats count by Product.Title
and see if there are differences that you can see in those cases. If they are all the same, then you have something strange. If there are differences, then you should be able to discover the cause.
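If the titles look identical on screen, the difference may be in characters you cannot see. As a rough aid, and assuming you have copied or exported the title strings out of the search results, a small Python helper like the one below lists anything that is not plain printable ASCII. The function name is hypothetical, and it will also flag legitimate accented letters, so read the output with that in mind.

```python
import unicodedata


def odd_characters(title):
    """Return (index, Unicode name) for characters that are easy to miss by eye.

    Flags whitespace other than an ordinary space, non-printable characters,
    and anything outside ASCII (for example a no-break space).
    """
    found = []
    for index, ch in enumerate(title):
        if ch == " ":
            continue  # an ordinary single space is expected
        if ch.isspace() or not ch.isprintable() or ord(ch) > 127:
            found.append((index, unicodedata.name(ch, "UNKNOWN")))
    return found


if __name__ == "__main__":
    print(odd_characters("Margaret River 2013"))
    print(odd_characters("Margaret River\u00a02013"))
```

An empty list for every affected title would suggest the strings themselves are clean and the cause lies elsewhere.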
I downvoted this post because I upvoted the wrong post, sorry.
You can simply click on the up arrow to un-upvote. A downvote takes away karma and is generally a bad thing around here unless doing what the author suggests would harm someone's environment.
I upvoted to even out the karma here.
Sorry guys @cpetterborg, not really up on the voting business.
No problem. We're all just here to help out. I'm glad you got an answer to your question.
Thanks, @jkat54!
That's the point: there are no visible differences between any of the wines. Given that option 3 works, I can't figure out why Splunk is not finding the result in case 4, as they are essentially the same.
cpetterborg was trying to explain how to "discover" the segmenters issue. Usually people find it when they do a stats count by fieldName and get a lower count than when they look at the events alone, yet they know fieldName is present in 100% of the events.