Greetings, I apologize in advance for the long post.
Problem abstract: field discovery and extraction work great, but searching on the extracted fields gives weird results.
Input stream: single-line events made up of unordered keyword/value pairs. Each pair has format KEYWORD^VALUE, where "^" is the K=V separator. Pairs are delimited by 0x1F. One example event is below:
action_forunit^n/a\x1Faction_type^action_execute\x1Fresource_id^ELFVIEW\x1FappId^ERD\x1Fresource_currency^USD\x1FcorrId^0\x1FtimeStamp^1378492210757\x1FeventType^3000\x1Faction_foruser^InformDeveloper\x1Fhostname^rsomdavecs01\x1Fresource_amount^0.0\x1Faction_forcustomer^n/a\x1Faction_foraccount^n/a\x1Faction_forregion^n/a\x1Faction_forgroup^unknown\x1Fresource_info^n/a\x1Fresult_info^n/a\x1Faudit_level^1\x1Faction_info^Performing doGet() of the MainServlet\x1FcomponentId^ELFView\x1Fresource_name^ELF View Web Application\x1FsessionId^N/A\x1Fresource_idtype^product_code\x1Fresult_type^result_success\x1Fresource_type^resource_product\x1F
Note: The sequence "\x1F" above represents a single 0x1F byte, as rendered by Splunk search; I have verified that the actual byte is 0x1F using a hex dump.
Event breaking and timestamping are configured as follows:
default/props.conf
[ELFDATA]
NO_BINARY_CHECK=1
SHOULD_LINEMERGE=false
TIME_PREFIX=timeStamp\^
pulldown_type=1
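For reference, the timeStamp value is epoch milliseconds, so I'm relying on Splunk's automatic epoch recognition here. My understanding is that spelling it out explicitly would look something like the following (untested on my side, so treat it as a guess):
TIME_FORMAT = %s%3N
MAX_TIMESTAMP_LOOKAHEAD = 13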
Field extractions as follows:
local/props.conf
[ELFDATA]
REPORT-ELFKV = ELFKV
local/transforms.conf
[ELFKV]
CLEAN_KEYS = 1
FORMAT = $1::$2
MV_ADD = 0
REGEX = ([^\^]+?)\^([^\x1f]+?)[\x1f]
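For what it's worth, the pattern itself can be sanity-checked inline at search time (this only exercises the regex; rex can't reproduce the $1::$2 dynamic field naming the transform does, and elf_key/elf_value are throwaway names for the test):
sourcetype=ELFDATA | rex max_match=0 "(?<elf_key>[^\^]+?)\^(?<elf_value>[^\x1f]+?)\x1f"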
For the sample data above, the field "appId" is extracted and assigned the value "ERD". But a search using appId="ERD" returns no results. Also, certain wildcard searches (appId=*, appId=*ERD, appId="E*") work, while others (appId="ERD*", appId="ER*") don't.
Makes no sense that I can see.
Partial workaround by piping search results to a subsequent search, as:
<first search> | search appId="ERD"
Works (though I wish somebody would tell me why), but fails when a new search is generated, as in auto-drilldown.
Like I said, sorry for the length of the post. Brevity was never my strong suit.
R.Turk has mostly the right answer. To be very specific, it's because of the way Splunk tokenizes words that go into the index, and the way it searches on fields. Because your field values are not tokens, Splunk doesn't store or find the values in the index. I don't want to go too deep into tokenization, but basically Splunk only creates tokens (which is just what we call "words in the index") at certain character breaks, and it happens that with your data, under the default settings, whole stretches of each event get indexed as a single word/token.
A search on a field value first searches the index for that value as a token, then runs the field extractions, then validates that the extracted fields have the right values. Because your values aren't in the index as tokens, that first step returns nothing.
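You can see this for yourself, assuming your Splunk version is recent enough to ship the walklex command, by listing what's actually in the lexicon:
| walklex index=your_index type=term
| search term=*ERD*
| table term
(your_index is a placeholder.) With default segmentation you'll find ERD only embedded inside much larger terms, never as a standalone token, which is exactly why appId="ERD" can't match.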
You have two ways to fix this. The simple way is a fields.conf setting on your search head:
INDEXED_VALUE = false
This simply tells Splunk not to look the value up in the index, but instead to return all events that otherwise match, run the extractions, and then filter on the extracted values. Given how your data is tokenized, you're not going to find a much better option, so this will work for you. You could use INDEXED_VALUE = *<VALUE>* instead, but I don't believe it would behave or perform any differently.
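Concretely, for the appId field from your post, that's a per-field stanza in fields.conf on the search head (repeat for each extracted field you need to search on):
[appId]
INDEXED_VALUE = false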
You also have one rather risky, experimental way to fix this, which, to be honest, I'm not sure will work; if it does, though, it should perform a lot better:
create a custom stanza in a segmenters.conf file:
[my_custom_segmentation]
MAJOR = [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + ^ \x1f
MINOR = / : = @ . - $ # % \\ _
Basically, you're adding ^ and 0x1f as major breakers for your data. The questionable part: I don't know what the correct syntax for including 0x1f actually is, so I'm guessing; I'm not even sure it's possible.
set up your indexed data sourcetype to use it in props.conf, adding this to the stanza for the sourcetype on your indexers:
[my_sourcetype]
SEGMENTATION = my_custom_segmentation
Note that any changes to indexing properties (including either indexing fields or modifying index segmentation) would require data to be reindexed to have proper effect.
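In practice, reindexing means wiping the index and re-consuming the source. On a throwaway test index that's something like the following (elf_test is a made-up index name, and clean eventdata irreversibly deletes the indexed copy, so only run it where you can feed the data back in):
$SPLUNK_HOME/bin/splunk stop
$SPLUNK_HOME/bin/splunk clean eventdata -index elf_test
$SPLUNK_HOME/bin/splunk start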
kscher, short answer is that Splunk uses certain characters to decide what a word boundary is, and usually spaces and punctuation work fine. But when you have unusual characters separating words, the words don't get indexed in a way that lets a search find them. The custom segmentation simply tells Splunk how to split your data into words so they can be searched for and found.
Extraction works independently of this, because extraction happens after events are found; that's why it can appear to work even when the extracted field values can't be found in the index.
I need to +1 for Gerald's experimental answer here. A customer of mine ran into the same type of thing today with single-line events with unusual delimiters, and the custom segmentation fix worked great. THANK YOU.
I tried the custom segmentation fix and it worked like a charm. Great thanks to gkanpathy and R.Turk for their generous help. I'd still love more insight into the (to me) unpredictable search behavior that prompted the initial post.
If this works - leaving aside efficiency questions - it looks like it would also allow us to parse any one of the kv pairs without having to explicitly define the field in fields.conf, which would really help as there are dozens of them in these events. I'll post back with results and thanks to both of you for your generous help.
Thanks for elaborating - good to know I was on the right path.
Hi Kscher,
First I'll be honest... I did not read everything you wrote (sorry).
Right... with that out of the way, I believe you'll find your issue is with the non-standard event boundaries in your data. I had something similar with some field extractions I did a while back where I had some entries like this:
QBRILE9801
QBRMHE9831
QPTAAE2151
QWYNME9911
...
Where characters 2 through 5 would be the exchange_id field I needed to extract (e.g. BRIL, BRMH, PTAA, etc.).
"Regular expressions to the rescue!" I hear you say... well I said that too, except when I went to run searches & populate tables with my shiny new exchange_id
field, I was getting zero results, while getting inconsistent results when I started throwing in asterisks into the equation like a ninja would throw shuriken... but that makes for a sloppy ninja.
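For the record, the extraction itself was a plain search-time EXTRACT in props.conf, something along these lines (reconstructed from memory, and the sourcetype name is made up):
[my_exchange_sourcetype]
EXTRACT-exchange_id = ^.(?<exchange_id>[A-Z]{4})
The regex was never the problem; searching on its output was.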
So along comes fields.conf. I had never used it before, and to be honest with you I'm probably not 100% on its use, but when I made the following fields.conf...
[exchange_id]
INDEXED_VALUE = false
...it miraculously started working the way I expected.
So yeah... probably not much of an answer, but hopefully something to get you on the right track 🙂
Cheers & Beers,
RT
Have finally had a chance to test your solution, and it works just great. Thanks v much!
Since this is a miracle, I suppose we shouldn't expect to find out why it works, much less why we needed a miracle in the first place. But I sure would like to know.
Very grateful for this lead. I'll post back on results.