I'm double posting, original issue posted here: http://www.splunk.com/support/forum:SplunkGeneral/4378
When I use double-quotes in my index-time field extractions, the meta-data is not searchable. I've seen this problem on 4.0.11 and 4.1.3.
Sample text:
results=AA,BB,CC CC,DD
Transforms.conf without double-quotes:
REGEX = ^results=(.*?),(.*?),(.*?),(.+)$
FORMAT = key1::$1 key2::$2 key3::$3 key4::$4
WRITE_META = true
Transforms.conf with double-quotes:
REGEX = ^results=(.*?),(.*?),(.*?),(.+)$
FORMAT = key1::"$1" key2::"$2" key3::"$3" key4::"$4"
WRITE_META = true
Results:
If you use the first transforms.conf without the double-quotes, there are two problems:
The value for key3 (with a space) is not captured correctly. This is in the documentation which says to use double-quotes.
The fields extracted on 4.1.3 are incorrect for key4. Instead of having a field "key4" it has "CC key4". I don't recall seeing this behavior in 4.0.x.
However, if you use the second transforms.conf with the double-quotes:
UPDATE 6/15/2010
Here are my conf files so you can replicate this issue. I also have a screenshot below.
inputs.conf:
[monitor:///var/log/test]
disabled = 0
sourcetype = mytest
props.conf:
[mytest]
TRANSFORMS-test = extract-fields
fields.conf:
[key1]
INDEXED = true
[key2]
INDEXED = true
[key3]
INDEXED = true
[key4]
INDEXED = true
transforms.conf:
[extract-fields]
REGEX = ^results=(.*?),(.*?),(.*?),(.+)$
FORMAT = key1::"$1" key2::"$2" key3::"$3" key4::"$4"
WRITE_META = true
screenshot:
In this screenshot, notice that the values are indeed extracted and show up in the search result. However, searching for "key1=AA" (or any other key=value) returns no results.
I'm experiencing exactly the same problem (with a similar setup for extracting an indexed field and then removing that text from _raw after indexing; and yes, I have INDEXED_VALUE=false). I am running 4.1.3.
1. Double quotes in the transform (e.g., FORMAT=fieldname::"$1") preserve extracted field values that contain a space, and I can see the correct values listed in the metadata under the event. But filtering on any of these values (e.g., by clicking the value in the event's metadata, or choosing it from the field list on the left, both of which add fieldname="value" to the search) fails. It fails whether or not there's a space in the value.
2. Removing the quotes from the transform (e.g., FORMAT=fieldname::$1) makes the searching/filtering work as expected. But extracted field values that should include a space are instead truncated at the space.
What I've noticed that goes beyond the discussion above is that in situation 1, if you include a * in the filtering term, e.g., fieldname="value*", the search will succeed. I've not found any literal character other than the * that I can put in that final position and still have the search succeed.
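The truncation at the space in the unquoted case can be illustrated with a small Python sketch. It applies the regex from the original post to the sample event, builds the FORMAT string, and then splits it on whitespace. The whitespace split is an assumption made for illustration (a toy model, not Splunk's actual tokenizer), and the names build_meta and naive_tokens are hypothetical.

```python
import re

# Regex and sample event taken from the original post.
REGEX = re.compile(r"^results=(.*?),(.*?),(.*?),(.+)$")
SAMPLE = "results=AA,BB,CC CC,DD"


def build_meta(event, quoted):
    """Mimic FORMAT = key1::$1 ... (or key1::"$1" ...) as a plain string."""
    groups = REGEX.match(event).groups()
    template = 'key{n}::"{v}"' if quoted else "key{n}::{v}"
    return " ".join(template.format(n=i, v=v) for i, v in enumerate(groups, start=1))


def naive_tokens(meta):
    """ASSUMPTION: split on whitespace only, ignoring quotes (toy model)."""
    return meta.split()


unquoted = build_meta(SAMPLE, quoted=False)
print(unquoted)                # key1::AA key2::BB key3::CC CC key4::DD
print(naive_tokens(unquoted))  # ['key1::AA', 'key2::BB', 'key3::CC', 'CC', 'key4::DD']
```

Under this assumption key3 keeps only the first "CC", and the stray second "CC" lands directly in front of key4, which is at least consistent with the "CC key4" field name reported for 4.1.3.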
And because it's customary at this point to be asked why one is indexing and modifying _raw: I'm wanting to associate additional metadata with logfile lines and other event text I'm streaming via TCP from a large number of sources (I want to record serial#, model#, and software version). If the ***SPLUNK*** header trick would work for custom indexed fields instead of only source, sourcetype, and host, I would put these values there and we'd be done. Instead, I append the metadata to each logline like this: ***META*** serial=ABCDE model=FGH version="1.1e" and I have a transform that removes ***META*** and everything after it after the indexing transform has been invoked. It would be wrong to leave the original line all mangled, so search time extraction is no good here.
I saw mention elsewhere that the ***SPLUNK*** header feature had fallen out of favor and wasn't being tended to. It would be great if this limitation could be addressed, especially since the metadata would only need to appear once at the top of a logfile stream rather than being bolted onto each line.
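For readers who want to see the ***META*** suffix convention concretely, here is a minimal Python sketch of splitting such a line into the original text and its key/value metadata. It only illustrates the data layout; it is not how Splunk processes the event. The function name split_meta and the log-line prefix are hypothetical; the metadata portion is the example from the post.

```python
import shlex

MARKER = "***META***"


def split_meta(line):
    """Return (original_text, metadata_dict) for a line with a ***META*** suffix."""
    text, sep, tail = line.partition(MARKER)
    if not sep:
        # No marker present: nothing to strip, no metadata.
        return line, {}
    # shlex.split honours double quotes, so version="1.1e" becomes version=1.1e.
    items = [item for item in shlex.split(tail) if "=" in item]
    meta = dict(item.split("=", 1) for item in items)
    return text.rstrip(), meta


# Hypothetical log line; only the part after the marker comes from the post.
line = 'Jun 15 10:00:00 host app: started ***META*** serial=ABCDE model=FGH version="1.1e"'
text, meta = split_meta(line)
print(text)  # Jun 15 10:00:00 host app: started
print(meta)  # {'serial': 'ABCDE', 'model': 'FGH', 'version': '1.1e'}
```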
By looking at the screenshot it appears that the raw event is not being modified. Dotom, can you clear this up? Are you modifying the raw event after the transform that indexes your fields?
Summary answer:
Set INDEXED_VALUE = false for your indexed fields if the value is not in the raw event text. According to the config posted, it appeared as if the indexed values would appear in the raw text, since that is where they were being extracted from in the first place. However, a comment indicated that the raw text was subsequently being transformed to remove those field values. (This is why the fields needed to be indexed in the first place. If they had remained in the raw text, it would probably have been better to use search-time extractions instead.)
Update:
If INDEXED_VALUE = true (the default), then a search for key1="val1" or key::val1 is treated as a search for "val1" AND key1="val1", i.e., the token val1 must occur in the raw text, and the field key1 must have the value val1. This is usually true for search-time fields.
So my speculation is that key1::value1 might have worked for the original questioner, but perhaps the GUI rewrote it to key="value1", which would not. This could be tested with a CLI search.
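The INDEXED_VALUE behaviour described in the summary answer above can be expressed as a toy model in Python. This is only a sketch of the rule as stated in this thread, not Splunk's implementation: the function name matches, the word-based tokenization of the raw text, and the masked event are all assumptions made for illustration.

```python
import re


def matches(raw, indexed_fields, key, value, indexed_value=True):
    """Toy model: does a search for key=value match this event?"""
    field_ok = indexed_fields.get(key) == value
    if not indexed_value:
        # INDEXED_VALUE = false: only the indexed field is consulted.
        return field_ok
    # INDEXED_VALUE = true (default): the value must also appear as a token in
    # the raw text. ASSUMPTION: tokens are runs of word characters.
    return field_ok and value in re.findall(r"\w+", raw)


fields = {"key1": "AA"}
original_raw = "results=AA,BB,CC CC,DD"
masked_raw = "results=masked"  # hypothetical _raw after a masking transform

print(matches(original_raw, fields, "key1", "AA"))                     # True
print(matches(masked_raw, fields, "key1", "AA"))                       # False
print(matches(masked_raw, fields, "key1", "AA", indexed_value=False))  # True
```

In this model the search only fails when the value has been removed from the raw text and INDEXED_VALUE is left at its default, which is the situation the answer is describing.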
You're right. I never use :: for queries, and if you type it into the GUI on the dashboard, it changes it to =, so I've never noticed. Could be that it would work okay.
Gerald, can you confirm that "::" and "=" behave the same way in 4.[01]? My understanding is that :: only operates on indexed fields in 4.x (which is different than it used to be.) For a simple test I tried the search "sourcetype=syslog pid=7482" vs "sourcetype=syslog pid::7482". The first search returned results whereas the second did not. (Because pid is an extracted field and not an indexed field.)
Please see my comment below. You need to set INDEXED_VALUE = false in fields.conf. This is so because you are apparently modifying the _raw field value. This wasn't mentioned above, but if I am understanding your comment below, that seems to be what is going on here.
Is your literal search:
"key1=AA"
Or, do you mean:
key1=AA
I ask because the first should fail, since such a term (key1) does not exist within your actual raw event (based on your provided sample event). However, the second should work if key1 is set up as an indexed field (INDEXED = true) in fields.conf.
You could try searching for your indexed field explicitly, like so:
key1::AA
The :: will force 'key1' to be looked up via your indexed field and not using an extracted (search-time) field.
BTW, one useful tool I've found for tracking down indexed field issues is the walklex command line tool. You have to drill down into your index's hot bucket and point to one of your .tsidx files. (There's some guesswork and trial and error involved in finding the right file.) You can search a single .tsidx file for an indexed term (or an indexed field). Here is an example from my system looking for the date_hour indexed field:
walklex 1268486967-1266586961-302021.tsidx 'date_hour::*'
You may be able to use this approach to see if there is an index-level difference between how your indexed fields were stored in your index with previous versions versus now. If this turns out to be some kind of bug in Splunk, then this information could be quite valuable.
Another approach to debugging indexed fields is to export some data from one of your buckets to a csv file using exporttool, like:
exporttool /path/to/your/bucket /tmp/exportfile.csv -csv meta::all
You can then open up the exported file and review the "_meta" column and see how splunk is storing your indexed fields. Again, you can use this to compare before/after your most recent upgrade. (You can use a better search to export just the relevant events by simply replacing "meta::all" with a sourcetype search, for example.)
Out of curiosity, what's the reason why you are using indexed fields instead of extracted fields?
If you transform your raw data at index time to remove the field values, then no, they will never make it to the index. All transforms change the data before it is indexed.
Changing fields.conf to indexed = false fixes it for the second example which solves my problem. In my particular case, since I use a transform to perform data masking, I am not sure if the actual key=value pairs are written to the index or not. I will run an export to verify. The behavior of the first example in my original post is interesting though, because in theory it should not work if indexed=true for fields.conf?
So, it's still not clear to me: You are modifying the _raw text that is indexed after you do the index-time field transforms, i.e., doing a transform on _raw after creating key1, key2, key3, etc? Is that right? If so, then you need to set INDEXED_VALUE = false for all your fields. Since the values of key1, key2, etc are no longer in the _raw text, and INDEXED_VALUE=true (the default) requires any field value to be in the raw, your searches will never return results. (Except maybe by luck.)
To answer your other question "why are you using indexed fields instead of extracted fields". I need to inject values to the original syslog message and have those meta-values searchable via key/value pairs. When a user runs a query, the user needs to see only the original syslog message, not the meta-data I've injected. For example, I want a user to be able to run a query "event=logon result=failure user=lowell" and it will pull up all failed logons across 100 different platforms. Note this works fine if I don't use quotes in transforms.conf, but only for values that don't have spaces.
Question edited. I couldn't add the image tag directly but have linked to the screenshot.
As for search-time field extractions, by definition, I think if you extracted the field at search time you would always be able to search for it. The bug here, I believe, is with index-time extraction only: the values are written and show up in search results (and in the blue fields panel on the left-hand side), they just aren't searchable.
Please edit your question and include the corresponding props.conf entries. I'm wondering if you are really dealing with indexed fields or if you have search-time extracted fields (sometimes people don't understand the difference, and posting your props.conf will clear this up.)
That doesn't work either. Any combination of the following queries all fail when using double-quotes in transforms.conf to define the value of the index-time extraction:
key1=AA
key1="AA"
key1::AA
key1::"AA"
The values are clearly extracted as you can see the meta-data defined in the fields that show right below the text message.
Curious - do you have these keys defined in fields.conf? You shouldn't need the quotes in transforms.conf, I'm unsure what that is supposed to achieve, but I assume it works for you in earlier versions?
What does your props.conf look like?
I agree it seems like it should be mentioned both places. Send a note to docs@splunk.com mentioning this. (I don't work for splunk so I can't do anything about it short of emailing this in myself.)
Thanks, Lowell. Since it is valid, can this::"$1" syntax (with quotes) appear in the spec for transforms.conf? It'd be good to make it clear in both places in the docs...
I do get search results if I do not put the backreferenced values in quotes. The problem I have is I want to use quotes because that's the correct way to capture the values (with spaces in them), but then I have the other problem I listed in the original post in that those fields extracted at index-time are not searchable.
I agree that the docs can be sparse at times, but this one is documented. See http://www.splunk.com/base/Documentation/latest/Admin/Configureindex-timefieldextraction. So, yes you should be using quotes here.