Why is index-time field extraction not searchable?

dottom
Path Finder

I'm double posting, original issue posted here: http://www.splunk.com/support/forum:SplunkGeneral/4378

When I use double-quotes in my index-time field extractions, the meta-data is not searchable. I've seen this problem on 4.0.11 and 4.1.3.

Sample text:

results=AA,BB,CC CC,DD

Transforms.conf without double-quotes:

REGEX = ^results=(.*?),(.*?),(.*?),(.+)$
FORMAT = key1::$1 key2::$2 key3::$3 key4::$4
WRITE_META = true

Transforms.conf with double-quotes:

REGEX = ^results=(.*?),(.*?),(.*?),(.+)$
FORMAT = key1::"$1" key2::"$2" key3::"$3" key4::"$4"
WRITE_META = true

Results:

If you use the first transforms.conf without the double-quotes, there are two problems:

  • The value for key3 (with a space) is not captured correctly. This is in the documentation which says to use double-quotes.

  • The fields extracted on 4.1.3 are incorrect for key4. Instead of having a field "key4" it has "CC key4". I don't recall seeing this behavior in 4.0.x.

However, if you use the second transforms.conf with the double-quotes:

  • The meta-data is not searchable, i.e. search for "key1=AA" fails.
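The difference between the two FORMAT lines can be illustrated outside Splunk. The following is a minimal Python sketch, not Splunk's actual parser: it applies the regex from the post to the sample event, expands both FORMAT strings, and then splits the unquoted result on whitespace. The whitespace tokenizer is an assumption made purely for illustration; it shows why an unquoted value containing a space is ambiguous, which may be related to the "CC key4" field name reported above.

```python
import re

# Regex and sample event copied from the post above.
REGEX = re.compile(r"^results=(.*?),(.*?),(.*?),(.+)$")
EVENT = "results=AA,BB,CC CC,DD"


def expand_format(fmt: str, event: str) -> str:
    """Substitute $1..$N in a FORMAT string with the regex capture groups."""
    match = REGEX.match(event)
    if match is None:
        return ""
    # Replace each $N with the corresponding capture group.
    return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), fmt)


unquoted = expand_format("key1::$1 key2::$2 key3::$3 key4::$4", EVENT)
quoted = expand_format('key1::"$1" key2::"$2" key3::"$3" key4::"$4"', EVENT)

# ASSUMPTION: a naive whitespace tokenizer, used only to show the ambiguity.
# It is not a description of how Splunk parses the metadata string.
naive_tokens = unquoted.split()

print(unquoted)      # the stray "CC" sits between key3 and key4
print(naive_tokens)
print(quoted)        # quotes keep "CC CC" together as one value
```

In the unquoted case the second "CC" is no longer attached to any key, whereas the quoted form keeps the whole value delimited.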


UPDATE 6/15/2010

Here are my conf files so you can replicate this issue. I also have a screenshot below.

inputs.conf:

[monitor:///var/log/test]
disabled = 0
sourcetype = mytest

props.conf:

[mytest]
TRANSFORMS-test = extract-fields

fields.conf:

[key1]
INDEXED = true

[key2]
INDEXED = true

[key3]
INDEXED = true

[key4]
INDEXED = true

transforms.conf:

[extract-fields]
REGEX = ^results=(.*?),(.*?),(.*?),(.+)$
FORMAT = key1::"$1" key2::"$2" key3::"$3" key4::"$4"
WRITE_META = true

screenshot:

In this screenshot, notice that the values are indeed extracted and show up in the search result. However, searching for "key1=AA" (or any other key=value) returns no results.

http://dottom.com/public/images/screenshot_8jd49x4d.png

welchatquietple
Engager

I'm experiencing exactly the same problem (with a similar setup for extracting an indexed field and then removing that text from _raw after indexing; and yes, I have INDEXED_VALUE=false). I am running 4.1.3.

  1. Double quotes in the transform (eg, FORMAT=fieldname::"$1") preserves extracted field values having a space, and I can see the correct values listed in the metadata under the event. But filtering on any of these values (eg, by clicking the value in the event's metadata, or choosing it from the field list on the left, both of which add fieldname="value" to the search) fails. It fails whether or not there's a space in the value.

  2. Removing the quotes from the transform (eg, FORMAT=fieldname::$1) makes the searching/filtering work as expected. But extracted field values that should include a space are instead truncated at the space.

  3. What I've noticed that goes beyond the discussion above is that in situation 1., if you include a * in the filtering term, eg, fieldname="value*", the search will succeed. I've not found a literal character I can put in that final position other than the * and have the search succeed.

And because it's customary at this point to be asked why one is indexing and modifying _raw: I'm wanting to associate additional metadata with logfile lines and other event text I'm streaming via TCP from a large number of sources (I want to record serial#, model#, and software version). If the ***SPLUNK*** header trick would work for custom indexed fields instead of only source, sourcetype, and host, I would put these values there and we'd be done. Instead, I append the metadata to each logline like this: ***META*** serial=ABCDE model=FGH version="1.1e" and I have a transform that removes ***META*** and everything after it after the indexing transform has been invoked. It would be wrong to leave the original line all mangled, so search time extraction is no good here.
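To make the intended split concrete, here is a rough Python sketch of what the two transforms are meant to achieve together: everything before ***META*** is kept as the event text, and the key=value pairs after it become metadata. This is only an illustration of the desired result, not the Splunk transform itself; the function name split_meta and the sample log text are hypothetical.

```python
import shlex

MARKER = "***META***"


def split_meta(line: str):
    """Split a log line into (original_text, metadata_dict) at the marker."""
    text, sep, tail = line.partition(MARKER)
    if not sep:
        # No marker present: the line carries no appended metadata.
        return line, {}
    fields = {}
    # shlex.split honours double quotes, so version="1.1e" yields 1.1e.
    for pair in shlex.split(tail):
        key, _, value = pair.partition("=")
        fields[key] = value
    return text.rstrip(), fields


# Hypothetical log line using the metadata layout described above.
line = 'sshd: session opened ***META*** serial=ABCDE model=FGH version="1.1e"'
print(split_meta(line))
```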

I saw mention elsewhere that the ***SPLUNK*** header feature had fallen out of favor and wasn't being tended to. It would be great if this limitation could be addressed, especially since the metadata would only need to appear once at the top of a logfile stream rather than being bolted onto each line.

Lowell
Super Champion

Looking at the screenshot, it appears that the raw event is not being modified. dottom, can you clear this up? Are you modifying the raw event after the transform that indexes your fields?

0 Karma

gkanapathy
Splunk Employee

Summary answer:

  • You do need double quotes around field values if the value might contain spaces.
  • You need to set INDEXED_VALUE = false for your indexed fields if the value is not in the raw event text. According to the config posted, it appeared as if the indexed values would appear in the raw text, since that is where they were being extracted from in the first place. However, a comment indicated that the raw text was subsequently being transformed to remove those field values. (This is why the fields needed to be indexed in the first place. If they had remained in the raw text, it would probably have been better to use search-time extractions instead.)

Update:

  • This is because if INDEXED_VALUE = true (the default), a search for key1="val1" or key1::val1 is treated as a search for ("val1" AND key1="val1"), i.e., the token val1 must occur in the raw text, and the field key1 must have the value val1. This is usually true for search-time fields.
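The rule in the update can be written as a toy model. The Python sketch below is an assumption-laden simplification, not Splunk's search implementation: the function matches and its crude tokenizer are hypothetical, and they encode only the two conditions described above (token present in the raw text, and field value equal).

```python
import re


def matches(raw: str, indexed_fields: dict, key: str, value: str,
            indexed_value: bool = True) -> bool:
    """Toy model of searching key=value against a single event."""
    field_ok = indexed_fields.get(key) == value
    if indexed_value:
        # ASSUMPTION: crude tokenizer that splits on non-word characters.
        tokens = re.split(r"[^A-Za-z0-9_]+", raw)
        # INDEXED_VALUE = true: the value must also occur as a token in _raw.
        return field_ok and value in tokens
    # INDEXED_VALUE = false: only the indexed field itself is checked.
    return field_ok


fields = {"key1": "AA"}
print(matches("results=AA,BB,CC CC,DD", fields, "key1", "AA"))   # value still in _raw
print(matches("masked event text", fields, "key1", "AA"))        # value removed from _raw
print(matches("masked event text", fields, "key1", "AA", indexed_value=False))
```

Under this model, a value that has been stripped from the raw text can only be found when the raw-token requirement is switched off.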

gkanapathy
Splunk Employee

So my speculation is that key1::value1 might have worked for the original questioner, but perhaps the GUI rewrote it to key1="value1", which would not. This could be tested with a CLI search.

0 Karma

gkanapathy
Splunk Employee

You're right. I never use :: for queries, and if you type it into the GUI on the dashboard, it changes it to =, so I've never noticed. Could be that it would work okay.

0 Karma

Lowell
Super Champion

Gerald, can you confirm that "::" and "=" behave the same way in 4.[01]? My understanding is that :: only operates on indexed fields in 4.x (which is different than it used to be.) For a simple test I tried the search "sourcetype=syslog pid=7482" vs "sourcetype=syslog pid::7482" The first search returned results whereas the 2nd does not. (Because pid is an extracted field and not an indexed field.)

0 Karma

gkanapathy
Splunk Employee

Please see my comment below. You need to set INDEXED_VALUE = false in fields.conf. This is so because you are apparently modifying the _raw field value. This wasn't mentioned above, but if I am understanding your comment below, that seems to be what is going on here.

0 Karma

Lowell
Super Champion

Is your literal search:

"key1=AA"  

Or, do you mean:

key1=AA

The first should fail because no such term (key1) exists in your actual raw event (based on the sample event you provided). However, the second should work if key1 is set up as an indexed field (INDEXED = true) in fields.conf.

You could try searching for your indexed field explicitly, like so:

 key1::AA

The :: will force 'key1' to be looked up via your indexed field and not using an extracted (search-time) field.

BTW, one useful tool I've found for tracking down indexed field issues is the walklex command line tool. You have to drill down into your index's hot bucket and point to one of your .tsidx files. (There's some guesswork and trial and error involved in finding the right file.) You can search a single .tsidx file for an indexed term (or an indexed field). Here is an example from my system looking for the date_hour indexed field:

walklex 1268486967-1266586961-302021.tsidx 'date_hour::*'

You may be able to use this approach to see if there is an index-level difference between how your indexed fields were stored in your index with previous versions versus now. If this turns out to be some kind of bug in Splunk, then this information could be quite valuable.

Another approach to debug indexed fields is to export some data from one of your buckets to a csv file using exporttool like:

exporttool /path/to/your/bucket /tmp/exportfile.csv -csv meta::all

You can then open up the exported file and review the "_meta" column and see how splunk is storing your indexed fields. Again, you can use this to compare before/after your most recent upgrade. (You can use a better search to export just the relevant events by simply replacing "meta::all" with a sourcetype search, for example.)
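Reviewing the "_meta" column can also be scripted. The following is a small Python sketch that assumes the exported CSV has a header row containing a column literally named _meta; that layout is an assumption and should be checked against the actual export. The sample rows are made up for illustration and are not real exporttool output.

```python
import csv
import io


def meta_values(csv_file, column="_meta"):
    """Return the non-empty values of the given column from an exported CSV."""
    reader = csv.DictReader(csv_file)
    return [row[column] for row in reader if row.get(column)]


# Hypothetical sample standing in for /tmp/exportfile.csv.
sample = io.StringIO('_time,_raw,_meta\n1,"event text","key1::AA key2::BB"\n')
for value in meta_values(sample):
    print(value)
```

For a real export, the same function can be called on a file opened with open("/tmp/exportfile.csv", newline="").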

Out of curiosity, what's the reason why you are using indexed fields instead of extracted fields?

0 Karma

gkanapathy
Splunk Employee

If you transform your raw data at index time to remove the field values, then no, they will never make it to the index. All transforms change the data before it is indexed.

0 Karma

dottom
Path Finder

Changing fields.conf to indexed = false fixes it for the second example which solves my problem. In my particular case, since I use a transform to perform data masking, I am not sure if the actual key=value pairs are written to the index or not. I will run an export to verify. The behavior of the first example in my original post is interesting though, because in theory it should not work if indexed=true for fields.conf?

0 Karma

gkanapathy
Splunk Employee

So, it's still not clear to me: You are modifying the _raw text that is indexed after you do the index-time field transforms, i.e., doing a transform on _raw after creating key1, key2, key3, etc? Is that right? If so, then you need to set INDEXED_VALUE = false for all your fields. Since the values of key1, key2, etc are no longer in the _raw text, and INDEXED_VALUE=true (the default) requires any field value to be in the raw, your searches will never return results. (Except maybe by luck.)

0 Karma

dottom
Path Finder

To answer your other question "why are you using indexed fields instead of extracted fields". I need to inject values to the original syslog message and have those meta-values searchable via key/value pairs. When a user runs a query, the user needs to see only the original syslog message, not the meta-data I've injected. For example, I want a user to be able to run a query "event=logon result=failure user=lowell" and it will pull up all failed logons across 100 different platforms. Note this works fine if I don't use quotes in transforms.conf, but only for values that don't have spaces.

0 Karma

dottom
Path Finder

Question edited. I couldn't add the image tag directly but have link to the screenshot.

As for search-time field extractions, by definition, I think a field extracted at search time would always be searchable. The bug here, I believe, is with index-time extraction only: the values are written and show up in the search results (and in the blue fields panel on the left-hand side), but they just aren't searchable.

0 Karma

Lowell
Super Champion

Please edit your question and include the corresponding props.conf entries. I'm wondering if you are really dealing with indexed fields or if you have search-time extracted fields. (Sometimes people don't understand the difference, and posting your props.conf will clear this up.)

0 Karma

dottom
Path Finder

That doesn't work either. Any combination of the following queries all fail when using double-quotes in transforms.conf to define the value of the index-time extraction:

key1=AA
key1="AA"
key1::AA
key1::"AA"

The values are clearly extracted as you can see the meta-data defined in the fields that show right below the text message.

0 Karma

parallaxed
Path Finder

Curious - do you have these keys defined in fields.conf? You shouldn't need the quotes in transforms.conf, I'm unsure what that is supposed to achieve, but I assume it works for you in earlier versions?

What does your props.conf look like?

0 Karma

Lowell
Super Champion

I agree it seems like it should be mentioned both places. Send a note to docs@splunk.com mentioning this. (I don't work for splunk so I can't do anything about it short of emailing this in myself.)

0 Karma

parallaxed
Path Finder

Thanks, Lowell. Since it is valid, can this::"$1" syntax (with quotes) appear in the spec for transforms.conf? It'd be good to make it clear in both places on the docs...

0 Karma

dottom
Path Finder

I do get search results if I do not put the backreferenced values in quotes. The problem I have is I want to use quotes because that's the correct way to capture the values (with spaces in them), but then I have the other problem I listed in the original post in that those fields extracted at index-time are not searchable.

0 Karma

Lowell
Super Champion

I agree that the docs can be sparse at times, but this one is documented. See http://www.splunk.com/base/Documentation/latest/Admin/Configureindex-timefieldextraction. So, yes you should be using quotes here.

0 Karma