Solved: Is there a way to know which fields were extracted...

pduflot · ‎01-07-2016

Hello,

Is there a way to know which fields were extracted at index-time vs search-time?
Is there a search to run or something to look for in the logs?

Thanks

Lowell · ‎01-07-2016

Unfortunately, knowing exactly where a field comes from can be quite difficult to track down.

When you look at the list of fields on the left-hand bar, an individual field could come from any of these:

Native (built-in) fields like _time/source/sourcetype/host
Pseudo indexed fields like index, _cd
Automatically indexed fields (punct, timestartpos, date_hour, ...)
Custom defined index-time fields.
Automatic (structured data) fields, used primarily for CSV/JSON/XML type sources
Special purpose run-time fields like "splunk_server", "eventtype", and "tag"
Auto extracted fields (key=value)
Custom defined field extractions (KV, delimited, custom regex)
Automatic lookups
Calculated fields (EVAL)
Field aliases
Possibly others, but I think that's a pretty exhaustive list.

Before structured data extractions you could generally assume that all of the (non-default) fields came from a search-time field extraction (or one of the other search-time methods listed above), but that's not always the case anymore.

So all that to say there's no "easy" answer. I think the best approach is to ask the question one field at a time. You can do that with tstats, because it searches the index directly and will therefore completely ignore search-time extracted fields.

Let's say you suspect that foo is an indexed field, and that foo shows up with the value of bar. So let's just set up a baseline search that will show us how many times "foo" equals "bar" for whatever index and time range you're testing.

foo=bar | stats count

Now run the tstats version and see if you get the same results:

| tstats count where foo=bar

Or, another option (if tstats scares you -- I had forgotten that this still works.):

foo::bar | stats count

If the tstats (or foo::bar) search returns "0", then the field isn't indexed, so it must be auto extracted or something... the point is that it's happening at search time, not at index time. If both searches return the same count, then you know that "foo" is always an indexed field. If the numbers are slightly off, it either means that the field is only sometimes indexed or, more likely, that data moved between the time you ran the two searches. (Try using a historic time range that doesn't go up until "now".)
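The comparison above boils down to a small decision rule. Here is a minimal Python sketch of it, assuming you have already copied the two counts out of Splunk by hand; the function name and the labels it returns are made up for illustration and are not anything Splunk provides.

```python
def classify_field(baseline_count, indexed_count):
    """Interpret the two counts from the searches above.

    baseline_count: result of `foo=bar | stats count`
    indexed_count:  result of `| tstats count where foo=bar`
                    (or `foo::bar | stats count`)
    The returned labels are hypothetical names for this sketch.
    """
    if baseline_count == 0:
        # Nothing matched at all, so the test says nothing about the field.
        return "no matching events"
    if indexed_count == 0:
        # The index never saw foo=bar, so the field is added at search time.
        return "search-time"
    if indexed_count == baseline_count:
        # Every matching event also has the value in the index.
        return "indexed"
    # Counts differ: only sometimes indexed, or data moved between the runs.
    return "mixed or data moved"


if __name__ == "__main__":
    print(classify_field(1200, 0))
    print(classify_field(1200, 1200))
    print(classify_field(1200, 1187))
```

If you land in the last case, rerun both searches over a fixed historic time range before concluding that the field is only sometimes indexed.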

Here are a few other things you can look at when trying to determine if a field is indexed or not:

Check fields.conf for stanzas where INDEXED is set to true. (But this isn't a guarantee.)
You could use walklex to probe individual *.tsidx files in your buckets. (This is very low-level and very tedious unless you're a Splunk Ninja, but it's the ultimate source of truth.)
Grep all your .conf files at once for the field name in question. Normally returns something relevant, unless your field name is also a commonly occurring term.
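The fields.conf check and the grep can both be scripted. The following is a rough Python sketch, not a Splunk tool: both function names are hypothetical, the parser only handles simple `[stanza]` / `key = value` lines, and treating only `true` and `1` as true is an assumption on my part. It also ignores Splunk's configuration layering, so `splunk btool` remains the better way to see what is actually in effect.

```python
import os
import re


def find_indexed_stanzas(conf_text):
    """Return stanza names in fields.conf text that set INDEXED to true.

    Assumption: only "true" and "1" are treated as true values here.
    """
    indexed = []
    stanza = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            stanza = line[1:-1]
            continue
        key, sep, value = line.partition("=")
        if sep and stanza is not None and key.strip() == "INDEXED":
            if value.strip().lower() in ("true", "1"):
                indexed.append(stanza)
    return indexed


def grep_conf_files(root, field_name):
    """Return (path, line number, line) for each .conf line naming the field."""
    pattern = re.compile(r"\b" + re.escape(field_name) + r"\b")
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(".conf"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if pattern.search(line):
                        hits.append((path, lineno, line.rstrip("\n")))
    return hits
```

Point `grep_conf_files` at whatever directory holds your apps' .conf files; as noted above, a field name that is also a common term will produce a lot of noise.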

Jason · ‎08-10-2016

Here's an idea, based on Lowell's answer above. Create a list of fields from events ( |stats values(*) as * ) and feed it to map to test whether field::value works - implying it's at least a pseudo-indexed field.

index=youridx 
| dedup 25 sourcetype

get some events, assuming 25 per sourcetype is enough to get all field names with an example

| head 100

assume there are only 4 sourcetypes in the index

| stats first(*) as *

make a list of all fields with their examples (one event, lots of fields)

| fields - date_* tag::*

get rid of some fields we don't care about testing

| transpose

turn those fields into events

| rename "row 1" as row

get rid of that space

|  map maxsearches=20 search="search index=youridx $column$::$row$ | head 1 | eval indexed=\"$column$\" | table indexed"

run up to 20 searches (against the first 20 fields) that go back and search that index for name::value - if found, output an event and populate the field named "indexed" with the field tested. If name::value fails, 0 rows are output, so you end up with a list of likely indexed fields.

From my random dataset, it found index linecount sourcetype tag timeendpos timestartpos, but for some reason did not find punct nor source even though they contained no spaces. Perhaps due to slashes or other special characters that name::value doesn't like. name::"value" doesn't seem to work at all. So, YMMV.

stratenh · ‎03-17-2023

You can now do it even more simply:

| tstats 
    [ search index=<your_index> 
    | stats first(*) as * 
    | transpose 
    | fields column 
    | search column!="index" column!="splunk_server" column!="splunk_server_group" 
    | mvcombine column 
    | eval expr="count(".mvjoin(column,"), count(").")" 
    | return $expr] where index=<your_index>
| transpose 
| where 'row 1'>0 
| rex field=column "count\((?<indexed_field_name>.*)\)" 
| fields indexed_field_name

I came to this solution via this thread and https://community.splunk.com/t5/Splunk-Search/How-do-you-prevent-the-map-command-from-encapsulating-... . The 3 values for column needed to be excluded, otherwise tstats tells you it can't do that count function for those fields.
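If the subsearch is hard to read, it may help to see what it hands to tstats. Here is a small Python sketch that builds the same kind of string from a plain list of field names; the function name is hypothetical, and it only constructs the search text without running anything against Splunk.

```python
# Fields that tstats refuses to count, per the note above.
EXCLUDED = {"index", "splunk_server", "splunk_server_group"}


def build_tstats_probe(field_names, index):
    """Build '| tstats count(a), count(b) where index=...' for the given fields.

    Mirrors the eval/mvjoin step in the subsearch: one count() per field,
    joined with ", ", after dropping the three excluded fields.
    """
    kept = [name for name in field_names if name not in EXCLUDED]
    if not kept:
        raise ValueError("no fields left to probe after exclusions")
    expr = ", ".join("count({})".format(name) for name in kept)
    return "| tstats {} where index={}".format(expr, index)


if __name__ == "__main__":
    print(build_tstats_probe(["host", "index", "c_ip", "punct"], "main"))
```

Any field whose count comes back greater than 0 in the tstats result is one the index knows about, which is what the trailing `where 'row 1'>0` filter keeps.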

woodcock · ‎01-08-2016

Run your search in fast mode (not verbose, not smart). Whatever fields are present were added at index time (more or less).

Lowell · ‎01-08-2016

Unfortunately, that's not correct. Most of the fields that show up are in fact indexed fields (except for splunk_server and linecount), but there are way more indexed fields than what is being shown (for example, date_*, timestartpos, punct, ...).

I just double checked on a local 6.2 instance running on my laptop. I indexed some IIS data using the structured data extraction, and confirmed that "c_ip" didn't show up on the left, but if I searched for c_ip::1.1.1.1 it did in fact return matching records. (Tested with tstats too, same result)

Fast mode is "fast" because it doesn't bother extracting fields you haven't requested. But Splunk still has to look at both the raw data and the index data to fetch every event. By contrast, something like tstats, which does a pure index-only search, never needs to pull in the raw data (and therefore search-time extractions are impossible to perform). Both approaches are faster than the other search modes, but they are very different under the covers.

lguinn2 · ‎01-06-2017

Are the date_* fields index-time fields? Really?

pduflot · ‎01-08-2016

Thanks for this complete answer. Using the walklex tool, I could confirm that I still have a bunch of fields that should not be extracted at index-time. I just need to find out why now...

Lowell · ‎01-08-2016

Hey, just found out that Splunk revived their field::value search syntax, which I think was disabled back in Splunk 5 but I guess made its way back into Splunk 6.

javiergn · ‎01-07-2016

Index time: default fields (timestamp, punct, host, source, and sourcetype) + whatever you configure to be extracted during index time
Search time: whatever you configure via the GUI + props/transforms configured for search time extraction

Take a look at this too: http://docs.splunk.com/Documentation/Splunk/6.3.2/Indexer/Indextimeversussearchtime
