Hello,
Is there a way to know which fields were extracted at index-time vs search-time?
Is there a search to run or something to look for in the logs?
Thanks
Unfortunately, knowing exactly where a field comes from can be quite difficult to track down.
When you look at the list of fields on the left-hand bar, an individual field could come from any of these:
Before structured data extractions you could generally assume that all of the (non-default) fields came from a search-time field extraction (or one of the other search-time methods listed above), but that's not always the case anymore.
So, all that to say, there's no "easy" answer. I think the best approach is to ask the question one field at a time. You can do that with tstats, because it searches the index directly and therefore completely ignores search-time extracted fields.
Let's say you suspect that foo is an indexed field, and that foo shows up with the value bar. First, set up a baseline search that shows how many times "foo" equals "bar" for whatever index and time range you're testing:
foo=bar | stats count
Now run the tstats version and see if you get the same results:
| tstats count where foo=bar
Or, another option (if tstats scares you; I had forgotten that this still works):
foo::bar | stats count
If the indexed version returns "0", then the field isn't indexed, so it must be auto-extracted or something. The point is that it's happening at search time, not at index time. If both searches return the same count, then you know that "foo" is always an indexed field. (If the numbers are slightly off, it either means that the field is only sometimes indexed or, more likely, that data moved between the time you ran the two searches. Try using a historic time range that doesn't go up until "now".)
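The decision logic above can be written down as a small Python sketch. The function name classify_field and the returned labels are hypothetical, chosen only for illustration; the two inputs are the counts you read off the baseline search and the tstats (or foo::bar) search.

```python
def classify_field(search_count: int, indexed_count: int) -> str:
    """Interpret the two counts from the comparison described above.

    search_count  -- count from the baseline search (foo=bar | stats count)
    indexed_count -- count from the index-only search
                     (| tstats count where foo=bar, or foo::bar | stats count)
    """
    if search_count == 0 and indexed_count == 0:
        # Nothing matched at all, so the test says nothing about the field.
        return "no matching events"
    if indexed_count == 0:
        # The index has no foo::bar term: the field is added at search time.
        return "search-time only"
    if indexed_count == search_count:
        return "always indexed"
    # Counts differ: either only some events carry the indexed field,
    # or data arrived or aged out between the two searches.
    return "sometimes indexed, or data changed between searches"


print(classify_field(1200, 0))
print(classify_field(1200, 1200))
print(classify_field(1200, 1150))
```

If you land in the last case, rerun both searches over a fixed historic time range before concluding anything.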
Here are a few other things you can look at when trying to determine if a field is indexed or not:
fields.conf: look for stanzas where INDEXED is true. (But this isn't a guarantee.)
walklex: use it to probe individual *.tsidx files in your buckets. (This is very low-level and very tedious unless you're a Splunk Ninja, but it's the ultimate source of truth.)
Search all your .conf files at once for the field name in question. This normally returns something relevant, unless your field name is also a commonly occurring term.
Here's an idea, based on Lowell's answer above. Create a list of fields from events ( | stats values(*) as * ) and feed it to map to test whether field::value works, implying it's at least a pseudo-indexed field.
index=youridx
| dedup 25 sourcetype
get some events, assuming 25 per sourcetype is enough to get all field names with an example
| head 100
assume there are only 4 sourcetypes in the index
| stats first(*) as *
make a list of all fields with their examples (one event, lots of fields)
| fields - date_* tag::*
get rid of some fields we don't care about testing
| transpose
turn those fields into events
| rename "row 1" as row
get rid of that space
| map maxsearches=20 search="search index=youridx $column$::$row$ | head 1 | eval indexed=\"$column$\" | table indexed"
run up to 20 searches (against the first 20 fields) that go back and search that index for name::value. If found, output an event and populate the field named "indexed" with the field tested. If name::value fails, 0 rows are output, so you end up with a list of likely indexed fields.
From my random dataset, it found index, linecount, sourcetype, tag, timeendpos, and timestartpos, but for some reason did not find punct or source, even though they contained no spaces. Perhaps this is due to slashes or other special characters that name::value doesn't like. name::"value" doesn't seem to work at all. So, YMMV.
You can now even do it more simply:
| tstats
[ search index=<your_index>
| stats first(*) as *
| transpose
| fields column
| search column!="index" column!="splunk_server" column!="splunk_server_group"
| mvcombine column
| eval expr="count(".mvjoin(column,"), count(").")"
| return $expr] where index=<your_index>
| transpose
| where 'row 1'>0
| rex field=column "count\((?<indexed_field_name>.*)\)"
| fields indexed_field_name
I came to this solution via this thread and https://community.splunk.com/t5/Splunk-Search/How-do-you-prevent-the-map-command-from-encapsulating-... . The 3 values for column needed to be excluded, otherwise tstats tells you it can't do that count function for those fields.
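To make the eval/mvjoin step in that search easier to follow, here is a small Python sketch of the same string-building logic. It only illustrates what the subsearch hands to tstats; the function name build_count_expr and the sample field names are hypothetical.

```python
# Fields that tstats cannot count, per the note above.
EXCLUDED = {"index", "splunk_server", "splunk_server_group"}


def build_count_expr(field_names):
    """Mirror eval expr="count(".mvjoin(column,"), count(").")" in Python."""
    kept = [name for name in field_names if name not in EXCLUDED]
    return "count(" + "), count(".join(kept) + ")"


print(build_count_expr(["host", "index", "c_ip", "splunk_server"]))
# prints: count(host), count(c_ip)
```

The resulting string is what gets spliced in after | tstats, so each candidate field gets its own count column, and the final where clause keeps only the fields whose count is greater than zero.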
Run your search in fast mode (not verbose, not smart). Whatever fields are present were added at index time (more or less).
Unfortunately, that's not correct. Most of the fields that show up are in fact indexed fields (except for splunk_server and linecount), but there are way more indexed fields than what is being shown (for example, date_*, timestartpos, punct, ...).
I just double-checked on a local 6.2 instance running on my laptop. I indexed some IIS data using the structured data extraction, and confirmed that "c_ip" didn't show up on the left, but if I searched for c_ip::1.1.1.1 it did in fact return matching records. (Tested with tstats too, same result.)
Fast mode is "fast" because it doesn't bother extracting fields you haven't requested. But Splunk still has to look at both the raw data and the index data to fetch every event. Compare that with something like tstats, which does a pure index-only search and never needs to pull in the raw data (and therefore search-time extractions are impossible to perform). Both approaches are faster than the other search modes, but they are very different under the covers.
Are the date_* fields index-time fields? Really?
Thanks for this complete answer. Using the walklex tool, I could confirm that I still have a bunch of fields that should not be extracted at index-time. I just need to find why now...
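As a starting point for the fields.conf check suggested in the answer, here is a minimal Python sketch that lists the stanzas whose INDEXED setting is true. It is an illustration under assumptions, not a Splunk tool: the function name indexed_fields and the sample content are hypothetical, only "true" and "1" are treated as true, and, as the answer notes, a match in fields.conf isn't a guarantee that the field is actually indexed.

```python
def indexed_fields(conf_text: str) -> list[str]:
    """Return stanza names in fields.conf-style text whose INDEXED setting is true."""
    found = []
    stanza = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            stanza = line[1:-1]  # a new stanza begins, e.g. [foo]
            continue
        if stanza is not None and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            # Assumption: only "true" and "1" count as true here.
            if key == "INDEXED" and value.lower() in ("true", "1"):
                found.append(stanza)
    return found


# Hypothetical fields.conf content, for illustration only.
sample = """
# fields.conf
[foo]
INDEXED = true

[bar]
INDEXED = false
"""

print(indexed_fields(sample))
# prints: ['foo']
```

To use it on a real file, read the file's text with open(path).read() and pass it in; the same loop could be pointed at every .conf file under your Splunk configuration directories.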
Hey, just found out that Splunk revived their field::value search syntax, which I think was disabled back in Splunk 5 but I guess made its way back into Splunk 6.
Index time: default fields (timestamp, punct, host, source, and sourcetype) + whatever you configure to be extracted during index time
Search time: whatever you configure via the GUI + props/transforms configured for search time extraction
Take a look at this too: http://docs.splunk.com/Documentation/Splunk/6.3.2/Indexer/Indextimeversussearchtime