Hello all,
I am trying to understand the command type of the fields command.
The documentation says it is a "distributable streaming" command, which means it can run on the indexers, and that improves processing time.
If I use the fields command to specify fields that are extracted on the search head (using field discovery, for example), how can it still be considered distributable streaming?
If I am not mistaken, field extraction on the indexers is possible using the rex command or with indexed fields.
Thank you in advance!
Yup. As @richgalloway hinted at - Splunk uses a two-tiered search process.
Simplifying a bit (and not taking into account initial commands which run on SH), the search is initiated on SH. SH breaks it down into phase1 and phase2.
Phase1 is spawned from SH to indexers (let's not dig deeply into which indexers the search is spawned to; it's a topic for another time). The indexer(s) have a so-called knowledge bundle which contains search-time settings replicated from the SH (again - how it's happening is another topic). So the indexers know how fields are extracted. And they extract those fields if needed.
Phase1 contains only the initial event search and distributable streaming commands, because each indexer processes its own data independently and cannot rely on events held elsewhere. It ends in one of two ways: either the results are simply passed back to the SH for display (if there are no more commands in the search pipeline, or the next command is a centralized streaming one), or it ends with the map part of a transforming or dataset processing command, which prepares the results for aggregation by the SH.
Next the intermediate results are gathered by SH which performs phase2 of the search.
Phase2 can contain any type of command. Phase1 can contain only the initial search, distributable streaming commands, or the "prestats" part of a transforming or dataset processing command.
So every time you use a command that is not a distributable streaming command after the initial search, processing moves to the SH tier at that point and you lose the concurrent processing. That's why it's better to use the fields command than table, unless you're at the end of your search and want to format your data nicely for viewing 😉
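To make the two-phase split described above a bit more concrete, here is a toy Python model of it. This is only a sketch and not how Splunk is implemented: the event data, the field names, and the functions phase1 and phase2 are all made up for illustration. It shows a per-event trimming step and a partial count running separately on each "indexer", with the "search head" merging the partial results.

```python
from collections import Counter

# Hypothetical events held by two indexers; field names are invented.
INDEXER_A = [
    {"_raw": "GET /a 200", "status": "200", "host": "web1"},
    {"_raw": "GET /b 404", "status": "404", "host": "web1"},
]
INDEXER_B = [
    {"_raw": "GET /c 200", "status": "200", "host": "web2"},
]


def phase1(events, keep):
    """Toy stand-in for the work done on one indexer."""
    # Streaming step: handles one event at a time, so it never needs
    # to see events held by another indexer.
    trimmed = [{k: v for k, v in e.items() if k in keep} for e in events]
    # Partial aggregation, loosely analogous to the "prestats" part.
    return Counter(e["status"] for e in trimmed if "status" in e)


def phase2(partials):
    """Toy stand-in for the search head: merge the partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


partials = [phase1(INDEXER_A, {"status"}), phase1(INDEXER_B, {"status"})]
result = phase2(partials)
print(result)
```

Each call to phase1 touches only its own list of events, which is why that part can run in parallel on every indexer. Only the small partial counts travel to phase2.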
Here is a docs page that explains how those steps are performed and in what order: https://docs.splunk.com/Documentation/Splunk/latest/Knowledge/Searchtimeoperationssequence
You can see this for your own search after it has run by clicking the Jobs link -> Inspect Job and then opening search.log.
There are several .conf presentations and Splunk blogs on how to use this information.
The fields command is a distributable streaming command because it *can* run on indexers. That does not mean it cannot run on search heads.
It's also possible you are confusing "search-time" field extraction with something that only occurs on a search head. Indexers also perform search-time field extraction.
Hi @richgalloway , thank you.
1. By saying that indexers also perform search-time field extraction, do you mean that the "interesting fields" can be extracted by the indexers? My thinking is that those fields need to appear in at least 20% of the events, and this kind of calculation can be performed at the search head.
2. If I specify a field using the "fields" command (one that is not an indexed field or a field extracted with 'rex'), will it be extracted on the search head or on the indexers (assuming it qualifies as distributable streaming)?
Overall, I am trying to understand whether using the 'fields' command with a regular field (not an indexed one) affects the amount of data returned from the indexers to the search head.
Yes, the fields command can affect the amount of data returned to the SH. It's something I recommend to all of my customers as a way to improve search efficiency.
Just to be clear, the command does NOT extract any fields. It merely says which ones should be discarded (- option) or retained (+ option). Actual extractions are done by the rex command or the EXTRACT setting in props.conf.
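As a rough illustration of why retaining or discarding fields changes how much data travels to the SH, here is a small Python sketch. It is a toy model under stated assumptions: the event, the helper names fields_keep, fields_drop and payload_size, and the use of JSON size as a stand-in for network payload are all hypothetical. It also does not model how the real fields command treats internal fields such as _raw and _time.

```python
import json

# Hypothetical event as it might look after search-time extraction.
event = {
    "_raw": '{"user": "alice", "action": "login", "detail": "' + "x" * 500 + '"}',
    "_time": "2024-01-01T00:00:00",
    "user": "alice",
    "action": "login",
    "detail": "x" * 500,
}


def fields_keep(ev, names):
    """Toy analogue of retaining fields: keep only the named ones."""
    return {k: v for k, v in ev.items() if k in names}


def fields_drop(ev, names):
    """Toy analogue of discarding fields: remove the named ones."""
    return {k: v for k, v in ev.items() if k not in names}


def payload_size(ev):
    """Serialized size in bytes, used here as a proxy for bandwidth."""
    return len(json.dumps(ev).encode("utf-8"))


full = payload_size(event)
kept = payload_size(fields_keep(event, {"_time", "user", "action"}))
no_raw = payload_size(fields_drop(event, {"_raw"}))
print(full, kept, no_raw)
```

Nothing is extracted by either helper. They only decide what is passed on, which matches the point that the command filters fields rather than creating them.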
Hi @richgalloway , could you explain how it affects the amount of data returned from the indexers to the SH?
I know that the command says which fields should be discarded or retained, but if the field specified is not an indexed field and I didn't use "rex", meaning that I refer to an "interesting field" which, according to my understanding, is extracted on the SH, how can it affect the amount of data?
Unless you are saying that using the "fields" command indirectly tells the indexers to extract those fields.
The indexers extract fields from events as they are read from the index. As @PickleRick implied, the effort put into that extraction is determined by the search mode (Fast, Smart, or Verbose). Each extracted field takes up memory for processing and network bandwidth when it is sent to the SH. Using the fields command reduces the number of fields retained, so you save memory and bandwidth.
Indexers do not decide if a field is interesting or not - the SH does that.
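Here is a hedged Python sketch of why the "interesting" decision belongs on the SH. It assumes the 20% coverage rule mentioned earlier in this thread. The function interesting_fields and the sample events are hypothetical and are not Splunk code. The point is that coverage is a property of the whole result set, so a single indexer looking only at its own events cannot compute it.

```python
from collections import Counter


def interesting_fields(events, threshold=0.2):
    """Return field names present in at least `threshold` of the events."""
    if not events:
        return []
    # Count in how many events each field name occurs.
    counts = Counter(name for ev in events for name in ev)
    return sorted(
        name for name, seen in counts.items()
        if seen / len(events) >= threshold
    )


# Ten hypothetical events gathered from all indexers:
# "status" appears in all of them, "user" in only one.
events = [{"status": "200", "user": "alice"}]
events += [{"status": "200"} for _ in range(9)]

print(interesting_fields(events))
```

Here "user" was still extracted for the one event that has it. It just falls below the threshold once all events are counted together.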
Splunk decides which fields to extract based on search commands and whether you use fast or verbose mode.
So you can limit the amount of data processed even in verbose mode by removing some fields (but it's better to just not use verbose mode and explicitly specify interesting fields).
But the other important use case is that Splunk returns the _raw field (and other default fields, but this one is usually the most significant), which can be really memory-intensive, especially if you're dealing with huge JSON blobs or something similar.
And again - no, fields are not "extracted in SH". Fields are getting extracted at the very beginning of a search on an indexer (before other commands in the pipeline kick in).
Just because something is a search-time operation doesn't mean it happens on a SH.