Extracting fields from Source (Index Time v/s Search Time)

srikarmohan · 11-30-2021

Hello,

We are including the Pod Namespace and Pod Name in the Log Source (for K8s deployments) and would like these fields (Pod Namespace and Pod Name) to be extracted.

source: /var/lib/kubelet/pods/*/volumes/kubernetes.io~empty-dir/$(Volume Name)/$(POD_NS)/$(POD_NAME)/*.log
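As a rough illustration of what the extraction needs to capture, the path layout above can be matched with a regular expression. The sketch below is in Python purely to show the pattern. The field names `pod_ns` and `pod_name`, the sample path, and the assumption that the namespace and pod name are the two directory levels directly under the volume directory are all illustrative, not taken from any Splunk configuration.

```python
import re

# Assumption: the namespace and pod name are the two directory levels
# directly under the empty-dir volume directory, followed by the log file.
SOURCE_PATTERN = re.compile(
    r"/kubernetes\.io~empty-dir/[^/]+/(?P<pod_ns>[^/]+)/(?P<pod_name>[^/]+)/[^/]+\.log$"
)

def extract_pod_fields(source):
    """Return {'pod_ns': ..., 'pod_name': ...}, or None if the path does not match."""
    match = SOURCE_PATTERN.search(source)
    return match.groupdict() if match else None

# Hypothetical sample path following the layout in the question.
sample = (
    "/var/lib/kubelet/pods/1234-abcd/volumes/kubernetes.io~empty-dir/"
    "applogs/payments/checkout-7d9f/app.log"
)
print(extract_pod_fields(sample))
```

The same two capture groups would be needed whether the extraction runs at index time or at search time; only the place where the pattern is configured would differ.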

Most of our searches (including saved searches) will use at least one of these two fields, and often both, so we were wondering whether it is better, performance-wise, to do the field extractions at index time or at search time.

It looks like the general practice is to opt for search-time extraction; however, there may be cases where index-time extraction is preferred. The examples for using index-time extraction mentioned here (https://docs.splunk.com/Documentation/Splunk/8.2.3/Data/Configureindex-timefieldextraction) are not very clear. It seems like the first example might apply to our use case, so index time might be preferred?

Thanks,

Srikar

PickleRick · 11-30-2021

Well, it depends on the use case and your data characteristics. Remember that Splunk searches data quite differently from, for example, your typical RDBMS. It has its own indexes built from raw data split on delimiters, so (maybe oversimplifying a bit, but not much) if you search for a "field=value" term, it first looks up all the occurrences of "value" within the events and then checks which of those events parse so that the value ends up in the field called "field".
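The two-step lookup described above can be sketched as a toy model in Python. This is only an illustration of the idea, not how Splunk is actually implemented; the sample events, the tokenizer, and the function names are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical events; real Splunk events and tokenization are more involved.
events = [
    "user=alice action=login status=ok",
    "user=bob action=ok status=failed",
    "user=carol action=logout status=ok",
]

def tokenize(raw):
    # Split the raw text on whitespace and '=' to get index tokens.
    return [t for t in re.split(r"[\s=]+", raw) if t]

# Build a token -> event-id index from the raw text.
index = defaultdict(set)
for event_id, raw in enumerate(events):
    for token in tokenize(raw):
        index[token].add(event_id)

def extract_fields(raw):
    # Search-time extraction: parse key=value pairs out of the raw event.
    return dict(pair.split("=", 1) for pair in raw.split())

def search(field, value):
    # Step 1: find every event that contains the value anywhere.
    candidates = sorted(index.get(value, set()))
    # Step 2: keep only events where the value sits in the requested field.
    hits = [i for i in candidates if extract_fields(events[i]).get(field) == value]
    return candidates, hits

candidates, hits = search("status", "ok")
print(candidates, hits)
```

In this toy data, the event with `action=ok` is fetched and parsed as a candidate for `status=ok` and then discarded, because the value appears there under a different field.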

So if you have, for example, ten different fields, any of which can (and will) contain one of, let's say, ten values (repeated between those fields), you might benefit from indexed fields, because a lookup on the raw value alone would pull in many events where that value sits in a different field. There are also other tricks you might use to speed up work on big data sets, like accelerated datasets and accelerated reports.

There are, though, two advantages of indexed fields:

- you can run tstats on them, which means you can do some statistical searches very quickly

- you can add some metadata to the event that is not present in the event itself (for example, I do it on my forwarders to be able to quickly see which forwarder the event came from)

yuanliu · 12-03-2021

Like PickleRick says, it depends on both your use case AND data characteristics.

Most of our searches (including saved searches) will leverage both, if not at least one of the two, fields ... The examples for using index-time extraction mentioned here (https://docs.splunk.com/Documentation/Splunk/8.2.3/Data/Configureindex-timefieldextraction) are not very clear. It seems like the first example might apply to our use case, so index time might be preferred?

From the cited example: "if you typically search a large event set with expressions like foo!=bar or NOT foo=bar, and the field foo nearly always takes on the value bar."

Just because most searches involve the two fields does not mean they fit this example. The example asks three additional questions:

- Do most searches contain a negation of one or two of the "always on" fields POD_NS and POD_NAME? (i.e., POD_NS!=somespace and/or NOT POD_NAME=somename, etc.)

- Do these searches mostly operate on large sets of events?

- Do the negation(s) nearly always result in false? (i.e., nearly always POD_NS==somespace, and nearly always POD_NAME==somename.)

If the answer to any of the three questions is negative, that example doesn't apply.
