Solved: Search Performance: Does filtering on sourcetype, source or host make a search more efficient?

manus · ‎10-16-2014

My understanding is that filtering on index is necessary. Sometimes a search works without it and sometimes it doesn't, and I don't understand why, but that could be the subject of another question.

Beyond that, is it more efficient to filter on particular fields such as source, sourcetype or host rather than on others?

Here is a more specific question:
Let's consider two indexes I1 and I2 which contain exactly the same events except:
For all events in I2, sourcetype="useless"
For all events in I2, the field log_type has the value of the sourcetype of the corresponding event in I1.

Are both searches equally efficient?
S1: index=I1 sourcetype="foo"
S2: index=I2 log_type="foo"

Aside from replying to the question, I'm interested in any link describing search efficiency best practices.

jrodman · ‎11-03-2014

For the comparison of sourcetype=foo vs log_type=foo, it all depends on how log_type is defined. If log_type=foo can be inferred to be only a keyword in the text of the event, then for many cases it will be equivalently efficient. We can rule out buckets for both cases in the bloom filter.

However, for cases where 'foo' is present in your events but is not actually the value of log_type, sourcetype=foo will be more efficient, because we can retrieve only the events where sourcetype=foo from the index, while events that contain 'foo' but where log_type is not 'foo' will have to be post-filtered once the rules that determine the value of log_type have been applied.

If you have a wide variety of possible sources of log_type=foo, then the field based search will tend to diverge more in efficiency as compared to sourcetype=foo.

In short, use sourcetype, host, and source when they make sense, especially if it's a search that will run over significant data or will be saved and reused. However, the difference between searching on source/sourcetype/host and on a field is often not large enough to make mangling those fields for your use case worth it.

Typically, if that kind of mangling turned out to be worth it, it would be sufficient to use an indexed field in any event.
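
As a sketch of that last point, assuming log_type in I2 were configured as an indexed field (the question does not say that it is), the two forms below could be compared in the Job Inspector. The first is the search from the question; the second uses the field::value syntax, which matches only indexed fields and so returns nothing if log_type is extracted at search time:

```
index=I2 log_type="foo"
index=I2 log_type::foo
```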

manus · ‎11-27-2014

The test I did is in agreement with what you say.
I ran it on events where log_type is a substring of source.
When I filter on log_type (log_type=XXXX), it is around 10 times less efficient than when I filter on source (source=*XXXX*).

gfuente · ‎10-16-2014

Any filter that uses index, host, source or sourcetype will be very efficient, as this data is stored as metadata and Splunk finds it very fast. So yes, always include all those filters if possible.

If you use any other key-value pair to filter, then Splunk has to use the index to filter; the link provided by MuS details the performance impact of this kind of search.

Regards

EDIT: Additional info from docs:

Leverage indexed and default fields whenever you can to help search or filter your data efficiently. At index time, Splunk extracts a set of default fields that are common to each event; these fields include host, source, and sourcetype. Use these fields to filter your data as early as possible in the search so that processing is done on a minimum amount of data. For example, if you're building a report on web access errors, search for those specific errors before the reporting command:
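
The docs example itself is not reproduced above. As an illustration only, assuming web access data with the sourcetype access_combined and an extracted status field (both are assumptions, not taken from this thread), the first search below filters before the reporting command. The second produces the same kind of report but makes stats process every event before anything is discarded:

```
sourcetype=access_combined status>=400 | stats count by status

sourcetype=access_combined | stats count by status | where status>=400
```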

MuS · ‎10-16-2014

hi manus,

take a look at this http://answers.splunk.com/answers/172275/how-do-optimizations-for-field-based-searches-work.html

manus · ‎10-16-2014

Thanks MuS. This link taught me a lot I didn't know about search optimisation.
So how does it work for sourcetype, source and host? Are they in the keyword index like other AutoKV fields?

MuS · ‎10-16-2014

I would say yes ... but this is not an official statement. It is based on the above answer and these docs: http://docs.splunk.com/Documentation/Splunk/6.1.4/Search/Writebettersearches#Use_fields_in_your_sear...

If you're still not sure, post a comment to the above answer and I'm sure @jrodman can provide the correct answer 😉

marcoscala · ‎10-16-2014

Filtering on indexes is not actually necessary, but it can be useful. Maybe you have to specify "index=xxx" just because that index is not among your default indexes (see Settings - Access Control - Roles - and check the indexes available in "Indexes searched by default").
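
To see which indexes each role searches when no index= term is given, a search along these lines may help. It is only a sketch: it assumes your account is allowed to run the rest command against the roles endpoint, and that the output uses the authorize.conf setting names srchIndexesDefault and srchIndexesAllowed:

```
| rest /services/authorization/roles splunk_server=local
| table title srchIndexesDefault srchIndexesAllowed
```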

As for efficiency, you can always check your search performance in the Job Inspector.

Marco

manus · ‎10-16-2014

Thanks, I'll try to answer my question by looking at the Job Inspector.
