Splunk Search

Slow search when evaluating a numeric value.

Communicator

Hi gurus,

We replacing DW / Olap project with splunk. We are in the process of comparing speed of queries which involves joining a search that does SOME_VALUE>50 . When we do this, splunk evaluates for example, a million of all records, evaluating every record and very slow.

In database world, we can setup a index table for a key field for faster evaluation. Is there anyway, we can do a similar trick in splunk to evaluate the speed of search based on value evaluation.

This is a critical performance issue that determines whether we can replace olap / DW projects.

I appreciate many of experts understand splunk limitation that could help us in addressing this performance issue.

Tags (2)

Legend

I think it is impossible to compare two different types of data storage by using the same queries. SQL and Splunk can both use the phrase "where x > 50", but they work in very different ways.

To get a real comparison, I would prepare a list of requirements that are relevant to your needs, but generic: "we need to identify all users who have exceeded their bandwidth terms of service." Then formulate the searches/queries in each tool.

My personal is experience is that Splunk is blazingly fast for some queries that were impossible for an RDBMS. But it can be slower for searches where you have a priori knowledge of the data and the structure - and both the data structure and the query structure are unchanging.

If you ask questions about specific queries (and provide some sample sanitized data), I am sure that the folks on this forum can give you some great optimization tips. For example, if you turn off "Field Discovery", your searches will probably run significantly faster. I often see a 3x improvement, but it is highly dependent on the data. It's certainly easy to try...

0 Karma

SplunkTrust
SplunkTrust

A sample of some input would help, but there are some things you can do to help speed this up.

In understanding this, it's important to note that Splunk typically does a "late binding" of field names to field values. This is entirely the opposite of a relational database, where all column names must be defined well in advance. What this means for you is that a field extraction that associates the name SOME_VALUE with its numerical value does not occur until after events have been loaded from disk and scanned for field extraction.

Given an event of the form:

07/26/2012 10:30:01 foo bar baz userid=bob height=69 weight=205

Splunk will find and index the tokens "07", "26", "2012", "10", "30", "01", "foo", "bar", "baz", "userid", "bob", "height", "69", "weight", "205". But the knowledge that height is a field name and the value is 69 is not stored at index time. It is recreated at search time by scanning the text of the the event and doing key-value extraction.

So, a search of height > 72 (to find particularly tall people) would have to do something similar to "search for everything, apply key-value extraction, and then post-process based on the fields discovered during extraction.

In the example above, modifying your search to something like height AND height > 72 gives Splunk a hint that the word "height" appears in the events you are interested in. This is a powerful hint that lets it reduce the "everything" down to "everything containing the word height", and then apply key-value extraction and finally post-process the numerical comparison. This hint is really only useful in the examples where the "height" field is of a (relatively) low density in your indexed data. If 99% of all of your events have the word "height", then this is a much less useful optimization trick.

Alternately, you can take the indexed fields route. This has the advantage of associating field name with field value in the index itself, as the cost of the flexibility of late binding. However, I don't know if the Splunk keyword index can efficiently do ranged searches over the value of an indexed field. ( This is where an RDBMS index may be different, because the columns in the index are sorted in the index data structure [usually a B+Tree] allowing the RDBMS to be able to do ranged scans of the indexed values ) Someone else with much more Splunk internal knowledge will need to comment on if this holds any water or not.