For large lookups, should I use the kv store?

a212830 · ‎12-14-2015

Hi,

I am developing a dashboard and search that needs to utilize a large lookup file (75k lines) that gets generated every night using a DB Connect export. I am developing this in Splunk 6.3. Should I use the kv store for this, or standard lookups?

dcarmack_splunk · ‎12-15-2015

Is your concern search performance or size on disk?

- size on disk: you can include .gz file extension in your lookup population search and Splunk will compress it.

- search performance: if you are calling the lookup with the lookup command, you will want leave it as a flat file, since kvstore data is not replicated to the indexers, meaning you will lose the power of distribution and your search will be reduced back to the search head to preform the data enrichment.

The environment I came from, we successfully utilized many 500k line lookup tables that were 20+ columns wide.

s2_splunk · ‎12-15-2015

Minor comment: As of 6.3, kvstore lookups are distributed to search peers.

Lucas_K · ‎07-11-2016

Is this accurate as of 6.4? Is there any documentation on this at all?

We have kv stores in excess of 80GB per collection. If they are indeed distributed to search peers this would be an issue.

lakshman239 · ‎04-25-2017

The default setting in 6.3 is not to replicate - http://dev.splunk.com/view/SP-CAAAEZJ with 'replicate=false'

s2_splunk · ‎12-14-2015

The answers below are both good ones. Is there any reason why you cannot use a DB lookup and go straight against the database? Especially if your lookup data frequently changes, this may be the least involved approach and perform quite well if you can use indexed access to your DB table. Just wondering...

gcato · ‎12-14-2015

The only way you're probably ever really going to know is to try both CSV and KVSTORE for your lookup, then inspect the job's timing.

I suggest starting with CSV and then simply copying the CSV into a KVSTORE (though, there's no reason why you couldn't go the other way). The KVSTORE will require some more setup (fields_list in transforms etc.), however.

I've recently had some experience using KVSTORE (+200MB files and +500K lines). I was surprised when, initially, KVSTORE performance was worse then CSV for lookups (which was very poor also). However, KVSTORE did perform better when I started to using the accelerated_fields options on the lookup fields, in collections.conf. You may also want to define the field data types for some fields. For example, a quoted floating point field in my input data was stored as a string (I believe). Doing this means where field > num_value should work without having to do some tonumber evals etc.

Actually, another great feature in 6.3 is the inputlookup command's new where clause (as mentioned above, careful with number fields) to filter the data set. As, no doubt you're aware, the more you reduce the data at each pipe the more efficient and generally quicker the query will be.

As a comparison of the two SPL queries I ended up testing; using CVS lookups took ~91 seconds and KVSTORE ~53 seconds. In the end KVSTORE was still not really quick enough for what was wanted.

I'm also really keen to hear other peoples experience (tips or tricks) using CSV or KVSTORE for large lookups.

lguinn2 · ‎12-14-2015

While your lookup is pretty big for Splunk, it is not too big. When a lookup CSV file size is less than 10MB, it is loaded into an in-memory index, which is very fast. If you can afford the memory footprint and want to increase the size, you can set max_memtable_bytes in limits.conf.
When a lookup CSV file is larger than that, Splunk will create an index for the lookup file on disk. You will see the index file alongside the CSV file in the lookups directory. Every time that Splunk needs to access the lookup table, it examines the timestamp of the CSV file and the index file, and rebuilds the index file if needed.
A lookup file larger than 2GB can be problematic, if it becomes part of a "knowledge bundle," but I doubt that this will be an issue in your case.

So - what is the size of your lookup file - not the number of lines?

Also, if the CSV file is generated every night, the first search that triggers a rebuild of the index file will be slow.

All of the above is to explain "how it works" if you don't use the KV Store. However, rebuilding the KV store is also an undertaking if you do it every night. Must you do a complete replacement of the KV Store, or could you set things up so that you only had to add/update/delete the relevant KV store entries? I know the add/update/delete might require more sophisticated code, but otherwise you are deleting and rebuilding the entire KV store every night. I am not sure that is an improvement over the basic CSV functionality, so perhaps it would make more sense to use the CSV file after all...

a212830 · ‎12-15-2015

Thanks for the info. The file is created daily, and it's usually around 75mb.

lguinn2 · ‎12-16-2015

Also, based on @ssievert's comment on another answer in this thread: In "6.3, kvstore lookups are distributed to search peers": Since you are running 6.3, the 2GB knowledge bundle restriction should no longer apply.

For large lookups, should I use the kv store?

SOC4Kafka - New Kafka Connector Powered by OpenTelemetry

Your Voice Matters! Help Us Shape the New Splunk Lantern Experience

Building Momentum: Splunk Developer Program at .conf25

Are you a member of the Splunk Community?

For large lookups, should I use the kv store?

SOC4Kafka - New Kafka Connector Powered by OpenTelemetry

Your Voice Matters! Help Us Shape the New Splunk Lantern Experience

Building Momentum: Splunk Developer Program at .conf25