For large lookups, should I use the kv store?

a212830
Champion

Hi,

I am developing a dashboard and search that needs to utilize a large lookup file (75k lines) that gets generated every night using a DB Connect export. I am developing this in Splunk 6.3. Should I use the kv store for this, or standard lookups?

dcarmack_splunk
Splunk Employee

Is your concern search performance or size on disk?

- size on disk: you can include a .gz file extension in your lookup population search and Splunk will compress it (sketch below).

- search performance: if you are calling the lookup with the lookup command, you will want to leave it as a flat file, since kvstore data is not replicated to the indexers, meaning you lose the power of distribution and the enrichment falls back to the search head to perform.

In the environment I came from, we successfully used many 500k-line lookup tables that were 20+ columns wide.
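
On the size-on-disk point, here is a minimal sketch of a lookup-population search that writes a compressed file; dbxquery assumes DB Connect 2+, and the connection name, SQL, and lookup file name are all hypothetical:

    | dbxquery connection="my_db" query="SELECT host, owner, status FROM assets"
    | outputlookup assets_lookup.csv.gz

If outputlookup honours the .gz extension as described above, the resulting file is gzip-compressed on disk but still usable by the lookup and inputlookup commands.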

s2_splunk
Splunk Employee

Minor comment: As of 6.3, kvstore lookups are distributed to search peers.

Lucas_K
Motivator

Is this accurate as of 6.4? Is there any documentation on this at all?

We have kv stores in excess of 80GB per collection. If they are indeed distributed to search peers, this would be an issue.

lakshman239
Influencer

The default setting in 6.3 is not to replicate ('replicate = false' in collections.conf) - see http://dev.splunk.com/view/SP-CAAAEZJ
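
For reference, a minimal collections.conf sketch with a hypothetical collection name:

    # collections.conf
    [assets_collection]
    # 6.3 default: keep the collection on the search head, do not bundle it out to peers
    replicate = false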

s2_splunk
Splunk Employee

The answers below are both good ones. Is there any reason why you cannot use a DB lookup and go straight against the database? Especially if your lookup data frequently changes, this may be the least involved approach and perform quite well if you can use indexed access to your DB table. Just wondering...

gcato
Contributor

The only way you're probably ever really going to know is to try both CSV and KVSTORE for your lookup, then inspect the job's timing.

I suggest starting with CSV and then simply copying the CSV into a KVSTORE (though there's no reason why you couldn't go the other way). The KVSTORE will require some more setup (fields_list in transforms.conf etc.), however; a sketch follows.
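
A minimal sketch of that setup, assuming a hypothetical collection big_collection and lookup name big_kv_lookup:

    # transforms.conf
    [big_kv_lookup]
    external_type = kvstore
    collection = big_collection
    # _key is included so individual records can be addressed later
    fields_list = _key, host, owner, status

Copying the existing CSV into the KVSTORE is then a one-off search along the lines of:

    | inputlookup big_csv_lookup.csv
    | outputlookup big_kv_lookup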

I've recently had some experience using KVSTORE (200MB+ files and 500K+ lines). I was surprised when, initially, KVSTORE performance was worse than CSV for lookups (which was very poor also). However, KVSTORE did perform better once I started using the accelerated_fields option on the lookup fields in collections.conf. You may also want to define the field data types for some fields (see the collections.conf sketch below). For example, a quoted floating-point field in my input data was stored as a string (I believe). Doing this means where field > num_value should work without having to do tonumber evals etc.
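
A sketch of those collections.conf options for the same hypothetical collection; the field names are illustrative:

    # collections.conf
    [big_collection]
    field.host = string
    field.price = number
    # accelerate the fields used for matching and where clauses
    accelerated_fields.host_accel = {"host": 1}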

Actually, another great feature in 6.3 is the inputlookup command's new where clause (as mentioned above, be careful with number fields) to filter the data set. As you're no doubt aware, the more you reduce the data at each pipe, the more efficient and generally quicker the query will be.
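
For example, a filtered read of the hypothetical KVSTORE lookup above might look like this (the numeric comparison assumes price is declared as a number in collections.conf):

    | inputlookup big_kv_lookup where price > 100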

As a comparison of the two SPL queries I ended up testing: the CSV lookup took ~91 seconds and the KVSTORE ~53 seconds. In the end, KVSTORE was still not really quick enough for what was wanted.

I'm also really keen to hear about other people's experiences (tips or tricks) using CSV or KVSTORE for large lookups.

lguinn2
Legend

While your lookup is pretty big for Splunk, it is not too big. When a lookup CSV file is smaller than 10MB, it is loaded into an in-memory index, which is very fast. If you can afford the memory footprint and want to raise that limit, you can set max_memtable_bytes in limits.conf (sketch below).

When a lookup CSV file is larger than that, Splunk will create an index for the lookup file on disk. You will see the index file alongside the CSV file in the lookups directory. Every time Splunk needs to access the lookup table, it compares the timestamps of the CSV file and the index file, and rebuilds the index file if needed.

A lookup file larger than 2GB can be problematic if it becomes part of a "knowledge bundle," but I doubt that this will be an issue in your case.
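
If you do want to raise that in-memory threshold, it is a single setting in limits.conf; the value below is only an illustration, not a recommendation:

    # limits.conf
    [lookup]
    # default is 10MB; CSV lookups larger than this get an on-disk index instead
    max_memtable_bytes = 26214400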

So - what is the size of your lookup file - not the number of lines?

Also, if the CSV file is generated every night, the first search that triggers a rebuild of the index file will be slow.

All of the above is to explain "how it works" if you don't use the KV Store. However, rebuilding the KV store is also an undertaking if you do it every night. Must you do a complete replacement of the KV Store, or could you set things up so that you only had to add/update/delete the relevant KV store entries? I know the add/update/delete might require more sophisticated code, but otherwise you are deleting and rebuilding the entire KV store every night. I am not sure that is an improvement over the basic CSV functionality, so perhaps it would make more sense to use the CSV file after all...
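
As a sketch of that add/update route, assuming DB Connect 2's dbxquery, a hypothetical KV store lookup big_kv_lookup whose fields_list includes _key, and a nightly SQL query that can be restricted to changed rows (the SQL itself is only illustrative):

    | dbxquery connection="my_db" query="SELECT host, owner, status FROM assets WHERE updated_at >= CURRENT_DATE - 1"
    | lookup big_kv_lookup host OUTPUT _key
    | outputlookup append=true big_kv_lookup

With append=true against a KV store lookup, rows that carry an existing _key should be updated in place and rows without one are inserted as new records, so the nightly job only touches the changed entries instead of rebuilding the whole collection.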

a212830
Champion

Thanks for the info. The file is created daily, and it's usually around 75MB.

lguinn2
Legend

Also, based on @ssievert's comment on another answer in this thread ("As of 6.3, kvstore lookups are distributed to search peers"): since you are running 6.3, the 2GB knowledge bundle restriction should no longer apply.
