While your lookup is pretty big for Splunk, it is not too big. When a lookup CSV file size is less than 10MB, it is loaded into an in-memory index, which is very fast. If you can afford the memory footprint and want to increase the size, you can set
max_memtable_bytes in limits.conf.
When a lookup CSV file is larger than that, Splunk will create an index for the lookup file on disk. You will see the index file alongside the CSV file in the lookups directory. Every time that Splunk needs to access the lookup table, it examines the timestamp of the CSV file and the index file, and rebuilds the index file if needed.
A lookup file larger than 2GB can be problematic, if it becomes part of a "knowledge bundle," but I doubt that this will be an issue in your case.
So - what is the size of your lookup file - not the number of lines?
Also, if the CSV file is generated every night, the first search that triggers a rebuild of the index file will be slow.
All of the above is to explain "how it works" if you don't use the KV Store. However, rebuilding the KV store is also an undertaking if you do it every night. Must you do a complete replacement of the KV Store, or could you set things up so that you only had to add/update/delete the relevant KV store entries? I know the add/update/delete might require more sophisticated code, but otherwise you are deleting and rebuilding the entire KV store every night. I am not sure that is an improvement over the basic CSV functionality, so perhaps it would make more sense to use the CSV file after all...
Also, based on @ssievert's comment on another answer in this thread: In "6.3, kvstore lookups are distributed to search peers": Since you are running 6.3, the 2GB knowledge bundle restriction should no longer apply.
The only way you're probably ever really going to know is to try both CSV and KVSTORE for your lookup, then inspect the job's timing.
I suggest starting with CSV and then simply copying the CSV into a KVSTORE (though, there's no reason why you couldn't go the other way). The KVSTORE will require some more setup (fields_list in
transforms etc.), however.
I've recently had some experience using KVSTORE (+200MB files and +500K lines). I was surprised when, initially, KVSTORE performance was worse then CSV for lookups (which was very poor also). However, KVSTORE did perform better when I started to using the
accelerated_fields options on the lookup fields, in
collections.conf. You may also want to define the
field data types for some fields. For example, a quoted floating point field in my input data was stored as a string (I believe). Doing this means
where field > num_value should work without having to do some
tonumber evals etc.
Actually, another great feature in 6.3 is the
inputlookup command's new
where clause (as mentioned above, careful with number fields) to filter the data set. As, no doubt you're aware, the more you reduce the data at each pipe the more efficient and generally quicker the query will be.
As a comparison of the two SPL queries I ended up testing; using CVS lookups took ~91 seconds and KVSTORE ~53 seconds. In the end KVSTORE was still not really quick enough for what was wanted.
I'm also really keen to hear other peoples experience (tips or tricks) using CSV or KVSTORE for large lookups.
The answers below are both good ones. Is there any reason why you cannot use a DB lookup and go straight against the database? Especially if your lookup data frequently changes, this may be the least involved approach and perform quite well if you can use indexed access to your DB table. Just wondering...
Is your concern search performance or size on disk?
- size on disk: you can include .gz file extension in your lookup population search and Splunk will compress it.
- search performance: if you are calling the lookup with the lookup command, you will want leave it as a flat file, since kvstore data is not replicated to the indexers, meaning you will lose the power of distribution and your search will be reduced back to the search head to preform the data enrichment.
The environment I came from, we successfully utilized many 500k line lookup tables that were 20+ columns wide.
Is this accurate as of 6.4? Is there any documentation on this at all?
We have kv stores in excess of 80GB per collection. If they are indeed distributed to search peers this would be an issue.