Thanks for your detailed response.

@genesiusj wrote:
Q - "What size is your lookup - you may well be hitting the default limits defined (25MB)"
A - csv: 1 million records - 448,500 bytes // kvstore: 3 million records - 2,743.66 MB

That seems wrong - 1,000,000 records must be more than 448,500 bytes, as there has to be at least a line feed between rows, which alone would give you 1,000,000 bytes. Anyway, if the CSV is the origin dataset, then I don't think the lookup limit is going to be relevant, but are you doing something like

| inputlookup csv
| eval mod_addr=process_address...
| lookup my_kvstore addr as mod_addr output owner

The fact that this is all happening on the search head means that the SH will probably be screaming - what is the size of the SH, and have you checked its performance profile during the search?

Q - "What are you currently doing to be 'fuzzy' so your matches currently work, or are you really looking for exact matches somewhere in your data?"
A - I stripped off any non-numeric characters at the beginning of the address on the lookup and use that field for the "as" in my lookup command with my kvstore

| lookup my_kvstore addr as mod_addr output owner

(For reference, a rough sketch of that strip-then-lookup pipeline is at the end of this reply.) I have in the past done something similar using wine titles: I "normalised" the wine title by removing all stop words, all words <= 3 characters, and all numbers, then split it into a multivalue field, converted to lower case, sorted and joined. I did this in the base dataset (i.e. your KV store) and also in every wine I saw, so both sides of the match were normalised the same way. It is reasonably reliable - there is a sketch of that too at the end. However, that doesn't really solve your issue with the volume...

Q - "Also, if you are just looking at some exact match somewhere, then the KV store may benefit from using accelerated fields - that can speed up lookups against the KV store (if that's the way you're doing it) significantly."
A - Using the above code, the addr would be the accelerated field, correct?

Yes, and I have seen very good performance improvements with large data sets using accelerated_fields, so do this first (an example collections.conf stanza is at the end of this reply). If you have the option to boost the SH specs, that may benefit, but first check what sort of bottleneck you have on the SH.
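For what it's worth, here is a rough sketch of that strip-then-lookup pipeline. The field name process_address and the regex are assumptions on my part (adjust to whatever your CSV actually contains), and the triple-backtick comments are just annotations:

| inputlookup csv
| eval mod_addr=replace(process_address, "^[^0-9]+", "") ``` drop everything before the first digit, e.g. "No. 123 Main St" -> "123 Main St" ```
| lookup my_kvstore addr as mod_addr output owner ``` single exact-match lookup against the KV store ```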
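And here is roughly what the wine-title normalisation looked like. The field name title and the stop-word list are purely illustrative - the important part is that the same eval chain is applied to the lookup dataset and to the incoming data so the normalised keys line up:

| eval norm_title=lower(title) ``` lower-case the whole title ```
| eval norm_title=split(norm_title, " ") ``` break it into a multivalue field of words ```
| eval norm_title=mvfilter(len(norm_title) > 3 AND NOT match(norm_title, "^\d+$") AND NOT match(norm_title, "^(the|and|with|from)$")) ``` drop short words, pure numbers and stop words ```
| eval norm_title=mvjoin(mvsort(norm_title), " ") ``` sort the remaining words and join back into a single normalised key ```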
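To make the accelerated_fields suggestion concrete: the acceleration is defined against the collection in collections.conf on the search head. The stanza below is a sketch - the collection name is taken from your search and the acceleration name addr_accel is arbitrary:

[my_kvstore]
accelerated_fields.addr_accel = {"addr": 1}

Under the hood this builds an index on addr in the KV store's MongoDB backend, which is why exact-match lookups on that field get so much faster.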