
KVStore acceleration not working as expected?

ClubMed
Path Finder

This is an odd acceleration behavior that has us stumped. If any of you have worked with the Qualys Technology Add-on before, Qualys dumps its knowledge base into a CSV file, which we converted to a KV store with the following acceleration enabled in collections.conf. The knowledge base has approx. 137,000 rows of about 20 columns.

[qualys_kb_kvstore]
accelerated_fields.QID_accel = {"QID": 1}
replicate = true
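
For context, the lookup referenced in the searches below is defined against that collection in transforms.conf along these lines (a sketch; the real field list covers all ~20 knowledge base columns, only the two used here are shown):

[qualys_kb_kvstore]
external_type = kvstore
collection = qualys_kb_kvstore
fields_list = QID, PATCHABLE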


Then we ran the following query with lookup local=true and with local=false (the default). According to the Job Inspector there was no real difference between running the lookup on the search head vs. on the indexers. Without the lookup command, the query takes 3 seconds to complete over 17 million events. With the lookup added, it takes an extra 165 seconds, even with the acceleration turned on.

index=<removed> (sourcetype="qualys:hostDetection" OR sourcetype="qualys_vm_detection") "HOSTVULN" 
| fields _time HOST_ID QID 
| stats count by HOST_ID, QID 
| lookup qualys_kb_kvstore QID AS QID OUTPUTNEW PATCHABLE 
| where PATCHABLE="YES" 
| stats dc(HOST_ID) ```Number of patchable hosts!```
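
For reference, the local=true run only differed in how the lookup was invoked, roughly:

| lookup local=true qualys_kb_kvstore QID AS QID OUTPUTNEW PATCHABLE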



An idea I am going to try is to add PATCHABLE as another accelerated field and see if that changes anything. This change will require me to wait until tomorrow.

accelerated_fields.QID_accel = {"QID": 1, "PATCHABLE": 1}


Is there something we're missing that would stop the lookup from taking an extra 2-3 minutes?


bowesmana
SplunkTrust

Have you tried keeping the Qualys data as a plain CSV lookup rather than a KV store and using that? Does it exceed the CSV size threshold?

What's the row count after the stats? You're only doing the lookup on the aggregated HOST_ID/QID rows.

Have you tried this approach instead?

| stats count values(QID) as QID by HOST_ID
| lookup qualys_kb_kvstore QID AS QID OUTPUTNEW PATCHABLE 
| search PATCHABLE="YES" 
| stats dc(HOST_ID) ```Number of patchable hosts!```

which will reduce the stats output to one row per host and then do a multivalue lookup - all you care about is whether PATCHABLE = "YES" appears in any of the returned results, so changing where to search will match any MV value of YES
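
If you would rather keep a where clause, an MV-aware filter would look something like this (a sketch using mvfilter/mvcount):

| where mvcount(mvfilter(match(PATCHABLE, "YES"))) > 0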

I do recall having some poorly performing KV store searches some time ago, but we ended up moving away from the KV store anyway, because most of our lookups needed to be done on the indexers. Unless you replicate, the data is returned to the SH, and when a KV store is replicated, it ends up as a CSV on the indexers anyway.

ClubMed
Path Finder

Thanks for the reply! My team ended up opening an OnDemand Services request to look into this. Will report back.

To answer your questions one by one:

The irony is that it was originally a CSV, which we have since converted to a KV store for performance. The CSV file had become big, around 250 MB if I recall. This is the Qualys Knowledge Base we are talking about, which Qualys provides out of the box, so it is not something we can trim down to size as a CSV.

The row count after the first stats and before the lookup is 1,926,000. The fact that stats calculates this in 20 seconds is perfectly fine. The problem is that when the lookup is added, it puts on an extra 140 seconds or so according to the Job Inspector. The dc(HOST_ID) at the end works out to about 7,800 hosts.

Now for your suggested approach - good catch on the stats. That works great for this particular query, which only cares about PATCHABLE. In fact, the last stats can be changed from dc() to just count. Much faster at 50 seconds now!

At some point, however, we will need to provide the full vulnerability data pulled from the Qualys Knowledge Base through Splunk as the means of reporting for the engineering teams, and we will run into this problem of the lookup adding at least 140 seconds again.

Regarding the poor performance, when you say 'replicate', is this what you mean in collections.conf? Because the KV store is already replicated.

[qualys_kb_kvstore]
accelerated_fields.QID_accel = {"QID": 1}
replicate = true

I think ultimately we were under the impression that KV store replication to the indexers gives them a handy local copy, making the lookup really fast, a matter of seconds. Maybe we had the wrong expectation?

0 Karma

bowesmana
SplunkTrust

That's a good reduction with the stats values. I assume the distinct host count is significantly less than the 1.9m rows, so although you will have the same number of QIDs per host, the number of rows going through the lookup command will be reduced - it would be interesting to compare the Job Inspector details between the two searches.

As for KV store replication - a KV store on the search head is not a KV store on the indexer; instead a CSV is transferred to the indexers, so any accelerations in the KV store are lost and you are simply using CSV lookups on the indexers.

I am not sure how a 250MB CSV on the indexer will be handled - if it works the same way as on the SH, it exceeds the max_memtable_bytes value discussed in limits.conf

https://docs.splunk.com/Documentation/Splunk/9.2.0/Admin/Limitsconf#.5Blookup.5D

so I imagine it will then be "indexed" (as in a file system index) on disk.

If you are not running in Splunk Cloud, you may want to try a local CSV lookup and play around with the limits.conf settings for your app.
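
For example, something along these lines in your app's limits.conf would raise the in-memory threshold (a sketch; the value here is just a guess sized to cover a ~250MB file, and it is not tunable in Splunk Cloud):

[lookup]
max_memtable_bytes = 268435456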

If you have the time, you might also want to experiment with the eval lookup() function and a split CSV. For example, you could split the CSV into 10 x 25MB files to stay under the existing threshold and partition the QIDs into each lookup.

Then you could do some weird SPL like 

| eval p=partition_logic(QID)
| eval output_json=case(
 p=1, lookup("qid_partition_1.csv", json_object("QID", QID), json_array("field1", "field2")),
 p=2, lookup("qid_partition_2.csv", json_object("QID", QID), json_array("field1", "field2")),
 p=3, lookup("qid_partition_3.csv", json_object("QID", QID), json_array("field1", "field2")),
 p=4, lookup("qid_partition_4.csv", json_object("QID", QID), json_array("field1", "field2")),
 p=5, lookup("qid_partition_5.csv", json_object("QID", QID), json_array("field1", "field2")),
 p=6, lookup("qid_partition_6.csv", json_object("QID", QID), json_array("field1", "field2")),
 p=7, lookup("qid_partition_7.csv", json_object("QID", QID), json_array("field1", "field2")),
 p=8, lookup("qid_partition_8.csv", json_object("QID", QID), json_array("field1", "field2")),
 p=9, lookup("qid_partition_9.csv", json_object("QID", QID), json_array("field1", "field2")),
 p=10, lookup("qid_partition_10.csv", json_object("QID", QID), json_array("field1", "field2")))

Technically this could work, but whether you would see any improvement, I have no idea.
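
One more wrinkle with that approach: lookup() returns JSON, so you would still need to unpack the field afterwards, something like this (a sketch, assuming a version that has json_extract()):

| eval PATCHABLE=json_extract(output_json, "PATCHABLE")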

Was that 50 seconds using a local lookup or a standard lookup? If it was the normal (remote) lookup, then try local=true so it does use the local KV store.

ClubMed
Path Finder

Hi Bowesmana,

I want to say I really appreciate the level of detail you have provided; it will prove useful for future people who land here from Google searches.

So I actually found the problem:

We are on the Splunk Cloud product, so we use a series of API calls to populate objects in the cloud, since we do not have access to the directories on their EC2 instances.

I found that the Splunk App for Lookup File Editing API our CI/CD process uses may not support KV store acceleration options.

On GitHub, I had the following collections.conf in our master branch, so I incorrectly assumed this was what was running in production:

[qualys_kb_kvstore]
accelerated_fields.QID_accel = {"QID": 1, "PATCHABLE": 1}
replicate = true

I tested the Splunk KV store API against Splunk Cloud and discovered that the accelerated_fields and replicate settings were not there. Our CI/CD process was NOT applying the acceleration options.
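
For anyone else on Splunk Cloud wanting to check what is actually applied, a quick check from the search bar looks something like this (a sketch; substitute your app name, and | rest needs the appropriate capabilities):

| rest splunk_server=local /servicesNS/nobody/<your_app>/storage/collections/config/qualys_kb_kvstore
| fields title accelerated_fields.* replicate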

I used the Splunk KV store API directly to apply the acceleration options, and it worked. The lookup is now accelerated as expected.

The lookup time was cut from 165 seconds to just 20 seconds.
