
Different values in a field result in different processing times?

asimagu
Builder

hey guys

I got an odd behavior today in Splunk.

When I ran: index=A sourcetype=A m=4 OR m=404 OR m=1233 the search was running for 30 minutes (there are lots of events involved)

but if I omit "m=4" the search only takes 2 minutes to run.

I do not understand why this is happening. m is a numerical field and I was not expecting any difference between my two searches, with m=4 and without m=4.

how do you explain this??


DalJeanis
SplunkTrust

Heh. You have discovered the wonder of "bloom filters". What Splunk is doing under the covers -- and I am being REALLY loose in the way I describe it -- is checking every page of data that has any of the values that you've listed.

Splunk speeds up its searches by only checking the pages/events that definitely have the key words that you are looking for. That is really efficient when the values you are looking for are rare. However, when you add m=4, the search engine is going to have to laboriously check any event that has the value 4 in any field. Those will be pretty common.
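To make that a bit more concrete, here is a toy Python sketch of the idea. It is not how Splunk is actually implemented, just an illustration under simple assumptions: raw events are split into tokens, an index maps each token to the events containing it, and a filter on a search-time field can only use the bare value (such as 4) to pick candidate events before the field is extracted and checked. The sample events and the names `tokenize`, `build_index`, `candidates` and `matches` are all made up for this example.

```python
import re
from collections import defaultdict

# Hypothetical raw events; only the field "m" matters to the search.
EVENTS = [
    "m=404 status=4 retries=4",
    "m=1233 status=2 retries=0",
    "m=17 status=4 retries=1",
    "m=99 code=4 retries=4",
    "m=404 status=1 retries=0",
]

def tokenize(raw):
    """Split a raw event into lowercase tokens on non-alphanumeric characters."""
    return {t for t in re.split(r"[^A-Za-z0-9]+", raw.lower()) if t}

def build_index(events):
    """Map each token to the set of event positions that contain it."""
    index = defaultdict(set)
    for pos, raw in enumerate(events):
        for tok in tokenize(raw):
            index[tok].add(pos)
    return index

def candidates(index, values):
    """Events that contain any searched value as a token, in ANY field."""
    hits = set()
    for v in values:
        hits |= index.get(v, set())
    return hits

def matches(events, cand, values):
    """Second pass: extract m from each candidate and keep the real matches."""
    out = []
    for pos in sorted(cand):
        m = re.search(r"\bm=(\w+)", events[pos])
        if m and m.group(1) in values:
            out.append(pos)
    return out

INDEX = build_index(EVENTS)
rare = candidates(INDEX, {"404", "1233"})
common = candidates(INDEX, {"4", "404", "1233"})
print(len(rare), len(common))                         # 3 5
print(matches(EVENTS, common, {"4", "404", "1233"}))  # [0, 1, 4]
```

In this toy data, adding the value 4 pulls every event into the candidate set because 4 shows up in other fields, yet the final result is unchanged: no event has m=4. The extra candidates are pure overhead, which is the kind of effect described above.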

Performance is going to be highly data dependent. When actual performance fails to match theory, believe the actual performance. In this case, if you don't have the ability to index the field m at index-time, then you are going to have to play around with different ways to get at it at search time.

Here's the first couple of things I'd try.

If the field m is not on all the records, then this could help...

index=A sourcetype=A m=* | fields ... list only the fields you want ... | search m=4 OR m=404 OR m=1233

If there is a unique key field (mykey) for the records you want, this might help...

index=A sourcetype=A [ search index=A sourcetype=A m=* | fields m mykey  | search m=4 OR m=404 OR m=1233 | table mykey]

What the above code does is, in the subsearch, search through the index and sourcetype, returning only the key and m. Then it checks to make sure m is one of the values you want, and then returns ONLY the key field. When it hits the end bracket ] of the subsearch, the selected results are implicitly returned as if the format command had been used. It will return a string that looks like this:

( ( mykey="first value" ) OR ( mykey="second value" ) OR .... OR ( mykey="last value" ) )

Assuming the keys are unique, that might cut a bunch of time off.
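If it helps to picture that implicit formatting step, here is a small Python sketch that builds a string of the same shape from a list of result rows. It only mimics the layout shown above and is not Splunk's own code. The function name `format_subsearch`, the sample rows, and the empty-input behaviour are assumptions made for illustration.

```python
def format_subsearch(rows):
    """Render subsearch rows as one OR-joined filter string.

    Each row is a dict of field -> value. Fields inside a row are joined
    with AND, rows are joined with OR, mimicking the string shown above.
    Values are not escaped; this is a sketch, not a full implementation.
    """
    clauses = []
    for row in rows:
        terms = " AND ".join(f'{field}="{value}"' for field, value in row.items())
        clauses.append(f"( {terms} )")
    if not clauses:
        # Assumption of this sketch: no rows means no filter text.
        return ""
    return "( " + " OR ".join(clauses) + " )"

rows = [{"mykey": "first value"}, {"mykey": "second value"}]
print(format_subsearch(rows))
# ( ( mykey="first value" ) OR ( mykey="second value" ) )
```

The outer search then behaves as if that string had been typed in place of the bracketed subsearch, so it filters on the key values rather than on m.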

There may be some other ways, but those two are the first shots that I'd take.


asimagu
Builder

Thanks for your answer @DalJeanis. I am still trying to understand your explanation.
You say "when you add m=4, the search engine is going to have to laboriously check any event that has the value 4 in any field", but I am telling Splunk to use m as the field, right?


asimagu
Builder

Also, I believe the field m is present in all the events. Would you recommend that I extract this field at index time, then?


Sukisen1981
Champion

If you run the search three times, each time with just one value of m, does m=4 still take the longest to execute? Could it be that most of the events are actually coming from m=4?

asimagu
Builder

No, it's not a matter of how many events come with m=4, as I don't have any events in the last 24h that show that value. Despite having no events with that value, I get this odd behaviour.
