Hello Splunkers,
I’m looking for the best algorithm to search for events that meet the criteria below.
I have a lookup with a single multi-valued field and about 10,000 rows. For example:
“vatsal, jagani”
“10.0.0.1, 10.0.0.2”
I want to search index=abc over the last 2 hours (about 50 events) and check whether at least two events (possibly more) contain words from the same set.
For example:
event-1 - “hello, I’m Vatsal.”
event-2 - “hello, I’m jagani too.”
Here, the two events contain matching words from the same lookup entry.
Another example:
event-3 - “hi, vatsal”
event-4 - “hello, vatsal”
This also counts as a match.
And I want to run this alert every hour.
Solution-1 - I could use the map command as below, but I don't think that's very efficient.
| inputlookup words_lookup.py
| eval or_field = <convert words to or list like "vatsal" OR "jagani">
| map max_count=1000000 "search index=abc $or_field$"
Solution-2 - I could write a Python script, but I'm not sure what algorithm to use.
I'm looking for a more efficient query or Python algorithm for this.
That's a tough problem. It's not a Splunk-tough problem but a generally tough problem.
In order to find the matches, you need to do the comparisons, and that's the biggest problem here. Since you don't have a fixed field to look up, but want to use the lookup as a list of patterns to match against your whole raw event (at least that's how I interpret your requirement), you have to do up to m*n "searches" against your data, where m is the number of your events and n is the number of distinct values in your lookup.
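To make the m*n cost concrete, here is a minimal Python sketch of the brute-force approach. The sample data and the function name `brute_force_matches` are made up for illustration. It assumes case-insensitive substring matching, which may or may not be what you need (a substring test would also match "10.0.0.1" inside "10.0.0.10").

```python
# Hypothetical sample data; in practice these would come from the lookup and the index.
lookup_sets = [
    ["vatsal", "jagani"],
    ["10.0.0.1", "10.0.0.2"],
]
events = [
    "hello, I'm Vatsal.",
    "hello, I'm jagani too.",
    "connection from 10.0.0.1",
]


def brute_force_matches(events, lookup_sets, min_events=2):
    """Return {set_index: [event_index, ...]} for every set hit by at least min_events events."""
    lowered = [text.lower() for text in events]
    hits = {}
    for set_idx, terms in enumerate(lookup_sets):
        for ev_idx, text in enumerate(lowered):
            # Every event is tested against every term until one matches:
            # this is the m*n part.
            if any(term.lower() in text for term in terms):
                hits.setdefault(set_idx, []).append(ev_idx)
    return {s: evs for s, evs in hits.items() if len(evs) >= min_events}


print(brute_force_matches(events, lookup_sets))
```

With 50 events and 10,000 rows this is a few hundred thousand substring tests per run, which is probably tolerable for an hourly alert but grows quickly if either side gets bigger.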
If you know you can split the events into separate words, that might make it a bit easier, because you don't have to match your raw event against terms from the lookup but can instead do a lookup with the words from the event (which could be marginally faster, since it's more probable that you'll match something before reaching the end of the lookup).
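A rough sketch of that word-based variant in Python, assuming events can be tokenized with a simple regex and that lookup terms are single tokens. The tokenizer pattern, the sample data and all function names are my assumptions, not anything Splunk provides.

```python
import re
from collections import defaultdict

# Hypothetical sample data standing in for the lookup rows and the indexed events.
lookup_sets = [
    ["vatsal", "jagani"],
    ["10.0.0.1", "10.0.0.2"],
]
events = [
    "hello, I'm Vatsal.",
    "hello, I'm jagani too.",
    "hi, vatsal",
    "login from 10.0.0.1",
]

# Assumption: a "word" is a run of letters, digits, underscores, dots or hyphens,
# so that IP addresses survive tokenization.
TOKEN_RE = re.compile(r"[\w.\-]+")


def build_index(lookup_sets):
    """Map each lowercased term to the indexes of the lookup rows containing it."""
    index = defaultdict(set)
    for set_idx, terms in enumerate(lookup_sets):
        for term in terms:
            index[term.strip().lower()].add(set_idx)
    return index


def tokenize(text):
    """Split an event into lowercased words, trimming trailing punctuation."""
    return {tok.strip(".-") for tok in TOKEN_RE.findall(text.lower())}


def sets_with_enough_events(events, index, min_events=2):
    """Return {set_index: [event_index, ...]} for sets seen in at least min_events events."""
    events_per_set = defaultdict(set)
    for ev_idx, text in enumerate(events):
        for tok in tokenize(text):
            # One dictionary lookup per word instead of a scan over the whole lookup.
            for set_idx in index.get(tok, ()):
                events_per_set[set_idx].add(ev_idx)
    return {s: sorted(evs) for s, evs in events_per_set.items() if len(evs) >= min_events}


index = build_index(lookup_sets)
print(sets_with_enough_events(events, index))
```

The trade-off is that this only finds whole-word matches. A lookup term that contains spaces, or that appears inside a longer word, would be missed.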
There are several possible approaches here but I'm not sure which one would be fastest given the size of your data. The more events you have to match, the more it's tempting to create something matching cleverly over a sorted list of your terms from the lookup.
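One possible reading of "matching over a sorted list", sketched in Python with the standard `bisect` module: keep (term, row) pairs sorted and binary-search each word from an event. The helper names and sample data are hypothetical, and a plain dictionary would likely do just as well for exact matches. A sorted list mainly pays off if you later want prefix or range matching.

```python
from bisect import bisect_left

# Hypothetical sample data standing in for the lookup rows.
lookup_sets = [
    ["vatsal", "jagani"],
    ["10.0.0.1", "10.0.0.2"],
]


def build_sorted_terms(lookup_sets):
    """Return a sorted list of (lowercased term, row index) pairs."""
    return sorted(
        (term.lower(), set_idx)
        for set_idx, terms in enumerate(lookup_sets)
        for term in terms
    )


def find_sets(word, sorted_pairs):
    """Return the row indexes whose terms equal word, using binary search."""
    word = word.lower()
    # (word, -1) sorts before any real (word, row) pair, so this finds the first candidate.
    i = bisect_left(sorted_pairs, (word, -1))
    found = []
    while i < len(sorted_pairs) and sorted_pairs[i][0] == word:
        found.append(sorted_pairs[i][1])
        i += 1
    return found


sorted_pairs = build_sorted_terms(lookup_sets)
print(find_sets("Vatsal", sorted_pairs))
print(find_sets("hello", sorted_pairs))
```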
(To make things a bit more complicated, remember that each "comparison" is not an atomic operation either: its cost depends on the length of the strings and on the match ratio.)
You are right, @PickleRick.
I'm guessing I'm left with the 2nd option: building a Python script inside a custom command. I need to spend some time on an algorithm that performs well. I'll experiment.