Splunk Search

"OR" and subsearch, which one is more efficient?

Builder

I have a large set of logs and two sets of mutually exclusive criteria, one signifies beginning and progression of an "event", the other signifies the end of the event, linked by a unique identifier. I can use any of three approaches to extract the duration of the event:

  1. (Criteria1 OR Criteria2) | stats latest(_time) as end earliest(_time) as begin by EventId | eval duration=end-begin
  2. (Criteria1 OR Criteria2) | transaction EventId
  3. Criteria1 | stats earliest(_time) as begin by EventId | join EventId [search Criteria2 | stats latest(_time) as end by EventId] | eval duration=end-begin

Which method is more speedy/less taxing? Is there a speedier method?

I have learned to avoid transaction, so I used method 1 with Criteria1 alone, before discovering Criteria2, i.e., Criteria1 | stats latest(_time) as end earliest(_time) as begin by EventId | eval duration=end-begin. When I added Criteria2, the search slowed to a screech. That was when I started to contend with method 3. Because the observation that adding an "OR" criteria drastically slows down search is anecdotal and because apparent speed in Splunk is affected greatly by cache hit, I want to have a realistic assessment about efficiency. (Obviously I had done searches with only Criteria1 on the same data set before noticing slowdown when Criteria2 was added.)

Also note that the actual output is more than just duration - I want to preserve event start time, for example.

0 Karma
1 Solution

SplunkTrust
SplunkTrust

I think the true answer lies in the details of Criteria2, in that the "OR Criteria2" is making Splunk get many many more, or maybe all of the events off disk now. Can you give more details as to what expression(s) comprise Criteria 2?

The specific answer here, is that #1 is the fastest. Splunk will optimize this search so that it only extracts the field values for _time, EventId and whatever fields are referenced in Criteria1 and Criteria2. All other regexes and calculated fields and lookup fields are ignored, at least if it's being run by the scheduler or in the UI in "fast" or "smart" mode.

The transaction version on the other hand, preserves all the field values in the final transaction rows, so therefore it has to extract all possible field values.

However, depending on specifics the difference between #1 and #2 might not be all that great. Of course you can measure it yourself -- in "inspect job info", you will find a very exact runDuration number.

option 3 with the join command is terrible. The join command should only be a last resort where what you want to "join" cannot be expressed as a disjunction like (foo OR bar), optionally normalized and tweaked by eval commands, and then given to "stats count by someId someOtherId"*. Plus with join you have not only poor performance but also limits on the search in the square brackets, limits in terms of number of result rows as well as execution time. Join will truncate and/or finalize the search in the square brackets if it exceeds the limits.

*As an example of this, you might need to join the output of " | metadata type=sourcetypes", or of the rest command, to some other more ordinary results from indexed data, and there is no way to express that sort of thing as a disjunction. Or you might need to run some advanced command like mvexpand or xyseries on one half of the data but without it touching the other half at all.

View solution in original post

SplunkTrust
SplunkTrust

I think the true answer lies in the details of Criteria2, in that the "OR Criteria2" is making Splunk get many many more, or maybe all of the events off disk now. Can you give more details as to what expression(s) comprise Criteria 2?

The specific answer here, is that #1 is the fastest. Splunk will optimize this search so that it only extracts the field values for _time, EventId and whatever fields are referenced in Criteria1 and Criteria2. All other regexes and calculated fields and lookup fields are ignored, at least if it's being run by the scheduler or in the UI in "fast" or "smart" mode.

The transaction version on the other hand, preserves all the field values in the final transaction rows, so therefore it has to extract all possible field values.

However, depending on specifics the difference between #1 and #2 might not be all that great. Of course you can measure it yourself -- in "inspect job info", you will find a very exact runDuration number.

option 3 with the join command is terrible. The join command should only be a last resort where what you want to "join" cannot be expressed as a disjunction like (foo OR bar), optionally normalized and tweaked by eval commands, and then given to "stats count by someId someOtherId"*. Plus with join you have not only poor performance but also limits on the search in the square brackets, limits in terms of number of result rows as well as execution time. Join will truncate and/or finalize the search in the square brackets if it exceeds the limits.

*As an example of this, you might need to join the output of " | metadata type=sourcetypes", or of the rest command, to some other more ordinary results from indexed data, and there is no way to express that sort of thing as a disjunction. Or you might need to run some advanced command like mvexpand or xyseries on one half of the data but without it touching the other half at all.

View solution in original post

Builder

This is getting interesting. Criteria1 has the form field=value1 "literal string". Criteria2 has the form field=value2 "string * with wildcard". In the data set under test, fewer events match Criteria2. (field= is unnecessary when the string portion match; individually, field=value1 would return more events.)

Thanks @somesoni2 for the min-max suggestion.

0 Karma

SplunkTrust
SplunkTrust

If there are very few or zero breaker characters (think punctuation) in the "string * with wildcard", Criteria2 may very well be requiring that Splunk get practically all of the events off disk. What Splunk actually does to match wildcard searches is complex. I would troubleshoot by running a search for just Criteria1, then inside "Job > inspect job", find the two numbers eventCount and the scanCount. scanCount is the number of events that Splunk had to get off disk, whether or not they ended up matching. eventCount is just the number of events that actually matched the searchterms. Obviously on efficient searches scanCount will be the same or only slightly more than eventCount. I believe if you do the same test on Criteria2, you'll see a scanCount that is far far larger than eventCount. I would try adding more searchterms, either inside the parens with Criteria2, or just shared terms outside the OR expression, to help Splunk narrow the space down so it doesn't have to get the world off disk and then match against the wildcard string in memory.

Builder

ScanCount and eventCount are very useful tools! I think performance degradation after Criteria2 is due to shear event volume increase. Here are a few data points.
1. Criteria2 returns >10x more events than Criteria1 does.
2. Criteria1 has a scan to event overhead of ~100%, whereas Criteria2 has <1%. (When they are OR'd, overall overhead is <1%.)
3. Adding one = sign does not change overhead, nor does adding additional field selectors.

As I explained in the original problem, the goal is to detect delays. Only delayed events give logs matching Criteria1, but all events give a log matching Criteria2. In my data set, even though each delayed event can contribute to more than one Criteria1 log, the total is still small compared to total matching Criteria2. I cannot think of any way to narrow down the event-end log to select only delayed events - that's the whole point of using Criteria1. Now if I can use Criteria1 as a subsearch to Criteria2 as in the SQL sense, that might make search less painful.

0 Karma

SplunkTrust
SplunkTrust

OK. If I follow you on the overhead issue, with Criteria1, scanCount and eventCount are about the same? And with Criteria2, scanCount is 100 times larger than eventCount? There are two very different ways of interpreting the numbers you're giving, but I think that's the only one that explains how both criteria with an OR, could also have "overhead" <1%.

Assuming that's the case, and about my advice to add searchterms. I'm not saying add searchterms that specify only delayed events, just searchterms that allow it to narrow things down a little. Try things. For instance if you have more than one sourcetype in the index, just adding a sourcetype term would be good. Or adding another text token or two that you know will be in the delayed events even if they're also in the non-delayed events. Splunk does implement wildcards with some fairly cool phrase search functionality, but it sounds like you're in some uncommon set of conditions where it has no other option than just spooling every event out of the index and filtering in ram.

Subsearch - yes you might try that. If there are unique ID's or something to link the two together. That's exactly what subsearches are for, although there have to be less than I think 10,000 unique id's, and the smaller of the two searches needs to complete in less than some number of seconds like 30.

large painful search [ much smaller faster search | fields someUniqueId ]
0 Karma

Builder

Wow - I initially thought that long literals (no punctuation) will lessen the impact of wildcard. I'll do more inspections as you suggested. Maybe it's better to just use two literals broken down at the wildcard? I can possibly go with just one of the literals - just not enough confidence in how distinct are log messages.

0 Karma

SplunkTrust
SplunkTrust

IMO, join and transaction both very resource intensive operations and often not efficient at all. Option 1 is the most efficient way. Adding another criteria might have increase the execution time as it has to process more no of events (its an OR condition.

Try this another way to do the option 1. (_time is epoch and numeric operations may be faster).

(Criteria1 OR Criteria2) | stats max(_time) as end min(_time) as begin by EventId | eval duration=end-begin
0 Karma