How to collect performance statistics about search...

lfrit · ‎10-02-2017

I'm trying to collect performance information about search-time field extractions happening on different search-peers, but even if I can see the total search duration for the available peers (e.g. ) I didn't find yet anything specific for the field extraction process.

The only "close-enough" bits of information that seem to be related with what I'm looking for are:

| rest /services/search/jobs splunk_server=local summarize=false | fields label, performance.*field*.duration_secs

And more in particular:

performance.command.fields.duration_secs
performance.command.search.calcfields.duration_secs
performance.command.search.fieldalias.duration_secs
performance.dispatch.evaluate.fields.duration_secs

Nevertheless those don't seem to be officially documented and seem to only refer to something happening on the search-head.

The thing is... from time to time I've seen some warning messages like Field extractor name=blablabla is unusually slow (max single event time=1036ms, probes=422 warning max=1000ms) so Slunk is actually collecting information about the field extraction process, now the question is how can I retrieve them for my analysis? 🙂

DalJeanis · ‎10-02-2017

@Ifrit - okay, I guess you need some explicit context, in order to place what I was trying to tell you. I'm going to focus on the question, "How do I find and fix what is causing slow extractions and searches?"

1) If you consider the impact on the entire system of collecting data about how long each extraction takes, you will realize that this information rationally cannot be tracked at the per-rule level. This kind of tracking would slow down overall system performance by several percent -- I suspect in the 13-28% range -- while only giving value in very narrow circumstances.

2) If you have non-programmers creating field extractions using the GUI, or programmers who have not been taught how to think in regex, then there are going to be a lot of regexes that are just plain inefficient, because the interface understands the characters in front of it, but doesn't understand anything about "why". To help identify and solve this, you can just drill down on any long-running searches, and fix whatever is wrong with them. If it's a regex, great. If not, then you fix something else.

3) You COULD test extract time on a field-by-field basis for search-time extractions by setting an ad hoc query to fast and extracting only a single field. It would probably be more programmer-efficient to do a synthetic binary search - take the list of all the fields, divide it in half or thirds or quarters, run each portion, and select whichever one is "large" with respect to the others. Repeat until you identify whether there is an offending field that can be optimized.

4) Or, you could just grep to extract the full list of regexes and look for the condition that I told you about below. With a little practice, they are very easy to spot. (Multiple consecutive expressions with * that are not mutually exclusive.) Also, any regex that appears overcomplicated or difficult to read (if you can normally read regex) is suspect.

By the way, an @ message is just as easy to get my attention.

Okay, that duration is ridiculous for a single extraction, so you probably have a regex that is doing catastrophic backtracking.

Best to take that particular field extractor and post it on a new question to get feedback as to how to fix it.

Generally, this happens when you have multiple ways that the regex could match the same field, so when it FAILS to match, it has to try all the other ways to break up the field.

This often happens when you have multiple items next to each other that might match the same character.

.*[a-z]*.*[a-z]*.*(PlusSomethingThat doesnt'GetMatched!!!).*

In this case, there are five different chunks before the hard-to-match chunk that can each capture anywhere between zero characters and the entire input. The system has to try each of those combinations -- ((n+1)!/(4!(n-4)!) of them for n characters -- before it can quit, and it has to keep track of all the points that it might have to back up to, so it quickly runs out of little digital fingers to count on and hold bookmarks with.

In the case of the above example, an easy way to reduce the backtracking is to flip the middle .* to where it will not match [a-z]. Of course, if [^a-z] item is not required, then it shouldn't be in the pattern at all, so let's change that to a +. Since any number of [a-z] after the [^a-z] would also be matched by the .* that follows it, we can require only a single [a-z] and it has the same effect without backtracking. We can also do that to the [a-z]* before the [^a-z] as well, and it cleans up pretty nicely.

.*[a-z][^a-z]+[a-z].*(PlusSomethingThatDoesntGetMatched!!!).*

In practice, I'd probably make the first two .* lazy as well - .*? - so they will defer to the first [a-z][^a-z]+[a-z] and PlusSomethingThatDoesntGetMatched!!! that they find.

lfrit · ‎10-03-2017

I downvoted this post because off-topic imho.

lfrit · ‎10-03-2017

Thanks for your answer but I think it's a bit off-topic here, I'm not asking what's the reason behind a specific issue but how to assess the whole search-time extraction process to eventually focus on the issues in a targeted way.

The point is, I have tons of extraction rules coming from many different TA apps and I'd like to review the most problematic ones.

gjanders · ‎10-02-2017

Have you tested looking at the search.log file of a particular job ? To see if it has some more hints as to what might be happening here?

-
Alerts for Splunk Admins, Version Control for Splunk, Decrypt2 VersionControl For SplunkCloud

lfrit · ‎10-02-2017

Yep, I took a look at some search.log files but I didn't find anything useful on the topic.

gjanders · ‎10-02-2017

Oh ok, I was assuming if anything the data would appear on the search.log of the search peers for the specific job id. However I haven't looked into this area in depth and i have no access to my splunk instance for a few weeks 😞

-
Alerts for Splunk Admins, Version Control for Splunk, Decrypt2 VersionControl For SplunkCloud

How to collect performance statistics about search-time field extractions?

[Puzzles] Solve, Learn, Repeat: Dynamic formatting from XML events

Enter the Agentic Era with Splunk AI Assistant for SPL 1.4

Stronger Security with Federated Search for S3, GCP SQL & Australian Threat ...

Join the Conversation

How to collect performance statistics about search-time field extractions?

[Puzzles] Solve, Learn, Repeat: Dynamic formatting from XML events

Enter the Agentic Era with Splunk AI Assistant for SPL 1.4

Stronger Security with Federated Search for S3, GCP SQL & Australian Threat ...