Solved: Regex Performance consideration

jeanyvesnolen · ‎03-28-2017

Hello all!

What is the best general approach for extracting all the fields of a log with regex?
I am talking about search-time extraction here.

Is it better for performance to write one big regex with all the capturing groups in a transform, like

<props>
[bar]
TRANSFORM-foo_props = foo

<Transform>
[foo]
SOURCE_KEY = _raw
REGEX = .+?\,(.+?)\,(.+?)\,(.+?)\,(\w{1,25}.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.*)
FORMAT = src::$1, user::$2, http_user_agent::$3 ........... custom_tag::$11

Or multiple lightweight regexes, like

<props>
[bar]
EXTRACT-foo_src = (?:.+?\,){1}(?P<src>.+?),
EXTRACT-foo_user = (?:.+?\,){2}(?P<user>.+?),
EXTRACT-foo_http_user = (?:.+?\,){3}(?P<http_user_agent>.+?),
....
EXTRACT-foo_custom_tag = (?:.+?\,){11}(?P<custom_tag>.+)

The regex is just an example; no need to comment on it!

Thank you

DalJeanis · ‎03-28-2017

No question, the first would be more efficient. Your "light" regexes have to match the same fixed pattern over and over again, which is overhead without any purpose.

I would suggest ([^,]+), as a building block instead of (.+?), as a general practice. Performance will probably be about the same when the event matches. When it does not match, backtracking will kill you: the engine goes back to each comma, absorbs it into the preceding (.+?), and tries again from that point, eventually backing up to the first group and reprocessing it repeatedly until it finally fails on the last comma. By telling the regex "commas can never be part of this group", it will fail once, politely, on a non-match.

I don't believe that escaping the comma is necessary in that regex, although it probably doesn't hurt anything.
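
As a rough sketch of that suggestion (not the asker's actual configuration), the single-pass version could be written with the negated character class. It assumes a search-time REPORT- extraction, and it assumes the first field starts at the beginning of the event, hence the ^ anchor, which is an addition of this sketch. Only the three field names spelled out in the question are used; the others are elided there and are left out here.

```ini
# props.conf
[bar]
REPORT-foo_props = foo

# transforms.conf
[foo]
SOURCE_KEY = _raw
# Skip the first field, then capture the next three.
# [^,]+ can never run past a comma, so a non-matching event fails quickly.
REGEX = ^[^,]+,([^,]+),([^,]+),([^,]+),
FORMAT = src::$1 user::$2 http_user_agent::$3
```

Note that [^,]+ requires each field to be non-empty; if empty fields can occur, [^,]* would be the equivalent building block.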

Jason · ‎10-25-2017

I'm assuming you are comparing EXTRACT- with REPORT- (not TRANSFORMS-) *

If you are using the same regex in either, see https://answers.splunk.com/answers/10945/performance-of-extract-vs-report-for-same-regex.html
If you are comparing one long regex with numerous short ones, I have not heard an answer to that.

Also, in this case if you are dealing with CSV or any delimited data, you can instead use FIELDS and DELIMS.

* as indexed vs search-time fields are another discussion, as are INDEXED_EXTRACTIONS for CSV.
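
For the delimited case, a minimal sketch of a FIELDS/DELIMS transform might look like the following. The stanza name foo_delims and the placeholder name for the first column (the one the regex in the question skips) are hypothetical, and the field list stops at the three names given in the question.

```ini
# props.conf
[bar]
REPORT-foo_delims = foo_delims

# transforms.conf
[foo_delims]
# Split each event on commas and name the values by position.
DELIMS = ","
# "first_column" is a hypothetical placeholder; the remaining names
# are elided in the question, so this list is incomplete.
FIELDS = "first_column", "src", "user", "http_user_agent"
```

This variant contains no REGEX at all, so the question of one long pattern versus many short ones does not come up.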

woodcock · ‎03-28-2017

There is no question that, if you were comparing apples to apples, the TRANSFORM would be more efficient because it is a one-pass, rather than a multiple-pass, solution. However, we are NOT comparing apples to apples in this case.

You can extract fields at search time or at index time. In general, search-time extractions are preferable to index-time extractions, all things being equal (take that statement loosely). Index-time extractions have many speed benefits, but they come at the cost of brittle configuration, a (sometimes very significant) increase in index size (disk space), and an inescapable workload overhead on the indexers for every event. With a search-time field extraction, you take a performance hit only when that data is searched, so it is a balance.

An extraction is index-time if it uses the TRANSFORMS- directive; it is search-time if it uses either the EXTRACT- or REPORT- directives; thus my apples-to-oranges assertion here.

The distinction in the UI of "uses transform" vs. "inline" doesn't have anything to do with search-time vs index-time. It is referring to where the regex itself is stored: in an EXTRACT- line in props.conf (for "inline") as opposed to in a REPORT- line that refers to a stanza in transforms.conf (for "uses transform").
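
To make the three directives concrete, here is a hedged sketch. The stanza names foo_search and foo_index are hypothetical, the regexes are shortened to one or two fields, and the three props.conf lines are shown side by side only for comparison, not as a configuration to deploy together.

```ini
# props.conf
[bar]
# Search-time, "inline": the regex lives in props.conf itself.
EXTRACT-foo_src = ^[^,]+,(?P<src>[^,]+),
# Search-time, "uses transform": the regex lives in a transforms.conf stanza.
REPORT-foo_fields = foo_search
# Index-time: applied to every event as it is indexed.
TRANSFORMS-foo_fields = foo_index

# transforms.conf
[foo_search]
REGEX = ^[^,]+,([^,]+),([^,]+),
FORMAT = src::$1 user::$2

[foo_index]
REGEX = ^[^,]+,([^,]+),
FORMAT = src::$1
# Index-time fields are written into the index metadata; they typically
# also need a matching fields.conf stanza with INDEXED = true.
WRITE_META = true
```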

richgalloway · ‎03-28-2017

Just my opinion, but I would expect the first option to be more efficient since it only scans the event once.

Why not try both, use the job inspector to view the performance numbers, and report back?

---
If this reply helps you, Karma would be appreciated.

jeanyvesnolen · ‎03-28-2017

I don't have time to try it right now.
I'll try it soon and keep you updated.
