Is it possible to store regex patterns in a lookup table so that they can be used in a search?
For example, let's say I have regexes like "(?<regex1>hello)" and "(?<regex2>world)". My actual regexes are not simple word matches.
I want to write another query that basically runs a bunch of regexes like
| rex field=data "regex1"
| rex field=data "regex2"
etc
Is it possible to use the subsearch to extract the regexes and then use them as commands in the main query?
I was trying something like
| makeresults 1 | eval data="Hello world" [| inputlookup regex.csv | streamstats count | strcat "| rex field=data \"" regex "\"" as regexstring | table regexstring | mvcombine regexstring]
so that the subsearch outputs the following
| rex field=data "(?<regex1>hello)"
| rex field=data "(?<regex2>world)"
I can get this working to an extent. One thing to note first: subsearch output is interpolated into the outer search as search terms (field=value expressions), not as executable pipeline commands, so generating `| rex ...` strings from a subsearch won't run as written. Not sure if this method will exactly fit your use case, but I'll leave it here for you.
So with a lookup named "test_regex_lookup.csv"
pattern_type | regex
date | \d{2}\/\d{2}\/\d{4}
SSN | \d{3}\-\d{2}\-\d{4}
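To reproduce this, the lookup is just a two-column CSV. A quick way to generate a local copy (the file name and escaping follow the table above; the script itself is only a convenience sketch):

```python
import csv

# Build a local copy of test_regex_lookup.csv with the patterns above.
# The escaped "/" and "-" are kept exactly as stored in the lookup.
rows = [
    {"pattern_type": "date", "regex": r"\d{2}\/\d{2}\/\d{4}"},
    {"pattern_type": "SSN",  "regex": r"\d{3}\-\d{2}\-\d{4}"},
]
with open("test_regex_lookup.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["pattern_type", "regex"])
    writer.writeheader()
    writer.writerows(rows)
```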
We can pull these regex patterns into the parent search via subsearches inside an eval, and then use those patterns in another eval to extract data.
Example:
| makeresults
| eval
data="very personal info on John Doe: Birthday: 04/12/1973 and SSN: 123-45-6789"
``` pull in regex patterns from lookup ```
| eval
ssn_regex=[ | inputlookup test_regex_lookup.csv where pattern_type="SSN" | fields + regex | eval regex="\"".'regex'."\"" | return $regex ],
bday_regex=[ | inputlookup test_regex_lookup.csv where pattern_type="date" | fields + regex | eval regex="\"".'regex'."\"" | return $regex ]
``` use regex pattern fields to extract matches from another field "data" ```
| eval
ssn=replace(data, ".*(".'ssn_regex'.").*", "\1"),
bday=replace(data, ".*(".'bday_regex'.").*", "\1")
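The extraction trick here is to wrap the stored pattern in a capturing group and surround it with greedy `.*`, so replace() collapses the whole string down to the captured match. The same idea in Python, for illustration (sample data from above; the `extract` helper name is mine):

```python
import re

data = "very personal info on John Doe: Birthday: 04/12/1973 and SSN: 123-45-6789"

def extract(text, pattern):
    # Equivalent of Splunk's replace(data, ".*(<pattern>).*", "\1"):
    # if the pattern matches, the whole string is replaced by the capture;
    # if it doesn't, the string comes back unchanged. The greedy leading .*
    # means the last occurrence wins when there are several.
    return re.sub(r".*(" + pattern + r").*", r"\1", text)

print(extract(data, r"\d{3}\-\d{2}\-\d{4}"))  # 123-45-6789
print(extract(data, r"\d{2}\/\d{2}\/\d{4}"))  # 04/12/1973
```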
The resulting dataset has the ssn and bday values extracted into their own fields.
I'm sure there are other methods that can work, or we can build on this one further.
I'm curious about different approaches as well, so I'll post updates if I figure out any.
Update:
I was able to shorten the SPL to a single eval by using the nifty lookup() eval function (added in Splunk 9.0):
| makeresults
| eval
data="very personal info on John Doe: Birthday: 04/12/1973 and SSN: 123-45-6789"
``` get regex pattern from lookup and utilize against raw data in another field to extract data into net-new field ```
| eval
ssn=replace(data, ".*(".spath(lookup("test_regex_lookup.csv", json_object("pattern_type", "SSN"), json_array("regex")), "regex").").*", "\1"),
bday=replace(data, ".*(".spath(lookup("test_regex_lookup.csv", json_object("pattern_type", "date"), json_array("regex")), "regex").").*", "\1")
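To unpack that one-liner: lookup() takes the lookup name, a JSON object of input fields to match on, and a JSON array of output fields to return, and hands back the matching row as a JSON object, which is why spath() is needed to pull the regex string out. A rough Python model of the data flow (the exact JSON shape is an assumption based on observed behavior):

```python
import json
import re

data = "very personal info on John Doe: Birthday: 04/12/1973 and SSN: 123-45-6789"

# Roughly what lookup("test_regex_lookup.csv", json_object("pattern_type","SSN"),
# json_array("regex")) returns: a JSON object keyed by the requested output
# fields (shape assumed for illustration).
lookup_result = '{"regex": "\\\\d{3}\\\\-\\\\d{2}\\\\-\\\\d{4}"}'

pattern = json.loads(lookup_result)["regex"]        # spath(..., "regex")
ssn = re.sub(".*(" + pattern + ").*", r"\1", data)  # replace(data, ".*(...).*", "\1")
print(ssn)  # 123-45-6789
```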
That's a pretty good solution, but I was looking for something that wouldn't require updating the query when another regex is added to the list.
Alright, I figured you would want the fields extracted under their intended field names instead of any-and-all matches landing in a single multivalue field, so here is SPL to do that.
<base_search>
``` this SPL requires a field named "data" containing a raw string as its value ```
``` this can be macroed by replacing the input field "data" and lookup name "test_regex_lookup.csv" ```
``` example: | `extract_regex_from_lookup(data, test_regex_lookup.csv)` ```
``` pull in all regex patterns as an array of json objects into the parent search as a new field ```
| join type=left
[ | inputlookup test_regex_lookup.csv | tojson str(pattern_type) str(regex) output_field=regex_json | stats values(regex_json) as regex_json | eval regex_array=mv_to_json_array(regex_json) | fields + regex_array ]
``` parse array of json objects into a multivalued field of json objects ```
| eval
regex_json=json_array_to_mv(regex_array)
``` remove array (no longer needed) ```
| fields - regex_array
``` search the raw text of field "data" for matches against any of the regex patterns contained in the regex_json multivalue field ```
| eval
regex_match_json=case(
mvcount(regex_json)==1,
if(match(data, spath(regex_json, "regex")), json_set(regex_json, "matches", replace(data, ".*(".spath(regex_json, "regex").").*", "\1")), null()),
mvcount(regex_json)>1, mvmap(regex_json, if(match(data, spath(regex_json, "regex")), json_set(regex_json, "matches", replace(data, ".*(".spath(regex_json, "regex").").*", "\1")), null()))
)
``` remove regex_json (no longer needed) ```
| fields - regex_json
``` (optional) multivalued field containing all pattern matches ```
| eval
all_regex_matches=mvmap(regex_match_json, spath(regex_match_json, "matches"))
``` create temporary json object to hold key/value pairs for pattern_type attribution ```
| eval
tmp_json=json_object()
``` loop through the regex_match_json multivalue field and assign a key/value entry to "tmp_json" for the (pattern_type: matches) ```
| foreach mode=multivalue regex_match_json
[
| eval
tmp_json=json_set(tmp_json, spath('<<ITEM>>', "pattern_type"), spath('<<ITEM>>', "matches"))
]
``` full spath against tmp_json to get field extractions for all matches against the pattern_types ```
| spath input=tmp_json
``` remove temporary json object (no longer needed) ```
| fields - tmp_json
``` (optional) remove regex_match_json field ```
| fields - regex_match_json
``` end of `extract_regex_from_lookup(2)` macro ```
``` table all extracted fields derived from "data" field and regex stored in lookup "test_regex_lookup.csv" ```
| table _time, data, all_regex_matches, *
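If the mvmap/foreach JSON gymnastics are hard to follow, the end-to-end logic is equivalent to this Python sketch: read each lookup row, test its regex against the data, and record the match under that row's pattern_type (lookup contents hypothetical, mirroring the example table; Splunk's replace() picks the last occurrence where this sketch picks the first):

```python
import csv
import io
import re

# Stand-in for | inputlookup test_regex_lookup.csv
lookup_csv = "pattern_type,regex\ndate,\\d{2}/\\d{2}/\\d{4}\nSSN,\\d{3}-\\d{2}-\\d{4}\n"
data = "very personal info on John Doe: Birthday: 04/12/1973 and SSN: 123-45-6789"

extracted = {}            # plays the role of tmp_json / the spath'd fields
all_regex_matches = []    # plays the role of all_regex_matches
for row in csv.DictReader(io.StringIO(lookup_csv)):
    m = re.search(row["regex"], data)   # match(data, regex)
    if m:                               # rows with no match become null()
        extracted[row["pattern_type"]] = m.group(0)
        all_regex_matches.append(m.group(0))

print(extracted)
# {'date': '04/12/1973', 'SSN': '123-45-6789'}
```

Adding a row to the lookup adds a pattern to the loop; nothing in the query (or this sketch) has to change.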
I'm pretty happy with how this turned out, but there may be an easier way of doing it.
I'd be glad to have anybody else chime in on a simpler way of accomplishing this.
I have just always had problems piping data from a lookup into a parent search as executable SPL, other than pulling it into an eval of some sort.
Reference screenshot of sample output
So the current SPL assigns each match to a field named after the corresponding row's pattern_type value from the lookup.
In this example that's SSN, date, and name.
Not the prettiest solution, but it shouldn't require updating, since it loops through all regexes from the lookup using the mvmap() function.
| makeresults
| eval
data="very personal info on John Doe: Birthday: 04/12/1973 and SSN: 123-45-6789"
| join type=left
[ | inputlookup test_regex_lookup.csv | tojson str(pattern_type) str(regex) output_field=regex_json | stats values(regex_json) as regex_json | eval regex_array=mv_to_json_array(regex_json) | fields + regex_array ]
| eval
regex_json=json_array_to_mv(regex_array)
| fields - regex_array
| eval
regex_patterns=case(
mvcount(regex_json)==1, spath(regex_json, "regex"),
mvcount(regex_json)>1, mvmap(regex_json, spath(regex_json, "regex"))
)
| eval
regex_match_json=case(
mvcount(regex_json)==1,
if(match(data, spath(regex_json, "regex")), json_set(regex_json, "matches", replace(data, ".*(".spath(regex_json, "regex").").*", "\1")), null()),
mvcount(regex_json)>1, mvmap(regex_json, if(match(data, spath(regex_json, "regex")), json_set(regex_json, "matches", replace(data, ".*(".spath(regex_json, "regex").").*", "\1")), null()))
)
| fields - regex_json, regex_patterns
| eval
all_regex_matches=mvmap(regex_match_json, spath(regex_match_json, "matches"))
It also produces another field (regex_match_json) that maps each extraction back to the pattern that matched it, for reference.