Solved: Regex extraction advice for phone numbers - Replace for multivalued fields

jhall0007 · ‎03-24-2017

Hello,

I am trying to extract and normalize some phone numbers that appear in inconsistent formats. Below I have tried to recreate a realistic example of what my data looks like. It contains multivalues, special characters, and numbers of varying lengths. I would prefer to do this at search time in my props.conf / transforms.conf.

Ideally I'd like to use something similar to a transforms statement that says, start at a quotation mark, read all digits, stop at the next quotation mark.

I had considered doing this with the following config, but it does not appear to handle multivalued fields. Could I please get some suggestions on how to correct my config, or on a more efficient way to go about this?

In props.conf:

EXTRACT-my_stanza
EVAL-clean_numbers = replace(phone_number, "\D", "")

In transforms.conf:

[my_stanza]
SOURCE_KEY = 
REGEX = \"(?<phone_number>\d+[^\"])
MV_ADD = true

Examples:

Log 1:

"(223) 456-0001"

Log 2:

"223-456 0002","(223)456-0003 1234"
"223-456 0101","223-456-0102"

Log 3:

"223-456-0004"

Log 4:

"234560005","(223)4560006","223-456-0007"

Log 5:

"1223456-0008"

Desired results:

Log 1:

1234560001

Log 2:

1234560002
1234560003

Log 3:

1234560004

Log 4:

1234560005
1234560006
1234560007

Log 5:

1234560008

woodcock · ‎03-25-2017

You need to realize that field extractions may only contain contiguous substrings of the _raw field; it is not possible to extract fields where characters in the middle are dropped, nor where characters anywhere are modified.

Entirely new fields may be created with calculated fields or with SPL inside a search that does those things (both are search-time operations). But since this would require multiple eval calls in sequence, and the EVAL parser processes all lines in any props.conf in parallel, we cannot use that option. So here is the only way to do it:

In props.conf

REPORT-phone_numbers = phone_numbers

In transforms.conf:

[phone_numbers]
REGEX = "([^"]+)"
FORMAT = phone_numbers::$1
MV_ADD = true

To fully normalize, you will need to clean the extra punctuation from inside your search like this:

... | rex field=phone_numbers mode=sed "s/[()\-\s]//g"
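
For readers who want to check the logic outside Splunk, here is a minimal Python sketch of the same two steps: pull out each quoted substring, then strip the punctuation. It only mimics the regex behaviour and is not how Splunk applies transforms. The function name `extract_phone_numbers` is hypothetical, and the pattern here also consumes the closing quote (an assumption on my part) so that a closing quote is not picked up as the start of the next match.

```python
import re

# Step 1 pattern: one quoted value per match. The trailing quote is consumed
# so the next match starts at the next opening quote (assumption, see above).
QUOTED = re.compile(r'"([^"]+)"')

# Step 2 pattern: the same character class as the sed expression.
PUNCT = re.compile(r'[()\-\s]')


def extract_phone_numbers(raw):
    """Return every quoted value in raw with (, ), - and whitespace removed."""
    values = QUOTED.findall(raw)  # roughly what MV_ADD = true collects
    return [PUNCT.sub("", v) for v in values]


# Log 4 from the question
print(extract_phone_numbers('"234560005","(223)4560006","223-456-0007"'))
```

Because step 1 can only return contiguous slices of the raw text, the cleanup has to happen as a separate step, which is the limitation described above.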

jhall0007 · ‎03-27-2017

Hello,

I appreciate your comment.

The problem is that your suggestion requires multiple eval steps, and calculated fields are all executed in parallel when defined in props.conf.

I had done something very similar to your rex mode=sed option, which works fine. The only problems are that (1) I was hoping to simplify this for my users, and (2) I was hoping for a more efficient method that didn't require pulling the data into memory.

Again, thank you for responding to my question.

https://docs.splunk.com/Documentation/Splunk/6.5.2/SearchReference/CommonEvalFunctions
"All EVAL- configurations within a single props.conf stanza are processed in parallel, rather than in any particular sequence. This means you can't 'chain' calculated field expressions, where the evaluation of one calculated field is used in the expression for another calculated field.

Calculated fields can reference all types of field extractions as well as field aliases. They cannot reference lookups, event types, or tags."
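
To make the "no chaining" point concrete, here is a toy Python model of parallel evaluation. It is only an illustration of the behaviour the docs describe, not Splunk's implementation; `eval_parallel` and the field names `no_parens` and `clean_number` are made up for the example.

```python
import re


def eval_parallel(event, exprs):
    """Evaluate every expression against the same pre-EVAL snapshot of the event."""
    snapshot = dict(event)  # no expression sees another expression's output
    results = {name: fn(snapshot) for name, fn in exprs.items()}
    return {**event, **results}


def strip_parens(fields):
    return re.sub(r"[()]", "", fields["phone_number"])


def strip_separators(fields):
    # Depends on no_parens, which is missing from the snapshot.
    value = fields.get("no_parens")
    return None if value is None else re.sub(r"[-\s]", "", value)


out = eval_parallel(
    {"phone_number": "(223) 456-0001"},
    {"no_parens": strip_parens, "clean_number": strip_separators},
)
print(out["no_parens"])     # 223 456-0001
print(out["clean_number"])  # None: the chained step never saw no_parens
```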

woodcock · ‎04-04-2017

Hm; when did that happen? I could have sworn that it used to be top-to-bottom serially, but the docs are clear. I will update my answer according to:

https://docs.splunk.com/Documentation/Splunk/6.5.2/Knowledge/definecalcfields
