[Puzzles] Solve, Learn, Repeat: Character substitutions with Regular Expressions

ITWhisperer
SplunkTrust

This challenge was first posted on the Slack #puzzles channel.

For BORE at .conf23, one of the puzzle questions asked for email addresses to be obfuscated while keeping the same number of characters. Rather than providing spoilers (in case we run BORE again and allow previous questions to be answered), I have devised another puzzle along similar lines.

Replace each of the non-space characters in the street address field with an asterisk (*), using just a single regular expression (rex command). The events in question are of fixed length, with pipes (|) to delimit the fields. The street address is in the third field.

For example:

|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|
|17312|Quebec Terrace, 1008|1008 Quebec Ter        |Silver Spring|20903|
|17171|Cambridge Square    |4901 Battery Ln        |Bethesda     |20814|

Should become:

|17548|Paint Branch Park   |***** *** ******** ****|Silver Spring|20904|
|17312|Quebec Terrace, 1008|**** ****** ***        |Silver Spring|20903|
|17171|Cambridge Square    |**** ******* **        |Bethesda     |20814|

For the full set of complete test strings, please follow this link to regex101.com 

This article contains spoilers!

In fact, the whole article is a spoiler as it contains solutions to the puzzle. If you are trying to solve the puzzle yourself and just want some pointers to get you started, stop reading when you have enough, and return if you get stuck again, or just want to compare your solution to mine!

Where to start?

Perhaps the first thing to notice is that this solution requires a substitution (mode=sed for the rex command). The second is that each non-space character has to be replaced individually, which implies that each match must contain exactly one non-space character. So, the basic part of the substitution would look like this:

s/\S/*/g

https://regex101.com/r/TgifxQ/1

The third thing to notice is that the substitution should only be applied within a field, so it must not match the field delimiters:

s/[^\s\|]/*/g

https://regex101.com/r/TgifxQ/2
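To see the difference between these two character classes outside of Splunk, here is a minimal sketch in Python. It is only a stand-in for rex mode=sed (Python's re module is a different engine, although these two patterns use nothing engine-specific), and the event is the first sample row from above.

```python
import re

# First sample event from the puzzle description.
event = "|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|"

# \S matches every non-space character, so the pipe delimiters are replaced too.
all_non_space = re.sub(r"\S", "*", event)

# Excluding the pipe from the class leaves the field delimiters intact.
keep_pipes = re.sub(r"[^\s\|]", "*", event)

print(all_non_space)
print(keep_pipes)
```

Both results have the same length as the original event, but only the second still has its six pipes.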

But where does the street address start? Since each field has pipe delimiters, the first two fields should be ignored:

s/^\|[^\|]*\|[^\|]*\|[^\s\|]/*/g

https://regex101.com/r/TgifxQ/3

Or, since the fields have the same pattern, to put it another way:

s/^\|([^\|]*\|){2}[^\s\|]/*/g

https://regex101.com/r/TgifxQ/4

However, as you can see, the first two fields are now part of the match, so they are replaced as well. To fix this, we need to capture the first two fields and write them back in the replacement:

s/(^\|([^\|]*\|){2})[^\s\|]/\1*/g

https://regex101.com/r/TgifxQ/5

One thing to note here is that the regex101.com expression linked to uses the PCRE (PHP <7.3) FLAVOR (which is not the default) so that it allows \1 to be used instead of $1 which then matches the way the Splunk rex command works.
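The effect of the capture group can be checked with a small Python sketch, again only as a stand-in for rex mode=sed. Python's re.sub happens to use the same \1 syntax for backreferences in the replacement; the variable names are just for illustration.

```python
import re

# First sample event from the puzzle description.
event = "|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|"

# Without a capture group, the two leading fields are consumed by the match
# and replaced, together with the first address character, by a single "*".
without_group = re.sub(r"^\|([^\|]*\|){2}[^\s\|]", "*", event)

# With the prefix captured as group 1 and written back via \1, only the
# first non-space character of the third field changes.
with_group = re.sub(r"(^\|([^\|]*\|){2})[^\s\|]", r"\1*", event)

print(without_group)  # *2005 Old Columbia Pike|Silver Spring|20904|
print(with_group)     # |17548|Paint Branch Park   |*2005 Old Columbia Pike|Silver Spring|20904|
```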

As you can see, we have now managed to substitute the first non-space character of the third field, but what about the other characters in the field? These are not preceded by just the first two fields; the earlier characters of the third field come before them too. To rectify this, we can try making the first two fields optional:

s/(^\|([^\|]*\|){2})?[^\s\|]/\1*/g

https://regex101.com/r/TgifxQ/6

This now gets us all the non-space characters in the third field substituted. Unfortunately, it also substitutes all the other non-space characters in the subsequent fields.
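A quick Python sketch shows this over-matching on the first sample event. One assumption to be aware of: when the optional group takes no part in a match, Python (3.5 and later) expands \1 to an empty string, which mirrors the behaviour seen on regex101 for this expression.

```python
import re

# First sample event from the puzzle description.
event = "|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|"

# The prefix (first two fields) is now optional.
pattern = re.compile(r"(^\|([^\|]*\|){2})?[^\s\|]")

# The first match carries the prefix in group 1; every later match has an
# unset group 1, so \1 contributes nothing and only "*" is written.
result = pattern.sub(r"\1*", event)

print(result)
```

The first two fields survive, but the street address and every field after it are replaced with asterisks.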

Where to stop?

From the definition of the event format, we know that the field is delimited by a pipe, so we could simply look forward until we reach the next pipe character:

s/(^\|([^\|]*\|){2})?[^\s\|](?=[^\|]*\|)/\1*/g

https://regex101.com/r/TgifxQ/7

However, this does not work, because the lookahead is also satisfied in all the subsequent fields! But this event format has only 19 fields, so we could look ahead to make sure that the current field (street address) completes and is followed by a further 16 pipe-delimited fields, finishing at the end of the event. That makes 17 pipe-terminated segments in total: the remainder of the street address, plus the 16 fields after it.

s/(^\|([^\|]*\|){2})?[^\s\|](?=([^\|]*\|){17}$)/\1*/g

https://regex101.com/r/TgifxQ/8
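The counting can be sketched in Python as follows. The sample rows shown in this article have only five fields, so a count of 17 would never match them; the sketch therefore derives the count from the total number of fields. The function name mask_street_address, its total_fields parameter and the 19-field test event are all made up for illustration.

```python
import re

def mask_street_address(event: str, total_fields: int) -> str:
    """Replace non-space characters in the third pipe-delimited field with '*'."""
    # The lookahead must consume the rest of the third field plus every field
    # after it, each ending in a pipe: (total_fields - 3) + 1 segments.
    remaining = total_fields - 2
    pattern = r"(^\|([^\|]*\|){2})?[^\s\|](?=([^\|]*\|){%d}$)" % remaining
    return re.sub(pattern, r"\1*", event)

# Five-field sample row from the article: the count becomes 3.
short_event = "|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|"
print(mask_street_address(short_event, total_fields=5))

# A made-up 19-field event: the count becomes 17, as in the expression above.
long_event = "|" + "|".join(["id", "name", "1 Main St"] + ["x"] * 16) + "|"
print(mask_street_address(long_event, total_fields=19))
```

Using the wrong count simply leaves the event untouched, because no position is then followed by exactly that many pipe-terminated segments.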

This seems to have done the trick, but the expression looks a bit bloated. Can we optimise it?

Optimisation?

The current expression takes over 125k steps to complete the match, and the pattern is 45 characters long. Since the lookahead already requires exactly 17 pipe-terminated segments before the end of the event, which is only true inside the third field, we no longer need to skip over the first two fields (and they are no longer needed in the substitution):

s/[^\s\|](?=([^\|]*\|){17}$)/*/g

https://regex101.com/r/TgifxQ/9

However, while this still completes the substitution, it actually takes more steps (over 143k), mainly because it no longer has a starting anchor for the first character of the field. So, how else can we detect the end of the third field? It just so happens, for this event format, that the fourth field is always 18 characters wide! This allows us to just look forward to make sure the next field is 18 (non-pipe) characters wide (followed by a pipe):

s/[^\s\|](?=[^\|]*\|[^\|]{18}\|)/*/g

https://regex101.com/r/TgifxQ/10

This works because there is only one field which is 18 characters wide, i.e. it is a unique anchor for the match, and it brings the step count down to just over 16.3k, an overall saving of about 87%!
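Here is a minimal Python sketch of the same idea, as a stand-in for rex mode=sed. In the abbreviated sample rows shown earlier, the field after the street address is 13 characters wide rather than 18, so the width is passed in as a parameter; mask_before_fixed_width and next_width are hypothetical names.

```python
import re

def mask_before_fixed_width(event: str, next_width: int) -> str:
    """Replace non-space characters in the field that is immediately
    followed by a field exactly next_width characters wide."""
    pattern = r"[^\s\|](?=[^\|]*\|[^\|]{%d}\|)" % next_width
    return re.sub(pattern, "*", event)

# Two of the sample rows, built with explicit padding so the field widths
# (5, 20, 23, 13, 5) are easy to see.
rows = [
    ["17548", "Paint Branch Park", "12005 Old Columbia Pike", "Silver Spring", "20904"],
    ["17171", "Cambridge Square", "4901 Battery Ln", "Bethesda", "20814"],
]
widths = [5, 20, 23, 13, 5]
events = [
    "|" + "|".join(value.ljust(width) for value, width in zip(row, widths)) + "|"
    for row in rows
]

masked = [mask_before_fixed_width(event, next_width=13) for event in events]
for line in masked:
    print(line)
```

The anchor only works while the chosen width is unique among the fields; if two fields shared that width, the field before each of them would be masked.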

Summary

In summary, the key to this substitution is being able to define a class for the single character you want to substitute / obfuscate, and to define a unique forward anchor to know when to stop substituting characters.
