[Puzzles] Solve, Learn, Repeat: Character substitutions with Regular Expressions

ITWhisperer
SplunkTrust

This challenge was first posted on the Slack #puzzles channel.

For BORE at .conf23, we had a puzzle question which was to obfuscate email addresses with the same number of characters. Rather than providing spoilers (in case we run BORE again and allow previous questions to be answered), I have devised another puzzle on similar lines.

Replace each of the non-space characters in the street address field with an asterisk (*), just using a single regular expression (rex command). The events in question are of fixed length with pipes (|) to delimit the fields. The street address is in the third field.

For example:

|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|
|17312|Quebec Terrace, 1008|1008 Quebec Ter        |Silver Spring|20903|
|17171|Cambridge Square    |4901 Battery Ln        |Bethesda     |20814|

Should become:

|17548|Paint Branch Park   |***** *** ******** ****|Silver Spring|20904|
|17312|Quebec Terrace, 1008|**** ****** ***        |Silver Spring|20903|
|17171|Cambridge Square    |**** ******* **        |Bethesda     |20814|

For the full set of complete test strings, please follow this link to regex101.com 

This article contains spoilers!

In fact, the whole article is a spoiler as it contains solutions to the puzzle. If you are trying to solve the puzzle yourself and just want some pointers to get you started, stop reading when you have enough, and return if you get stuck again, or just want to compare your solution to mine!

Where to start?

Perhaps the first thing to notice is that this solution requires a substitution (mode=sed for the rex command). The second thing to notice is that each non-space character has to be replaced individually, which implies that each match must target exactly one non-space character. So, the basic part of the substitution would look like this:

s/\S/*/g

https://regex101.com/r/TgifxQ/1

The third thing to notice is that the substitution should only be applied within a field, so it should not match the field delimiters:

s/[^\s\|]/*/g

https://regex101.com/r/TgifxQ/2
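The difference between these first two expressions can be reproduced outside Splunk. The following is a minimal sketch that uses Python's `re.sub` as a stand-in for the sed-mode rex command (an assumption made for illustration only), applied to one of the sample events:

```python
import re

# One of the sample events from the puzzle statement.
event = "|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|"

# s/\S/*/g : every non-space character is replaced, pipes included.
every_non_space = re.sub(r"\S", "*", event)

# s/[^\s\|]/*/g : pipes are excluded from the class, so the delimiters survive.
keep_delimiters = re.sub(r"[^\s|]", "*", event)

print(every_non_space)
print(keep_delimiters)
```

Both results keep the original event length, because each match is exactly one character wide and is replaced by exactly one asterisk.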

But where does the street address start? Since each field has pipe delimiters, the first two fields should be ignored:

s/^\|[^\|]*\|[^\|]*\|[^\s\|]/*/g

https://regex101.com/r/TgifxQ/3

Or, since the fields have the same pattern, to put it another way:

s/^\|([^\|]*\|){2}[^\s\|]/*/g

https://regex101.com/r/TgifxQ/4

However, as you can see, the match now consumes the first two fields, so they are replaced along with the character. To fix this, we need to capture the first two fields and put them back in the substitution:

s/(^\|([^\|]*\|){2})[^\s\|]/\1*/g

https://regex101.com/r/TgifxQ/5

One thing to note here is that the regex101.com expression linked to uses the PCRE (PHP <7.3) FLAVOR (which is not the default) so that it allows \1 to be used instead of $1 which then matches the way the Splunk rex command works.
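The effect of the capture group can be checked with a short sketch. Python's `re.sub` is assumed here as a stand-in for sed-mode rex; like rex, it accepts `\1` in the replacement string:

```python
import re

event = "|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|"

# s/^\|([^\|]*\|){2}[^\s\|]/*/g : the two leading fields are consumed and lost.
without_capture = re.sub(r"^\|([^|]*\|){2}[^\s|]", "*", event)

# s/(^\|([^\|]*\|){2})[^\s\|]/\1*/g : group 1 puts the consumed prefix back.
with_capture = re.sub(r"(^\|([^|]*\|){2})[^\s|]", r"\1*", event)

print(without_capture)
print(with_capture)
```

Because the pattern is anchored with `^`, it can match only once per event, which is why only the first character of the street address is replaced.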

As you can see, we have now managed to substitute the first non-space character of the third field, but what about the other characters in the field? These other characters are not preceded by just the first two fields; they are also preceded by the earlier characters of the third field. To rectify this, we can try making the first two fields optional.

s/(^\|([^\|]*\|){2})?[^\s\|]/\1*/g

https://regex101.com/r/TgifxQ/6

This now gets us all the non-space characters in the third field substituted. Unfortunately, it also substitutes all the other non-space characters in the subsequent fields.
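A small sketch makes the over-matching visible. It assumes Python's `re.sub` in place of sed-mode rex; in Python 3.5 and later, an unmatched `\1` is replaced by an empty string, which is the behaviour this step relies on:

```python
import re

event = "|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|"

# s/(^\|([^\|]*\|){2})?[^\s\|]/\1*/g
# The first match consumes the two leading fields (and restores them via \1);
# every later match skips the optional group and replaces a single character.
masked = re.sub(r"(^\|([^|]*\|){2})?[^\s|]", r"\1*", event)

print(masked)
```

The first two fields survive only because the very first match swallows them and puts them back; everything after the street address is masked as well.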

Where to stop?

From the definition of the event format, we know that the field is delimited by a pipe, so we could simply look forward until we reach the next pipe character:

s/(^\|([^\|]*\|){2})?[^\s\|](?=[^\|]*\|)/\1*/g

https://regex101.com/r/TgifxQ/7

However, this does not work, because the lookahead is also satisfied by the characters in all the subsequent fields! But this event format has only 19 fields, so we can look ahead to make sure that the current field (street address) completes and is followed by exactly 16 more pipe-delimited fields, finishing at the end of the event. That is 17 pipe-terminated runs in total:

s/(^\|([^\|]*\|){2})?[^\s\|](?=([^\|]*\|){17}$)/\1*/g

https://regex101.com/r/TgifxQ/8
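The arithmetic behind the repeat count can be sketched as follows. This is an illustration in Python rather than rex, `counted_lookahead` is a hypothetical helper name, and the sample event shown in the puzzle statement is abbreviated to five fields, so the count used on it is 3 rather than 17:

```python
import re

def counted_lookahead(total_fields: int) -> str:
    """Build the pattern for an event with `total_fields` pipe-delimited fields."""
    # The lookahead finishes the third field (1 run) and then crosses every
    # later field (total_fields - 3 runs), i.e. total_fields - 2 runs in all.
    remaining = total_fields - 2
    return r"(^\|([^|]*\|){2})?[^\s|](?=([^|]*\|){%d}$)" % remaining

# Abbreviated five-field sample event (the full events have 19 fields).
event = "|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|"

masked = re.sub(counted_lookahead(5), r"\1*", event)
print(masked)
```

Since `[^|]*` can never cross a pipe, the number of runs left before the end of the event is different for every field, so only characters in the third field satisfy the lookahead.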

This seems to have done the trick, but the expression looks a bit bloated. Can we optimise it?

Optimisation?

The current expression takes over 125k steps to complete the match, and is 45 characters long. Since the lookahead already requires exactly 17 pipe-terminated runs (the rest of the street address plus the 16 fields after it) before the end of the event, we no longer need to skip over the first two fields (and they are no longer needed in the substitution):

s/[^\s\|](?=([^\|]*\|){17}$)/*/g

https://regex101.com/r/TgifxQ/9

However, while this still completes the substitution, it actually takes more steps (over 143k), mainly because it no longer has a starting anchor for the first character of the field. So, how else can we detect the end of the third field? It just so happens, for this event format, that the fourth field is always 18 characters wide! This allows us to just look forward to make sure the next field is 18 (non-pipe) characters wide (followed by a pipe):

s/[^\s\|](?=[^\|]*\|[^\|]{18}\|)/*/g

https://regex101.com/r/TgifxQ/10

This works because there is only one field which is 18 characters wide, i.e. it is a unique anchor for the match, and it brings the step count down to just over 16.3k, an overall saving of about 87%!
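The same idea can be tried on the abbreviated sample events. In this sketch (Python's `re.sub` standing in for sed-mode rex), the field after the street address is 13 characters wide rather than 18, so `NEXT_FIELD_WIDTH` is an assumption that only fits the sample shown in the puzzle statement:

```python
import re

# Abbreviated five-field sample event; the field after the street address
# ("Silver Spring") is 13 characters wide here, not 18 as in the full data.
event = "|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|"
NEXT_FIELD_WIDTH = 13

# [^\s|]            one non-space, non-pipe character
# (?=[^|]*\|        rest of the current field and its closing pipe
#    [^|]{13}\|)    then a field that is exactly 13 characters wide
pattern = r"[^\s|](?=[^|]*\|[^|]{%d}\|)" % NEXT_FIELD_WIDTH

masked = re.sub(pattern, "*", event)
print(masked)
```

The lookahead stops after one neighbouring field instead of scanning to the end of the event, which is consistent with the lower step count reported above. It is only safe when no other field is followed by a field of that same width.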

Summary

In summary, the key to this substitution is being able to define a class for the single character you want to substitute / obfuscate, and to define a unique forward anchor to know when to stop substituting characters.

PickleRick
SplunkTrust

Let me add a different outlook on this riddle. Not a different solution, but maybe a slightly different way of getting there.

The first thing, which might not be obvious to newcomers to regexes: while a natural initial thought would be to match the whole street part (the part between the third and fourth pipe delimiters), you can't just do that, because the replacement string in a substitution has a fixed width while the street text does not. You could of course replace the whole street part with a fixed string of asterisks, but that's not what we're after. We want to replace only the non-space characters and keep the spaces where they are.

That means we need to replace only a single character each time.

That is the core of this challenge: how do we match a single character, but only in a specific context? Well, this is where lookaheads and lookbehinds come into play. With them we can "check" the surroundings of our match without consuming any of the text being searched. And that's what we need here. We need to match a non-whitespace character which is not a pipe either (this part is easy)

[^\s|]

but make it match only if it's in the street part of our event.

As a side note - strictly formally, the result will not be a regular expression in language theory terms.

But we have PCRE at our disposal, PCRE has some useful constructs, and at this point we don't care whether they are "formally" regular expressions 😉

So, circling back to our lookaheads and lookbehinds - we hit another brick wall trying to intuitively anchor our match to the beginning of the event. It would be very easy to use lookbehind and match only after a string containing three pipe characters (or even better, we could use the fact that the fields are of constant width and just count the characters up to the third pipe) and we'd end up with something like

(?<=^.{50}\|[^|]+)[^\s|]

(the 50 should be adjusted to the actual number of characters; I was too lazy to count 😁) - the lookbehind would match the part up to the pipe and additional non-pipe characters after that and we'd accept any single non-space character after that.

And it would be a very good idea but there is one issue with it - it won't work.

Why is that? Because lookbehinds must have a constant width, and the [^|]+ part does not. With a constant-width lookbehind we could only match the first character in the street part. So that's not the way to go.
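Python's `re` module happens to enforce the same fixed-width rule for lookbehinds, so the limitation can be illustrated there. This is a sketch under that assumption, not a statement about how Splunk reports the error; the sample event is one of the abbreviated five-field events:

```python
import re

event = "|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|"

# The variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=^.{50}\|[^|]+)[^\s|]")
    variable_width_accepted = True
except re.error:
    variable_width_accepted = False

# Index of the third pipe, i.e. the number of characters that precede it.
prefix_len = [i for i, ch in enumerate(event) if ch == "|"][2]

# A fixed-width lookbehind compiles, but only the character directly after
# the third pipe can ever satisfy it.
fixed = re.compile(r"(?<=^.{%d}\|)[^\s|]" % prefix_len)
masked = fixed.sub("*", event)

print(variable_width_accepted)
print(masked)
```

Only one position in the event sits exactly `prefix_len + 1` characters from the start, so only one character is replaced.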

Luckily, the lookahead does not have the fixed-width limitation. So we can use the fact that our matching character must be followed by a string with well-defined content. The initial naive approach of just counting pipe characters for the remaining fields gets us to this lookahead, which matches anything up to the first pipe character and then 16 more pipe-terminated fields, because we have that many more fields.

(?=[^|]*\|([^|]+\|){16}$) 

This approach will work, but it will be very inefficient (it requires almost 142k steps to match our sample data). But we can use the fact that the fields are of constant size.

Simple replacement of 16 repetitions of the pipe-ending field with a constant number of characters does wonders.

(?=[^|]*\|.{135}$)

This lookahead drops us down to 16173 steps.

So the final pattern to match and replace with a single asterisk would be 

[^| ](?=[^|]*\|.{135}$)
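As a rough check of this constant-tail idea, here is a sketch in Python (standing in for rex). It uses one of the abbreviated five-field sample events, where 20 characters follow the street field's closing pipe, so `TAIL` replaces the 135 of the full-width data:

```python
import re

# Abbreviated five-field sample event from the puzzle statement.
event = "|17548|Paint Branch Park   |12005 Old Columbia Pike|Silver Spring|20904|"

# Number of characters after the street field's closing pipe in this sample.
TAIL = len("Silver Spring|20904|")

# [^| ]             one character that is neither a pipe nor a space
# (?=[^|]*\|        rest of the current field and its closing pipe
#    .{TAIL}$)      then exactly TAIL characters up to the end of the event
pattern = r"[^| ](?=[^|]*\|.{%d}$)" % TAIL

masked = re.sub(pattern, "*", event)
print(masked)
```

Because the events are fixed-width, the distance from a field's closing pipe to the end of the event is different for every field, which is what makes the tail length a unique anchor.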

 
