A customer was using the Splunk "scrub" command to anonymize sensitive data (e.g. user names) at search time. While this worked well for the most part, they found that some names were only partially anonymized or not anonymized at all. They wrote a search to highlight these:
index=test | table _time user | eval _user=user | scrub user | eval orig_user=_user | stats values(user) as users count by orig_user
As we can see, Splunk does a good job of anonymizing the user names, except for "Sarah Hardy", which is only partially anonymized, and "Mike Smith", which is not anonymized at all.
This is actually a documentation issue: scrub is working as intended, just not as documented. I will try to explain 🙂
The "scrub" and "splunk anonymize" (used to anonymize diags) commands share a common library.
The scrub documentation states:
public-terms
Description: Specify a filename that includes the public terms to be anonymized.
private-terms
Description: Specify a filename that includes the private terms to be anonymized.
name-terms
Description: Specify a filename that includes names to be anonymized.
dictionary
Description: Specify a filename that includes a dictionary of terms to be anonymized.
timestamp-config
Description: Specify a filename that includes time configurations to be anonymized.
namespace
Description: Specify an application that contains the alternative files to use for anonymizing, instead of using the built-in anonymizing files.
The anonymize command documentation states:
public-terms file containing a list of locally-used words to NOT anonymize
private-terms file containing a list of words to anonymize
name-terms file containing a list of common English personal
names that Splunk uses to anonymize names with
dictionary file containing a global list of commonly-used
words to NOT anonymize - unless they are in the
timestamp-config file that determines how timestamps are parsed
Note that dictionary and public-terms are documented in the anonymize documentation as having the OPPOSITE effect to the one described for scrub. The anonymize documentation describes the actual behavior: dictionary.txt and public-terms.txt contain lists of words NOT to anonymize, unless those words also appear in private-terms.txt.
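To make the precedence concrete, here is a minimal Python sketch of the rule as the anonymize documentation describes it. It is only a model of that rule, not Splunk's code: the function names (should_anonymize, scrub_preview), the tiny word sets standing in for dictionary.txt and public-terms.txt, the per-word matching, and the "X" placeholders are all my own assumptions for illustration.

```python
# Simplified model of the documented precedence rule; NOT Splunk's implementation.
# The word sets are tiny hypothetical stand-ins for the real files.
DICTIONARY = {"smith", "mike", "hardy"}  # stand-in for dictionary.txt
PUBLIC_TERMS = set()                     # stand-in for public-terms.txt


def should_anonymize(word, private_terms=frozenset()):
    """Return True if the word would be replaced under the documented rule."""
    w = word.lower()
    if w in private_terms:
        # Private terms take precedence over dictionary and public terms.
        return True
    # Words found in the dictionary or public terms are left alone.
    return w not in DICTIONARY and w not in PUBLIC_TERMS


def scrub_preview(text, private_terms=frozenset()):
    """Replace each word that would be anonymized with X's of the same length."""
    return " ".join(
        "X" * len(word) if should_anonymize(word, private_terms) else word
        for word in text.split()
    )


print(scrub_preview("Sarah Hardy"))                    # XXXXX Hardy
print(scrub_preview("Mike Smith"))                     # Mike Smith
print(scrub_preview("Mike Smith", {"mike", "smith"}))  # XXXX XXXXX
```

Under these assumptions, a word survives as soon as it appears in either "do not anonymize" list, and only a private-terms entry overrides that.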
Surnames like Smith and Hardy are included in dictionary.txt because "smith" is both a noun and a verb and "hardy" is an adjective.
"Mike Smith" fails on two counts, as both "smith" and "mike" are included in dictionary.txt. Adding "Mike Smith" to private-terms.txt resolves the issue.