Help to modify existing regex to mask sensitive PII?

smakwana
Engager

Hi Splunkers,

I am looking for help modifying our current regex to meet our updated project criteria.

Link: https://docs.splunk.com/Documentation/SplunkCloud/6.6.3/Data/Anonymizedata

Current Log format: Value1 | Value2 | Value3 | Value4 | Value5 | Value6 | Value7 | Value8 | Value9 | Value10 | Value11 | Value12 | ClientIP| 
LogEvent="Response",MethodName="get.complete",ActionResult="Success",ApplicationNumber="1234567890",ApplicationLanguage="1",Section="SUMMARY",FirstName="jhon",LastName="doe",Gender="M",DateOfBirth="7/19/1993",SocialSecurityNumber="123456789",MaritalStatus="0",RaceInformation="Item8",CitizenshipCode="1",County="20",AddressLine1="221 Street",City="Washington",State="USA" 

I want to write a regular expression that masks all key-value pairs (basically the PII data) that come after ,MethodName="get.complete", (i.e. ApplicationNumber, FirstName, DateOfBirth, SocialSecurityNumber, MaritalStatus, etc.).

The order of the fields up to MethodName is constant and never changes. Every event has exactly the same order up to "MethodName", with additional PII elements added after "MethodName".

Note: The location of the fields to be masked may change at times, but they will always be in key-value pair format (e.g. ,ApplicationNumber="1234567890",ApplicationLanguage="1",Section="SUMMARY",FirstName="Sherlock",LastName="Holmes",Gender="M",DateOfBirth="7/19/1976").

The following are the solutions I was planning to use to mask the data at index time.

PROPS Example Using SEDCMD Regex:

[sourcetype]
SEDCMD-mask = regex to skip the first three key-value pairs and mask the rest

OR

Transforms Example Using regex:

[ssn-anonymizer]
REGEX = regex to capture ssn
FORMAT = format to mask entire data
DEST_KEY = _raw

Our current approaches do not fulfil this requirement.
1. The expression below drops all values after MethodName instead of masking them.

SEDCMD-maskPHI = s/(MethodName=\"[^\"]+\",).*$/\1/g 

2. The regex below masks all key-value pairs after the last |. But we need to mask only what comes after MethodName="get.complete".

SEDCMD-maskall = s/(\w+)="(?:(?:(?!\s*?\|).)*?)"(?!.*\|)/\1="########"/g 
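The behaviour of the two attempts above can be reproduced outside Splunk. The following is a minimal sketch that uses Python's re module as a stand-in for the regex engine behind SEDCMD; the shortened event string is an assumption made for illustration, not the full sample event.

```python
import re

# Shortened, hypothetical version of the sample event (assumption for illustration).
event = (
    'Value1 | Value2 | ClientIP| '
    'LogEvent="Response",MethodName="get.complete",'
    'FirstName="jhon",SocialSecurityNumber="123456789"'
)

# Attempt 1: the replacement keeps only group 1, so everything matched by .*$
# after MethodName is deleted rather than masked.
attempt1 = re.sub(r'(MethodName="[^"]+",).*$', r'\1', event)

# Attempt 2: masks every key="value" pair that has no pipe after it on the
# line, which also includes LogEvent and MethodName.
attempt2 = re.sub(
    r'(\w+)="(?:(?:(?!\s*?\|).)*?)"(?!.*\|)', r'\1="########"', event
)

print(attempt1)
print(attempt2)
```

In this sketch the first result ends right after MethodName="get.complete", with the PII fields gone, and the second result has LogEvent and MethodName masked along with everything else.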

Thank you for all of your help and advice.

[Edit: fixed formatting and used the code button so characters no longer are being eaten.]

1 Solution

harsmarvania57
SplunkTrust

Hi @smakwana,

If you would like to use props.conf and transforms.conf, please use the configuration below on the Indexer or Heavy Forwarder, whichever comes first. You can test the regex below with your sample data here: https://regex101.com/r/F6zv8u/1

props.conf

[yoursourcetype]
TRANSFORMS-anonymize = PII-anonymizer

transforms.conf

[PII-anonymizer]
REGEX = (?m)^(.*MethodName=\"get\.complete\").*(.*)$
FORMAT = $1#######$2
DEST_KEY = _raw
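To see what this REGEX and FORMAT pair does to an event, here is a rough sketch in Python. It assumes Python's re module behaves like Splunk's regex engine for this pattern, writes Splunk's $1 and $2 as \1 and \2, and uses a shortened single-line event made up for illustration.

```python
import re

# Shortened, hypothetical single-line event (assumption for illustration).
event = 'LogEvent="Response",MethodName="get.complete",FirstName="jhon",LastName="doe"'

# Same pattern as the REGEX setting, without the backslash-escaped quotes.
regex = r'(?m)^(.*MethodName="get\.complete").*(.*)$'

# Group 1 is everything up to and including MethodName="get.complete".
# The greedy .* before group 2 consumes the rest of the line, so group 2
# is always empty and the whole tail collapses into one run of # characters.
masked = re.sub(regex, r'\1#######\2', event)

print(masked)
```

The key names after MethodName do not survive in this sketch; only the prefix and a single mask string remain.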

EDIT1: Updated transforms.conf configuration.
EDIT2: If you want to use sed, you can use the regex below

\b(?:(?!LogEvent|MethodName)(\w+))\b="(?:(?:.)*?)"

So your SED configuration will be

SEDCMD-maskall = s/\b(?:(?!LogEvent|MethodName)(\w+))\b="(?:(?:.)*?)"/\1="########"/g
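As a quick check outside Splunk, the sed expression can be exercised with Python's re.sub. This is only a sketch: Python's re module stands in for Splunk's regex engine, and the mask helper and the two shortened events are hypothetical names and data introduced here.

```python
import re

# Same pattern as in the SEDCMD above.
SED_PATTERN = r'\b(?:(?!LogEvent|MethodName)(\w+))\b="(?:(?:.)*?)"'


def mask(event: str) -> str:
    """Replace the value of every key except LogEvent/MethodName with #."""
    return re.sub(SED_PATTERN, r'\1="########"', event)


# Two hypothetical events with the PII fields in a different order.
original = 'LogEvent="Response",MethodName="get.complete",FirstName="jhon",DateOfBirth="7/19/1993"'
reordered = 'LogEvent="Response",MethodName="get.complete",DateOfBirth="7/19/1993",FirstName="jhon"'

print(mask(original))
print(mask(reordered))
```

Because the negative lookahead excludes keys by name rather than by position, the order of the remaining fields does not matter. Note that it also skips any key that merely starts with LogEvent or MethodName.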

For testing purposes I made the query below based on your data

| makeresults
| eval _raw="Current Log format: Value1 | Value2 | Value3 | Value4 | Value5 | Value6 | Value7 | Value8 | Value9 | Value10 | Value11 | Value12 | ClientIP| 
 LogEvent=\"Response\",MethodName=\"get.complete\",ActionResult=\"Success\",ApplicationNumber=\"1234567890\",ApplicationLanguage=\"1\",Section=\"SUMMARY\",FirstName=\"jhon\",LastName=\"doe\",Gender=\"M\",DateOfBirth=\"7/19/1993\",SocialSecurityNumber=\"123456789\",MaritalStatus=\"0\",RaceInformation=\"Item8\",CitizenshipCode=\"1\",County=\"20\",AddressLine1=\"221 Street\",City=\"Washington\",State=\"USA\""
 | rex mode=sed "s/\b(?:(?!LogEvent|MethodName)(\w+))\b=\"(?:(?:.)*?)\"/\1=\"########\"/g"

Which gives the result below

Current Log format: Value1 | Value2 | Value3 | Value4 | Value5 | Value6 | Value7 | Value8 | Value9 | Value10 | Value11 | Value12 | ClientIP| 
 LogEvent="Response",MethodName="get.complete",ActionResult="########",ApplicationNumber="########",ApplicationLanguage="########",Section="########",FirstName="########",LastName="########",Gender="########",DateOfBirth="########",SocialSecurityNumber="########",MaritalStatus="########",RaceInformation="########",CitizenshipCode="########",County="########",AddressLine1="########",City="########",State="########"

harsmarvania57
SplunkTrust

In the given solution, the transforms.conf example masks everything after MethodName="get.complete", so please use the SED option, which works irrespective of the location of the fields ApplicationNumber, FirstName, etc.

smakwana
Engager

@harsmarvania57..thank you so much. It resolved our issue.

harsmarvania57
SplunkTrust

Feel free to upvote my answer if it really helps. 😛

nishitdarade
Explorer

@harsmarvania57 I had the same issue and this solved it. Thank You. 🙂
