Help in creating regex for encryption of data?

nishitdarade
Explorer

Hi Splunkers,

I am looking for help writing a regular expression to anonymize data using a transform in transforms.conf.

Link: https://docs.splunk.com/Documentation/SplunkCloud/6.6.3/Data/Anonymizedata

Current Log format: Timestamp | Category | Machine | ApplicationDomain | ProcessId | ProcessName | ThreadId | LogID | UserName | ActionName | Module | AuthorizationStatus | RequestedBy | RequestingURL | QueryString | HTTPVerb | ClientIP| LogEvent="Response",MethodName="get",ActionResult="Success",ApplicationNumber="1234567890",ApplicationLanguage="1",Section="SUMMARY",FirstName="Shrelock",LastName="Holmes",Gender="M",DateOfBirth="7/19/1976",SocialSecurityNumber="123456789",MaritalStatus="0",RaceInformation="Item8",CitizenshipCode="1",County="20",AddressLine1="221 Baker Street",City="Marylebone",State="London"

I want to write a regular expression that matches all key-value pairs that come after "ClientIP|" (i.e., LogEvent, MethodName, ApplicationNumber, FirstName, DateOfBirth, SocialSecurityNumber, etc.).

Note: The position of the fields may change at times, but they will always be in key-value pair format (e.g., ,ApplicationNumber="1234567890",ApplicationLanguage="1",Section="SUMMARY",FirstName="Sherlock",LastName="Holmes",Gender="M",DateOfBirth="7/19/1976").

Transforms Example:
[ssn-anonymizer]
REGEX = regex to capture ssn
FORMAT = format to mask entire data
DEST_KEY = _raw

I would really appreciate all the help the community can give.

Thank You,
Nish.

1 Solution

mtulett_splunk
Splunk Employee

Provided there are no pipe characters in the key-value pair data, there's a way to do this using SEDCMD. The approach is to look for key-value pairs that have no pipes after them on the current line, then replace those key-value pairs with masked versions. Unfortunately, it's a fairly heavy regex, so be aware of possible performance issues.

In your input's props.conf stanza you should put:

SEDCMD-maskall = s/(\w+)="(?:(?:(?!\s*?\|).)*?)"(?!.*\|)/\1="########"/g

This will replace the values with eight hashes, and only for the values after the last pipe character. In the example below, only the last three values here would match (value4, value5 and value6), as they're the only key-value pairs after the last pipe:

BEFORE:
MyEvent | GET | key2="value2",key3="value3" | 1.2.3.4 | key4="value4",key5="value5" , key6="value6"

AFTER:
MyEvent | GET | key2="value2",key3="value3" | 1.2.3.4 | key4="########",key5="########" , key6="########"
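
To sanity-check the expression outside Splunk, here is a minimal Python sketch that applies the same pattern to the BEFORE line. It assumes Python's `re` module handles these lookaheads the same way as the regex engine behind SEDCMD, which should hold for the constructs used here but is no substitute for testing against a real input. The helper name `mask_after_last_pipe` is made up for illustration.

```python
import re

# Same pattern as the SEDCMD expression: a key="value" pair with no pipe
# anywhere after it on the line.
PATTERN = re.compile(r'(\w+)="(?:(?:(?!\s*?\|).)*?)"(?!.*\|)')


def mask_after_last_pipe(event: str) -> str:
    """Mask every quoted value after the last pipe (hypothetical helper)."""
    return PATTERN.sub(r'\1="########"', event)


before = (
    'MyEvent | GET | key2="value2",key3="value3" | 1.2.3.4 | '
    'key4="value4",key5="value5" , key6="value6"'
)
print(mask_after_last_pipe(before))
```

Running this prints the AFTER line: key2 and key3 are left alone because a pipe still follows them, while key4, key5 and key6 are masked.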

nishitdarade
Explorer

Thank you for your answer. I will try to implement this approach and will let the group know how it goes. And yes, I want to mask all the key-value pairs after the pipe. Will this mask the data at index time, or at the presentation layer? I am assuming I have to update props.conf in the TA I created to onboard the data.

mtulett_splunk
Splunk Employee

This will mask it at index time, and yes, the local folder of your TA would be the right place to modify props.conf.

somesoni2
Revered Legend

Give this a try

props.conf (on the indexer or heavy forwarder, whichever comes first)

    [yourSourceTypeHere]
    ..other settings..
    SEDCMD-maskkvs = s/(\w+)=\"[^\"]+\"/\1/g
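
Note that, as written, this expression replaces each key="value" pair with the bare key name, and it is not limited to pairs after the last pipe. A quick Python sketch of that behavior, assuming `re.sub` mirrors the sed-style substitution for this pattern (the helper name and sample event are invented for illustration, and the `\"` escapes are written as plain quotes):

```python
import re


def strip_values(event: str) -> str:
    """Replace every key="value" pair with just the key (hypothetical helper)."""
    return re.sub(r'(\w+)="[^"]+"', r'\1', event)


sample = 'MyEvent | key1="value1" | key2="value2",key3=""'
print(strip_values(sample))
```

The pair before the last pipe (key1) is rewritten too, and the empty value on key3 is left untouched because `[^"]+` requires at least one character.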

nishitdarade
Explorer

Thank you for your answer. I will get back to you once I have tried this approach.

somesoni2
Revered Legend

So do you want to mask all the key-value pairs that come after ClientIP, or do you want to retain them and mask all the remaining data?

nishitdarade
Explorer

I want to mask all the key-value pairs after ClientIP. I am sorry, I didn't understand the second part of your question (about retaining them and masking the rest).

cpetterborg
SplunkTrust

If you want to mask the data in all the fields after ClientIP, why not just remove all that data completely from the end of the events? That will save processing and licensing costs. If they won't necessarily be after ClientIP, then that is a different problem, but if all that data is anonymized, there seems to be little reason to even include it in the data that you are indexing.
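
A rough Python sketch of what dropping the trailing data could look like, assuming single-line events where everything after the last pipe is to be removed. The regex, helper name, and sample event are illustrative assumptions, not something taken from Splunk's documentation:

```python
import re


def drop_after_last_pipe(event: str) -> str:
    """Remove everything after the last pipe, keeping the pipe (hypothetical helper)."""
    # [^|]*$ only matches when no further pipe follows, so the substitution
    # happens at the last pipe on the line.
    return re.sub(r'\|[^|]*$', '|', event)


sample = 'Timestamp | Category | ClientIP| LogEvent="Response",FirstName="Sherlock"'
print(drop_after_last_pipe(sample))
```

An event with no pipe at all is returned unchanged.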

nishitdarade
Explorer

Thank you for the reply, @cpetterborg, but the approach is driven by our requirements, which call for masking the data after ClientIP.
