Hi Splunkers,
I am looking for help creating a regular expression to anonymize data using a transform.
Link: https://docs.splunk.com/Documentation/SplunkCloud/6.6.3/Data/Anonymizedata
Current Log format: Timestamp | Category | Machine | ApplicationDomain | ProcessId | ProcessName | ThreadId | LogID | UserName | ActionName | Module | AuthorizationStatus | RequestedBy | RequestingURL | QueryString | HTTPVerb | ClientIP| LogEvent="Response",MethodName="get",ActionResult="Success",ApplicationNumber="1234567890",ApplicationLanguage="1",Section="SUMMARY",FirstName="Shrelock",LastName="Holmes",Gender="M",DateOfBirth="7/19/1976",SocialSecurityNumber="123456789",MaritalStatus="0",RaceInformation="Item8",CitizenshipCode="1",County="20",AddressLine1="221 Baker Street",City="Marylebone",State="London"
I want to write a regular expression that covers all the key-value pairs that come after "ClientIP|" (i.e. LogEvent, MethodName, ApplicationNumber, FirstName, DateOfBirth, SocialSecurityNumber, etc.).
Note: The position of the fields may change at times, but they will always be in key-value pair format (e.g. ,ApplicationNumber="1234567890",ApplicationLanguage="1",Section="SUMMARY",FirstName="Sherlock",LastName="Holmes",Gender="M",DateOfBirth="7/19/1976").
Transforms Example:
[ssn-anonymizer]
REGEX = regex to capture ssn
FORMAT = format to mask entire data
DEST_KEY = _raw
I would really appreciate all the help the community can give.
Thank You,
Nish.
Provided there are no pipe characters inside the key-value pair data, you can do this with SEDCMD. The approach is to look for key-value pairs that have no pipes after them on the current line, then replace those pairs with masked versions. It is a fairly heavy regex, though, so be aware of possible performance issues.
In your input's props.conf stanza you should put:
SEDCMD-maskall = s/(\w+)="(?:(?:(?!\s*?\|).)*?)"(?!.*\|)/\1="########"/g
This will replace the values with eight hashes, and only for the values after the last pipe character. In the example below, only the last three values here would match (value4, value5 and value6), as they're the only key-value pairs after the last pipe:
BEFORE:
MyEvent | GET | key2="value2",key3="value3" | 1.2.3.4 | key4="value4",key5="value5" , key6="value6"
AFTER:
MyEvent | GET | key2="value2",key3="value3" | 1.2.3.4 | key4="########",key5="########" , key6="########"
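To try the expression before putting it in props.conf, here is a minimal Python sketch that applies the same pattern to the BEFORE line above. Python's re module is not the regex engine Splunk uses, so treat this as an approximation for checking the logic. The function name mask_trailing_pairs is made up for illustration.

```python
import re

# The pattern from the SEDCMD above, transcribed unchanged.
# A key="value" pair matches only if no pipe follows it on the line.
MASK_PATTERN = re.compile(r'(\w+)="(?:(?:(?!\s*?\|).)*?)"(?!.*\|)')


def mask_trailing_pairs(event: str) -> str:
    """Replace the value of every pair after the last pipe with eight hashes."""
    return MASK_PATTERN.sub(r'\1="########"', event)


before = ('MyEvent | GET | key2="value2",key3="value3" | 1.2.3.4 | '
          'key4="value4",key5="value5" , key6="value6"')
print(mask_trailing_pairs(before))
```

Running it on a few real events (with the sensitive values swapped for dummies) is a cheap way to confirm that nothing before the last pipe is touched.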
Thank you for your answer. I will try to implement this approach and will let the group know how it goes. And yes, I want to mask all the key-value pairs after the pipe. Will this mask the data at index time, or at the presentation layer? I am assuming I have to update the props in the TA I created to onboard the data.
This will mask it at index time, and yes, the local folder of your TA would be the right place to modify props.conf.
Give this a try
props.conf (indexer or heavy forwarder whichever comes first)
[yourSourceTypeHere]
..other settings..
SEDCMD-maskkvs = s/(\w+)=\"[^\"]+\"/\1/g
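To see what this expression does to an event, here is a rough Python sketch run against the sample line from the earlier answer. It assumes Python's re module behaves the same as Splunk's regex engine for this simple pattern, and strip_values is a hypothetical name. As written, the replacement keeps only the key name and drops the quoted value, and it applies to every pair in the event, not just those after the last pipe.

```python
import re

# Same pattern as the SEDCMD above; \" in the sed expression is a plain quote.
KV_PATTERN = re.compile(r'(\w+)="[^"]+"')


def strip_values(event: str) -> str:
    """Replace each key="value" pair with the bare key, anywhere in the event."""
    return KV_PATTERN.sub(r'\1', event)


sample = ('MyEvent | GET | key2="value2",key3="value3" | 1.2.3.4 | '
          'key4="value4",key5="value5" , key6="value6"')
print(strip_values(sample))
# -> MyEvent | GET | key2,key3 | 1.2.3.4 | key4,key5 , key6
```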
Thank you for your answer. I will get back to you when I try this approach.
So do you want to mask all the key-value pairs that come after ClientIP, or do you want to retain them and mask everything else?
I want to mask all the key-value pairs after ClientIP. I am sorry, I didn't get the second part of your question.
(OR want to retain them and mask all remaining?)
If you want to mask the data in all the fields after ClientIP, why not just remove all that data completely from the end of the events? That will save processing and licensing costs. If they won't necessarily be after ClientIP, then that is a different problem, but if all that data is anonymized, there seems to be little reason to even include it in the data that you are indexing.
Thank you for the reply @cpetterborg, but the approach is based on our requirement, and it requires masking the data after ClientIP.