We all know about this stuff:
https://docs.splunk.com/Documentation/SplunkCloud/latest/Data/Anonymizedata
Let's say I am cleaning up PII, but I need to leave something behind to indicate WHY I told Splunk to do it.
Ideally, I'd like a metadata field (not inside the raw event itself) named RawModReason or something like that.
Then, if I obscured an SSN (SEDCMD-obscureSSN), I would assign a value of obscuredSSN.
Or, if I clipped out a credit card number, I might assign a value of removedCCN.
The idea is to be able to answer the question "Has this event been modified?" (and later, "For what reason?" or "In what way?"), but I'd like it to be hidden enough that an auditor is unlikely to discover the modification if I use |outputcsv or some other export of the data (for an audit, a lawsuit, whatever).
The other dicey part is creating a multi-valued meta field. Is that even possible?
I cannot think of any way to do this flexibly but I am sure that it must be possible. Surely I am not the first person to find myself in this position!
Okay, a two-phase approach should work. The first transform detects that we are going to mask the data, so it creates the meta field and sets the flag. The second transform masks the data.
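# Phase 1: if the event contains a nine-digit SSN, write the flag as an indexed field.
[somename1]
REGEX = .+ssn=\d{5}\d{4}.*
SOURCE_KEY = _raw
FORMAT = _mymvcodes::obscuredSSN
WRITE_META = true

# Phase 2: rewrite _raw, replacing the first five digits and keeping the last four.
[somename2]
REGEX = (.+ssn=)\d{5}(\d{4}.*)
SOURCE_KEY = _raw
FORMAT = $1xxxxx$2
DEST_KEY = _raw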
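To sanity-check the two regexes outside Splunk, here is a minimal Python sketch of the same detect-then-mask order. It is only a simulation: the function name `process` and the dictionary used for the meta field are made up for illustration, and Python's `re` stands in for Splunk's regex engine.

```python
import re

# Hypothetical stand-ins for the two transforms; this is plain Python, not Splunk.
DETECT = re.compile(r".+ssn=\d{5}\d{4}.*")
MASK = re.compile(r"(.+ssn=)\d{5}(\d{4}.*)")


def process(raw):
    """Return (masked_raw, meta) for a single event string."""
    meta = {}
    # Phase 1: flag the event while the original digits are still present.
    if DETECT.search(raw):
        meta.setdefault("_mymvcodes", []).append("obscuredSSN")
    # Phase 2: rewrite the raw text, keeping only the last four digits.
    masked = MASK.sub(r"\1xxxxx\2", raw)
    return masked, meta


if __name__ == "__main__":
    print(process("user=bob ssn=123456789 status=ok"))
    print(process("user=bob status=ok"))
```

The order matters in this sketch: if the mask ran first, the detection regex would no longer find nine digits and the flag would never be set.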
The strategy below will not work. All SEDCMD rules are applied in order in a single step, so the flags would stay in the data.
Very interesting question. The rule runs on every event, but which ones actually obscured anything, and how many items were obscured, would not be immediately apparent.
Here's a strategy. There may be something more direct, but let's pretend for a moment that Splunk only has access to what's in Dal's head.
Our anonymization takes place in three phases.
Phase 1 - anonymize each field and, additionally, insert a marker code next to the masked value.
Phase 2 - extract ALL the marker codes to a multivalue (mv) metadata field.
Phase 3 - delete all the marker codes.
So, for phase 1, instead of this
[source::.../accounts.log]
SEDCMD-accounts = s/ssn=\d{5}(\d{4})/ssn=xxxxx\1/g s/cc=(\d{4}-){3}(\d{4})/cc=xxxx-xxxx-xxxx-\2/g
...you might do this (just add extra codes wrapped as !!##=something##!!)...
[source::.../accounts.log]
SEDCMD-accounts = s/ssn=\d{5}(\d{4})/ssn=!!##=obscuredSSN##!!xxxxx\1/g s/cc=(\d{4}-){3}(\d{4})/cc=!!##=obscuredCCN##!!xxxx-xxxx-xxxx-\2/g
For phase 2, you might have this...
[some name]
REGEX = !!##=(\w+)##!!
SOURCE_KEY = _raw
FORMAT = Masked Type $1
DEST_KEY = _mymvcodes
For Phase 3, you have this
SEDCMD-killem = s/!!##=\w+##!!//g
That should work, assuming the extract can be made to occur between the two SEDCMDs.
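The three phases can be tried out in plain Python to check that the regexes cooperate. This is a rough simulation only, not Splunk configuration: the function name `anonymize` and the returned list are hypothetical, and it says nothing about whether Splunk will actually run the extraction between the two SEDCMDs.

```python
import re

# Phase 1 rules: mask each value and leave a marker naming what was masked.
PHASE1 = [
    (re.compile(r"ssn=\d{5}(\d{4})"),
     r"ssn=!!##=obscuredSSN##!!xxxxx\1"),
    (re.compile(r"cc=(\d{4}-){3}(\d{4})"),
     r"cc=!!##=obscuredCCN##!!xxxx-xxxx-xxxx-\2"),
]

# Marker pattern shared by Phase 2 (extract) and Phase 3 (delete).
MARKER = re.compile(r"!!##=(\w+)##!!")


def anonymize(raw):
    """Return (clean_raw, codes) for a single event string."""
    # Phase 1: mask the values and insert the markers.
    for pattern, replacement in PHASE1:
        raw = pattern.sub(replacement, raw)
    # Phase 2: collect every marker code into a multivalue list.
    codes = MARKER.findall(raw)
    # Phase 3: strip the markers out of the raw text.
    clean = MARKER.sub("", raw)
    return clean, codes


if __name__ == "__main__":
    print(anonymize("ssn=123456789 cc=1111-2222-3333-4444"))
```

Marker names need to consist of word characters only, since `\w+` will not match anything else between the delimiters.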