Is there a way to anonymize data based on a script/function? I want to anonymize data but would like to have a hash that I can still use to run valid reports on it.
To expand on what I would like to have:
Currently Splunk allows me to anonymize data like this: e.g. replace 123456789 with XXXXXX789.
What I would like is something more like: e.g. replace 123456789 with the result of a function, md5(123456789)=jf430fj490fj4
This would guarantee anonymity and uniqueness for reporting.
Hmm...I don't think that is something you can do natively in Splunk. The anonymize data function is limited to replacement/character substitution through either SED or REGEX.
The closest 3rd party app I see uploaded is:
http://splunk-base.splunk.com/apps/22403/adds-support-for-anonymizing-log-files-at-index-time , but I think that it's probably just character substitutions based on common fields found in data.
It sounds like you actually want to randomize the data with a hash or some kind of seed so that it's completely unique.
I think the best bet for this would be a custom Python command that accepts the raw data, applies a specific function, and then spits out a new field based on logic external to Splunk.
Here is the link to the Splunk doc on this:
http://docs.splunk.com/Documentation/Splunk/4.3/SearchReference/WriteaPythonsearchcommand#Examples
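As a rough illustration of the hashing step such a custom command would perform, here is a minimal sketch in plain Python. It only shows the replacement logic; wiring it into Splunk as a search command is not shown. The names ID_PATTERN, md5_token and anonymize_event are hypothetical, and the pattern assumes the sensitive values are standalone 9-digit numbers like the one in the question.

```python
import hashlib
import re

# Assumption: the values to anonymize are standalone runs of exactly 9 digits.
ID_PATTERN = re.compile(r"\b\d{9}\b")


def md5_token(value: str) -> str:
    """Return the hex MD5 digest of value; equal inputs give equal tokens."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def anonymize_event(raw: str) -> str:
    """Replace every match of ID_PATTERN in a raw event with its MD5 token."""
    return ID_PATTERN.sub(lambda m: md5_token(m.group(0)), raw)


if __name__ == "__main__":
    # The same number always maps to the same token, so counts and
    # group-by style reports still line up after replacement.
    print(anonymize_event("user=123456789 action=login"))
```

Because the token is deterministic, two events with the same original value still end up with the same replacement, which is what keeps reporting on the anonymized field meaningful.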
Thanks, that is what I guessed. I've opened an enhancement request for this.
I disagree that it would "guarantee" anonymity. Uniqueness, perhaps (as long as you don't manage to create a hash collision), but anonymity? It's just a matter of finding the correct string that produces the given MD5 sum, and when the original values come from a small set, such as 9-digit numbers, every candidate can simply be tried. The masking approach taken by default in Splunk, on the other hand, alters the string in a way that guarantees that the original data cannot be recreated.
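To make the point about recovering the original string concrete, here is a small sketch, assuming the hashed values are fixed-length numeric IDs and the hash is a plain, unkeyed MD5. The function names md5_hex and recover_numeric_id are made up for this example, and the demo uses 4 digits only so that it finishes instantly.

```python
import hashlib
from typing import Optional


def md5_hex(value: str) -> str:
    """Return the hex MD5 digest of value."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def recover_numeric_id(target_digest: str, digits: int) -> Optional[str]:
    """Try every zero-padded number of the given length.

    Returns the candidate whose MD5 digest equals target_digest,
    or None if no number of that length matches.
    """
    for n in range(10 ** digits):
        candidate = str(n).zfill(digits)
        if md5_hex(candidate) == target_digest:
            return candidate
    return None


if __name__ == "__main__":
    digest = md5_hex("4711")  # what would be stored in the index
    print(recover_numeric_id(digest, 4))  # walks the 10,000 candidates
```

For 9-digit values the loop has at most one billion candidates to try, which is slow in pure Python but still a finite, enumerable search.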
I saw that document. But that document performs a general replacement of characters.
e.g. replace 123456789 with XXXXXX789.
What I would like is something more like:
e.g. replace 123456789 with md5(123456789)=jf430fj490fj4
This would guarantee anonymity and uniqueness for reporting.