Solved: Anonymous based on scripting

ruisantos · ‎01-30-2012

Is there a way to anonymize data based on a script or function? I want to anonymize data, but I would like to have a hash that I can use to run valid reports on it.

To expand on what I would like to have:

Currently Splunk allows me to anonymize data like this, e.g. replace 123456789 with XXXXXX789.

What I would like is something more like this, e.g. replace 123456789 with the result of a function: md5(123456789)=jf430fj490fj4

This would guarantee anonymity and uniqueness for reporting.

Kate_Lawrence-G · ‎01-31-2012

Hmm...I don't think that is something you can do natively in Splunk. The anonymize data function is limited to replacement/character substitution through either SED or REGEX.

The closest 3rd party app I see uploaded is:
http://splunk-base.splunk.com/apps/22403/adds-support-for-anonymizing-log-files-at-index-time , but I think that it's probably just character substitutions based on common fields found in the data.

It sounds like you actually want to randomize the data with a hash or some kind of seed so that it's completely unique.

I think the best bet for this would be a custom Python command that accepts the raw data, applies a specific function, and then outputs a new field based on logic external to Splunk.

Here is the link to the Splunk doc on this:

http://docs.splunk.com/Documentation/Splunk/4.3/SearchReference/WriteaPythonsearchcommand#Examples
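
For illustration, here is a minimal sketch of the hashing logic such a custom command might apply. It only shows the replacement step in plain Python. How it gets wired into a Splunk search command is not shown here and would follow the doc linked above. The 9-digit pattern and the function names are assumptions made for this example, not anything Splunk provides.

```python
import hashlib
import re

# Assumption for this example: the sensitive value is any standalone run of 9 digits.
SENSITIVE = re.compile(r"\b\d{9}\b")


def md5_hex(value: str) -> str:
    """Return the MD5 digest of a string as 32 hex characters."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def anonymize(raw: str) -> str:
    """Replace every sensitive value in a raw event with its MD5 digest."""
    return SENSITIVE.sub(lambda m: md5_hex(m.group(0)), raw)


if __name__ == "__main__":
    # The same input always maps to the same digest, so counts and
    # group-by style reports still line up after replacement.
    print(anonymize("user=123456789 action=login"))
    print(anonymize("user=123456789 action=logout"))
```

Because the same input always gives the same digest, events for one user can still be grouped together in reports without the original number appearing in the output.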

ruisantos · ‎01-31-2012

Thanks, that is what I guessed. I've opened an enhancement request for this.

Kate_Lawrence-G · ‎01-30-2012

Yes you can: http://docs.splunk.com/Documentation/Splunk/latest/Data/Anonymizedatausingconfigurationfiles

Ayn · ‎01-31-2012

I disagree that it would "guarantee" anonymity. Uniqueness, perhaps (as long as you don't manage to create a hash collision), but anonymity? It's just a matter of finding the correct string that produces the given MD5 sum. The masking approach taken by default in Splunk, on the other hand, alters the string in a way that guarantees that the original data cannot be recreated.
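
To make that point concrete, here is a rough sketch of how a short numeric value could be recovered from its unsalted MD5 digest by simply trying every candidate. The function names and the fixed-width, digits-only assumption are made up for this example. It is only meant to show why a small input space is a problem.

```python
import hashlib
from typing import Optional


def md5_hex(value: str) -> str:
    """Return the MD5 digest of a string as 32 hex characters."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def recover_digits(target_hash: str, width: int) -> Optional[str]:
    """Try every zero-padded number of the given width until one matches.

    Assumes the original value is exactly `width` decimal digits.
    Returns the matching value, or None if nothing in that space matches.
    """
    for n in range(10 ** width):
        candidate = str(n).zfill(width)
        if md5_hex(candidate) == target_hash:
            return candidate
    return None


if __name__ == "__main__":
    # Small width so the demo finishes quickly; the loop is the same for 9 digits.
    secret = "0427"
    print(recover_digits(md5_hex(secret), width=4) == secret)
```

For a 9-digit value the loop has at most 10^9 candidates, which is slow in pure Python but well within reach of ordinary hardware, so the digest alone should not be treated as hiding the original number.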

ruisantos · ‎01-31-2012

I saw that document, but it performs a general replacement of characters,

e.g. replace 123456789 with XXXXXX789.

What I would like is something more like this:

e.g. replace 123456789 with md5(123456789)=jf430fj490fj4

This would guarantee anonymity and uniqueness for reporting.
