Anonymous based on scripting

ruisantos
Path Finder

Is there a way to anonymize data based on a script or function? I want to anonymize data, but I would like to end up with a hash that I can still use to run valid reports.

To expand on what I would like to have:

Currently Splunk allows me to anonymize data like this: e.g. replace 123456789 with XXXXXX789.

What I would like is something more like this: e.g. replace 123456789 with the result of a function, such as md5(123456789) = jf430fj490fj4.

This would guarantee anonymity and uniqueness for reporting.


Kate_Lawrence-G
Contributor

Hmm... I don't think that is something you can do natively in Splunk. The anonymize data function is limited to replacement/character substitution through either sed expressions or regular expressions.

The closest third-party app I see uploaded is:
http://splunk-base.splunk.com/apps/22403/adds-support-for-anonymizing-log-files-at-index-time , but I think it probably just does character substitutions based on common fields found in the data.

It sounds like you actually want to randomize the data with a hash or some kind of seed so that it's completely unique.

I think the best bet for this would be a custom Python command that accepts the raw data, applies a specific function to it, and then emits a new field based on logic external to Splunk.

Here is the link to the Splunk doc on this:
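To make that idea concrete, here is a minimal sketch of the kind of function such a custom command could call. It only shows the hashing step in plain Python and does not show how to wire it into Splunk as a search command. The function name `pseudonymize` and the nine-digit pattern are assumptions chosen to match the example in the question, not anything Splunk provides.

```python
import hashlib
import re

# Assumed pattern for illustration: any standalone run of exactly nine digits.
NINE_DIGITS = re.compile(r"\b\d{9}\b")


def pseudonymize(raw_event: str) -> str:
    """Replace each nine-digit number with the MD5 hex digest of that number."""
    def _digest(match: "re.Match[str]") -> str:
        return hashlib.md5(match.group(0).encode("utf-8")).hexdigest()

    return NINE_DIGITS.sub(_digest, raw_event)


if __name__ == "__main__":
    # The same input value always yields the same digest, so reports can
    # still group and count by the replaced field.
    print(pseudonymize("user=123456789 action=login"))
```

Because the digest is deterministic, two events with the same original value still match each other after replacement, which is the property the question asks for.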

http://docs.splunk.com/Documentation/Splunk/4.3/SearchReference/WriteaPythonsearchcommand#Examples

ruisantos
Path Finder

Thanks, that is what I guessed. I've opened an enhancement request for this.

0 Karma

Ayn
Legend

I disagree that it would "guarantee" anonymity. Uniqueness, perhaps (as long as you don't manage to create a hash collision), but anonymity? It's just a matter of finding the string that produces the given MD5 sum, and when the inputs come from a small space such as nine-digit numbers, every candidate can simply be tried in turn. The masking approach taken by default in Splunk, on the other hand, alters the string in a way that guarantees that the original data cannot be recreated.
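A small sketch of that concern, under stated assumptions: the first function enumerates every number of a given length until its MD5 matches, which is feasible for short numeric identifiers. The second shows one common mitigation, a keyed hash (HMAC), which is not something proposed elsewhere in this thread and which still depends on keeping the key secret. Both function names are hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def recover_by_brute_force(target_md5: str, digits: int) -> Optional[str]:
    """Try every zero-padded number of the given length until the MD5 matches."""
    for n in range(10 ** digits):
        candidate = str(n).zfill(digits)
        if hashlib.md5(candidate.encode("utf-8")).hexdigest() == target_md5:
            return candidate
    return None


def keyed_pseudonym(value: str, secret_key: bytes) -> str:
    """HMAC-SHA256 of the value; enumerating inputs is useless without the key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # Four digits keeps the demo fast; nine digits is the same loop, just longer.
    leaked = hashlib.md5(b"4821").hexdigest()
    print(recover_by_brute_force(leaked, 4))
    print(keyed_pseudonym("123456789", b"example-key-keep-this-secret"))
```

The keyed version is still deterministic for a fixed key, so it keeps the uniqueness needed for reporting. Anyone who holds the key can run the same enumeration, so this is pseudonymization rather than the irreversible masking described above.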

0 Karma

ruisantos
Path Finder

I saw that document, but it performs a general replacement of characters,

e.g. replace 123456789 with XXXXXX789.

What I would like is something more like this:

e.g. replace 123456789 with md5(123456789) = jf430fj490fj4

This would guarantee anonymity and uniqueness for reporting.

0 Karma