I want to anonymize user data (for example, email addresses) at search time and have tried a couple of approaches.
First I tried the rex command
rex mode=sed "s/(\w+.?\w+?)@example.com/xxxxx@example.com/g"
This works, but it does not modify the raw event at search time. As a result, a user who selects "show source" can see the original mail address again, and any defined field also shows the original mail address.
The other problem is that all the reports are boring, because all our internal mail addresses are replaced with xxxxx.
I'm looking for a way to replace the username part of the mail address with a hash of the username, but could not find anything like this. I also saw a solution on Splunkbase that does DES or 3DES encryption of a specific field (http://splunk-base.splunk.com/apps/22393/encrypt-and-decrypt-data-within-events), but this will not work in my environment: all events come in from forwarders or by syslog, and I'm not allowed to install such functions on the forwarders because of performance concerns.
In version 4.2 I found a new command, mappy, which allows running short Python scripts, but it looks like it does not support all Python modules and options. I tried to use mappy with the Python function re.sub, but could not find a working one-line command that replaces the string extracted by the rex with its hash code.
Has anyone found a way to anonymize user data in Splunk with hash codes or something similar?
Splunk does not have a feature to modify the value of _raw (the text of the event) at search time in a way that guarantees users can never get access to the original value. For convenience, you could try creating a calculated field named _raw that replaces the original _raw, but that won't prevent users from getting the original value of _raw by, for example, overriding your rule.
If you need to anonymize data that users will see, you need to put an anonymized version of the data in an index they have access to, and not give them access to an index containing the non-anonymized data. Avoiding putting it in ANY index would do the job, as would putting it in an index that their role is not permitted access to.
In order to anonymize data at input time, you can use a traditional regex transform or a SEDCMD. If you want to create an anonymized version of the data at a later point, you can try to get summary indexing to do this for you -- producing modified data to go into an alternate index -- but it's somewhat fragile and I don't recommend it.
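As a rough illustration of the SEDCMD approach, a props.conf stanza might look like the sketch below. The stanza name my_sourcetype, the class name mask_email, and the example.com domain are placeholders I made up, not values from this thread, so adjust them to your own source type and domain. The stanza has to live on the instance that parses the data (an indexer or heavy forwarder), and it only affects events indexed after the change takes effect; data already in the index is not rewritten.

```
# props.conf (on the parsing tier: indexer or heavy forwarder)
# "my_sourcetype" and "example.com" are hypothetical placeholders.
[my_sourcetype]
# Replace the local part of every example.com address with xxxxx
# before the event is written to the index.
SEDCMD-mask_email = s/[A-Za-z0-9._%+-]+@example\.com/xxxxx@example.com/g
```

Note that this masks with a fixed string rather than a hash, so it has the same "every address looks alike" drawback described in the question.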
Great response. I am new to Splunk, so I'm not sure how to go about creating a new index for the purposes of anonymizing. I am using the props.conf method at the moment, with a source type and a sed command to replace the data, but I can't seem to get it to work. I followed the KB article below and have done a rolling restart on all the indexers, but the data is still not masked.
This wasn't an answer to the question, so I moved to a comment.
However, it's really an independent question. I tried reopening it as such, but it didn't work.
I suggest you do the following:
Sorry, I didn't mean to add it as an answer. I thought that if I could resolve my issue, it might help the original poster anonymise before indexing, so he wouldn't need to worry about doing it at search time.
I have already created a separate post but haven't had any responses which is why I posted here.
Sorry for posting this as an answer again but I don't see how to add as a comment.
Well, I misunderstood, because you linked to information about how to anonymize data at input/index time. If you want to anonymize data at search time, you already have the answer: you basically can't.
+1 @jrodman for suggesting the trick of replacing _raw. It does not mask the data in the source data, but it does help mask the data in the raw events shown in search output. For example, to mask the value of the field "value", you can mask it both in the raw events and in the extracted/selected fields, which are the most visible views.
.... | eval _raw=replace(_raw,"(value=)[0-9]*","\1xxxx") | eval value=replace(value,".*","xxxx") ...
Please understand and be aware - any technique you use to "mask" or "anonymize" data at search time is flawed and easily defeated. As long as users have access to be able to run an ad-hoc search, then they will be able to find a way around your attempts to anonymize data at search time.
When @jrodman said "you can't anonymize at search time", he meant that there's no way to make a search-time anonymizer that is robust enough to prevent people going around it.
Making a claim that you have "protected data" in this way is perhaps duplicitous. I would not recommend it.