Anonymize data and keep original aside

eregon · 05-22-2020

Good afternoon fellow splunkthiasts, I need your help with data anonymization.

Situation: An application on a server with UFW produces a log. Most of it is boring operational stuff, but certain records contain a field considered sensitive. The log records are needed by ordinary Ops admins (who need to see all records, but not the actual sensitive field value) and by privileged troubleshooters, who need to see the sensitive data too.

Architecture: data is produced on a server with UFW and will be stored on an indexer cluster. There is one heavy forwarder (HFW) available in my deployment.

Limitations:
1. Due to limited bandwidth between UFW and the Splunk servers, it is preferred not to increase the volume of data transferred from UFW (bandwidth between HFW and the indexers is fine).
2. Due to the time-constrained validity of the sensitive field, delays introduced by a search->modify->index-again cycle every few minutes are not acceptable.
3. Indexing the sensitive records twice is OK. Indexing the whole log twice would be too expensive.

Proposed solution: UFW will forward the log to the heavy forwarder, where it should be duplicated. One copy of the data should be anonymized and forwarded to index "operational", while the other should be filtered (only records with the sensitive field are kept) and then forwarded to index "sensitive".

Problem: I know how to route data, how to anonymize data, and how to filter data before routing, but I am not sure how to connect the dots in the described manner. To be specific, I don't know how to duplicate the data on HFW and make sure each copy is treated differently.

Can you help, or possibly propose a better solution?

rnowitzki · 06-04-2020

I don't know if this is possible (budget, additional infrastructure) but this is a perfect use case for Cribl.

It would encrypt the sensitive data, and only the encrypted form is sent to Splunk. The Cribl App for Splunk gives you a new custom command, "decrypt", which has access to the encryption keys, so you can decrypt the data again. With roles/capabilities mapping you can give only certain roles the ability to decrypt.

https://cribl.io/blog/encrypting-sensitive-information-in-real-time-with-cribl/

--
Karma and/or Solution tagging appreciated.

jianw223 · 10-16-2020

This is an endorsement by a Cribl employee. As a previous user of Cribl, I would not recommend it. It is considerably slower and buggy.

rnowitzki · 12-01-2020

Hi @jianw223 ,

I am not a Cribl employee, nor am I related to Cribl in any other way. I'm just a happy Cribl user who set it up for the exact same use case for a customer. It was not slow and not buggy in my experience.

BR
Ralph


lloydknight · 05-22-2020

Hello @eregon

This is my first time encountering the word splunkthiasts. 🙂

My answer requires you to hop through several links, by the way.

The link below covers a question similar to yours, where the main requirement is to anonymize the data while keeping it available in non-anonymized form for a different use:
https://answers.splunk.com/answers/824299/anonymize-data-from-json-file.html

I'd also suggest checking the link below. The common perception is that somewhat-replicated data incurs double license usage:
https://answers.splunk.com/answers/690291/one-source-to-two-indexes.html

Similarly, you can actually achieve the "one data source (anonymized and non-anonymized) to two indexes" solution without incurring double license usage
(check woodcock's answer at the link below):
https://answers.splunk.com/answers/567223/how-to-send-same-data-source-to-two-or-multiple-in-1.html

Hope it helps!

eregon · 06-04-2020

Hello @lloydknight, and thank you for the links! I had found many others, but these are really close to what I want and I had missed them. They don't fully solve my problem, but they gave me an idea of what to try next. Let me explain:

What I want is almost the same thing your first link addresses. However, I need to transform both of my "streams" (one should be anonymized, the other should be filtered) due to the limitations described in my original post.

Example: the source log is something like this:

Boring line 1
Boring line 2
Interesting line with sensitive contents 3
Boring line 4
Interesting line with sensitive contents 5

As a result, the "operational" index should contain:

Boring line 1
Boring line 2
Interesting line with XXXXX 3
Boring line 4
Interesting line with XXXXX 5

while the "sensitive" index should have:

Interesting line with sensitive contents 3
Interesting line with sensitive contents 5
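
The split shown above can be sketched in a few lines of Python. This is only a model of the desired end result, not Splunk configuration: the regex and the function name split_streams are hypothetical stand-ins for whatever pattern actually identifies the sensitive field.

```python
import re

# Hypothetical pattern: stands in for the regex that matches the sensitive field.
SENSITIVE = re.compile(r"sensitive contents")


def split_streams(lines):
    """Build the "operational" and "sensitive" streams in one pass over the log."""
    operational, sensitive = [], []
    for line in lines:
        if SENSITIVE.search(line):
            # Keep the untouched record for privileged troubleshooters...
            sensitive.append(line)
            # ...and a masked copy for ordinary Ops admins.
            operational.append(SENSITIVE.sub("XXXXX", line))
        else:
            # Boring records go to the operational stream only.
            operational.append(line)
    return operational, sensitive


log = [
    "Boring line 1",
    "Boring line 2",
    "Interesting line with sensitive contents 3",
    "Boring line 4",
    "Interesting line with sensitive contents 5",
]
operational, sensitive = split_streams(log)
```

Only the records that match are duplicated, which is in line with limitation 3 (sensitive records indexed twice, the rest once).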

The problem is that I don't know how to duplicate the data and transform each copy in its own way before routing it to one index or the other, with all of this done on HFW. Judging by your links, it is not possible in one step.

I'll rethink my options and post an update soon.
