Hi,
I have a requirement to mask any sensitive data, such as credit card numbers or Social Security Numbers, that might be ingested into Splunk. I can write the props to handle data masking, but the challenge is that I do not know where or if the sensitive data will appear. Although the data we currently have doesn't contain any sensitive information, compliance mandates require us to implement controls that detect and mask such data before it is ingested into Splunk. Essentially, the props need to be dynamic. Is there a way to achieve this?
Thanks.
Apart from what @gcusello and @PickleRick have said (which I agree with), this "problem" should be fixed at the source. Your organisation should fully justify why sensitive data such as credit card numbers and social security numbers are being written unobfuscated to logs for Splunk to ingest in the first place. Ideally, all such data should be stored in a secure place, retrievable only by a few trusted people who access it for legitimate reasons. Anything else is potentially a violation of your customers' privacy rights.
OK. Several things here.
1. For a question starting with "what is the best way", especially if no boundary conditions are given, the answer is usually "it depends".
2. In my experience, the worse the problem definition, the less reliable the outcome. I've dealt with customers who wanted something just "configured so it works" (not necessarily Splunk, just the general idea) and the result was usually less than stellar.
Your problem is rooted in compliance, but it's equally common in the DLP area - "just find something; we don't know what, where, or even whether it exists, but we want you to find it".
Some types of identifiers you can distinguish because they come in a particular format _and_ they have some internal integrity you can verify (IBAN numbers, for example, have check digits). Others do not, and you face a fat chance of either false positives or false negatives, depending on how creative you get at catching - for example - all the possible ways of writing a phone number.
And don't even get me started on trying to find names or addresses.
Of course, you can try to "use AI" to guess what and where constitutes sensitive data, but this will only add another layer to an already excruciating headache. Even a human with a relatively good understanding of the context would make mistakes here now and then.
So even without getting into the gory technical details of how to implement such stuff with/around Splunk, I'd say that if you want to do something like that without proper data classification and well-defined things to filter/mask, you're in for a treat - a never-ending project of tweaking your detection engine and dealing with stakeholders' complaints about false positives and negatives.
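To make the false-positive point concrete: a purely format-based check is trivial to write but matches far more than real card numbers. A rough SPL sketch (the index name temp_onboarding is just a placeholder for wherever you stage data for review):

index=temp_onboarding
| regex _raw="\b\d{13,16}\b"
| stats count by sourcetype, source

Every 13-16 digit run - timestamps, order IDs, session tokens - matches, which is exactly where the false positives come from. An integrity check such as the Luhn checksum for card numbers trims the noise, but it can't be expressed as a simple regex, and many identifiers have no check digit at all.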
It's exactly like @PickleRick said. There is no way to ensure that you can do this all in one phase without data leaks, especially when you don't know where and what kind of data you will receive from the source systems.
The only way to get this working even somewhat is a strict data onboarding and change management process, with a separate dev/test environment where all data and changes are integrated first. Then you need some way to analyze that data and ensure that there haven't been any leaks containing e.g. SSN, IBAN, or other sensitive PII. And as was said, the format options for those are practically countless (believe it or not, you will see). Only after you are absolutely sure that your masking etc. is working can you do the production integration. But sooner or later someone will make some "emergency" change or similar, and then you have those events in your production 😞
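For that analysis step in the dev/test environment, a review search along these lines could be a starting point (a sketch only; the index name pii_test and the patterns are placeholders you would have to adapt and extend):

index=pii_test earliest=-24h
| rex field=_raw "(?<maybe_ssn>\b\d{3}-\d{2}-\d{4}\b)"
| rex field=_raw "(?<maybe_card>\b(?:\d[ -]?){15,16}\b)"
| where isnotnull(maybe_ssn) OR isnotnull(maybe_card)
| stats count by sourcetype, source

Anything it returns points you to the sources and sourcetypes that still need masking rules, and you will keep adding new patterns for a long time.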
Once that has happened, you have unmasked data in your Splunk indexes, and there is no way to mask it at search time so that no one can see it. Even using the delete command is not enough, as the events are still in the buckets, and anyone with access to the storage layer can get the data out of there. The only way is to delete the index, make sure the storage is overwritten enough times, and then reingest the data.
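In practice that cleanup looks roughly like this (a sketch; netdata stands in for whichever index is affected, and note that clean eventdata only removes the buckets - securely overwriting the underlying storage is a separate exercise at the storage layer):

$SPLUNK_HOME/bin/splunk stop
$SPLUNK_HOME/bin/splunk clean eventdata -index netdata
$SPLUNK_HOME/bin/splunk start

On an indexer cluster this is even more involved, which is one more reason to catch leaks before production.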
r. Ismo
Hi @Richy_s ,
to mask sensitive data, you can follow the instructions at https://docs.splunk.com/Documentation/Splunk/9.4.0/Data/Anonymizedata
The main issue, if I correctly understood, is to identify PII and sensitive information in your data.
The best approach, in my experience, is to ingest the data into a temporary index (so you can delete it when you finish the analysis), identify all the sensitive data and the regexes to match it, and then apply these filters using the approach in the link above.
I don't understand what you mean when you say "the props need to be dynamic": filter rules must be defined and applied; if you identify new patterns, you have to add new rules.
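For example, once you know that a given sourcetype can carry card numbers or SSNs, a masking rule in props.conf on your indexers or heavy forwarders could look like this (a sketch based on the documentation above; the sourcetype name my_network_logs and the field formats are only examples to adapt to your data):

[my_network_logs]
# keep only the last four digits of anything written as ssn=<9 digits>
SEDCMD-mask_ssn = s/ssn=\d{5}(\d{4})/ssn=xxxxx\1/g
# same idea for card numbers written as cc=1234-5678-9012-3456
SEDCMD-mask_cc = s/cc=(\d{4}-){3}(\d{4})/cc=xxxx-xxxx-xxxx-\2/g

These rules run at parse time, so they only protect data that matches a pattern you have already defined - which is exactly why they cannot be "dynamic".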
Ciao.
Giuseppe
I completely agree with what you've all stated @gcusello @isoutamo @ITWhisperer @PickleRick, and I'm on the same page. However, as you know, compliance principles operate on the premise that whether an issue is present or not, it's best to assume it is and address it accordingly.
In my situation, we mainly deal with network-related data where the likelihood of finding PII is very low. Nonetheless, as a security requirement, we want to establish controls that ensure any sensitive information, if present, is masked.
Hi @Richy_s ,
as I said (and I say this also in line with my second role in my Company: privacy and ISO 27001 Lead Auditor!), the only way to mask PII is to analyze your new data in a temporary index and derive a list of controls (regexes) from it.
Then you can implement these rules in props and transforms, as described in the documentation linked above.
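A sketch of the props/transforms variant from that documentation, again with placeholder names (my_network_logs and ssn_anonymizer are examples):

props.conf:
[my_network_logs]
TRANSFORMS-anonymize = ssn_anonymizer

transforms.conf:
[ssn_anonymizer]
REGEX = (?m)^(.*)ssn=\d{5}(\d{4}.*)$
FORMAT = $1ssn=xxxxx$2
DEST_KEY = _raw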
Then you can set up an alert, running e.g. once a day, that applies the same controls to all the data indexed during that day.
If the alert finds something, it means you have to extend your checks to cover that data as well.
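As a sketch, such a daily check could be a scheduled alert defined in savedsearches.conf (the index name and patterns are placeholders; the alert fires whenever at least one event matches):

[Daily PII leak check]
enableSched = 1
cron_schedule = 0 6 * * *
dispatch.earliest_time = -1d@d
dispatch.latest_time = @d
search = index=netdata | regex _raw="\b\d{3}-\d{2}-\d{4}\b|\b\d{13,16}\b" | stats count by sourcetype, source
counttype = number of events
relation = greater than
quantity = 0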
It isn't possible to run these controls before indexing, because Splunk searches run on indexed data; the only other solution would be to ingest everything into a temporary index first, check and mask it there, and then reingest the clean data into the final index.
The only issue is that, in this way, you duplicate the license consumption!
Ciao.
Giuseppe
OK. Let me rephrase it.
This is a typical attempt to "fix" policy issues with technical means.
Without _knowing_ where the PII is, you're doomed to guess. And guessing is never accurate.
BTDTGTT