How can I anonymize fields of data that has undergone indexed extractions?

jeffland
Champion

I'm on a standalone Splunk environment. I've got some .csv files, and I'd like to use indexed extractions for them as well as pseudo-/anonymize the field "Meldender" contained in them (either via SEDCMD or with a transform).

As I understand it (based on the detailed diagram here), indexed extractions take place before a SEDCMD or transform is applied. So if I simply use INDEXED_EXTRACTIONS = csv together with a SEDCMD, the raw event is masked, but the field, which was extracted before the SEDCMD ran, still contains the original unmasked data.

As I would prefer not to change to search-time field extraction, I would like to change the indexed field with a transform as well. I thought this could be done with a simple props and transforms, so I tried the following. Here is my props.conf:

[stoer_csv_meta]
FIELD_NAMES = ...
KV_MODE = none
NO_BINARY_CHECK = true
PREAMBLE_REGEX = ...
SHOULD_LINEMERGE = false
TIMESTAMP_FIELDS = ...
INDEXED_EXTRACTIONS = csv
SEDCMD-meldender = a working sed
TRANSFORMS-meld = meld

and here my transforms.conf:

[meld]
REGEX = (.{0,2}).*?(.?)
FORMAT = Meldender::$1-X-$2
WRITE_META = true
SOURCE_KEY = field:Meldender

[accepted_keys]
Meld = Meldender

This made the field "Meldender" multivalued, containing the original data as the first entry and my masked data second. I had a look at the actual content of _meta, and indeed it seems the above settings just add the entry "Meldender::masked data".
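
To illustrate why the field ends up multivalued, here is a rough Python model of the behaviour described above. It is only a sketch: the sample value "Mustermann", the field "Datum", the layout of the _meta string and the helper apply_transform are all made up for illustration, and Python's re module stands in for Splunk's regex engine. One side observation: because the lazy .*? is not followed by an anchor, it matches nothing, so $2 ends up being the third character of the value rather than the last one.

```python
import re

# REGEX from the [meld] stanza; everything else here is hypothetical.
REGEX = re.compile(r"(.{0,2}).*?(.?)")

def apply_transform(value: str) -> str:
    """Build the entry that FORMAT = Meldender::$1-X-$2 would produce."""
    m = REGEX.search(value)
    return f"Meldender::{m.group(1)}-X-{m.group(2)}"

# Assumed shape of _meta after indexed extraction: space-separated key::value pairs.
meta = "Datum::2018-02-24 Meldender::Mustermann"

# WRITE_META = true appends the new entry; the original one stays in place.
meta_after = meta + " " + apply_transform("Mustermann")

print(meta_after)
# Datum::2018-02-24 Meldender::Mustermann Meldender::Mu-X-s
```

With two Meldender:: entries in _meta, the field shows up with two values at search time, which matches what I observed.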

I also tried setting WRITE_META to false and using DEST_KEY = _meta, once with $0 added to my FORMAT in order to keep the existing metadata and once without it. With $0, the field "Meldender" did not change (it still contained the unmasked data); without it, all fields except "Meldender" were erased, just as you would expect. So none of these three attempts solved the problem.

I believe this question is in the same vein as this or this one, which didn't get any satisfying answers so far. I have one ugly solution coming up, but please share your thoughts!

1 Solution

Dan
Splunk Employee

I would also try (not tested):

[meld]
REGEX = (.{0,2}).*?(.?)
FORMAT = Meldender::$1-X-$2
WRITE_META = true
SOURCE_KEY = field:Meldender
DEST_KEY = field:Meldender

[accepted_keys]
is_valid = field:Meldender

delink
Communicator

I downvoted this post because the answer does not solve the problem. The accepted answer should be moved to jeffland's below.

jeffland
Champion

This is actually a really good solution. It didn't occur to me that you can access fields with field:field_name both as SOURCE_KEY and DEST_KEY at the time I wrote the question, but this should work well.

Redman11
Explorer

@jeffland @Dan Did you get this to work? I've the exact same problem, but following the approach above I always end up with a multi-value field containing the original field value and the replacement value specified by the FORMAT command. If I put FORMAT = $1, then I just get the original value. I can successfully use a SEDCMD to remove the value from _raw, so that part of the problem is fixed, but I'm struggling with the indexed field. This seems to be much harder to do than it should be! Any help would be much appreciated! I'm using 7.0.2. What I'd ideally like to do is drop the field completely, but I'm happy if I can at least mask it.

jeffland
Champion

The above worked for me (accessing the field in metadata with field:name and applying REGEX and FORMAT to it).
If you get a multi-valued field, you probably have search-time extraction (KV_MODE) active alongside INDEXED_EXTRACTIONS in your sourcetype. Make sure KV_MODE = none is set to avoid search-time field extraction.
If you want to remove a field from indexed fields, you'll have to re-write the metadata information like this:

REGEX = (?m)^(.*)<your_field_name>\:\:<regex matching your field values>(.*)$
FORMAT = $1$2
WRITE_META = false
SOURCE_KEY = _meta
DEST_KEY = _meta
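
One way to sanity-check such a regex before putting it into transforms.conf is to try the rewrite in Python first. This is only a sketch: the field name secret_field, the value pattern \S+ and the sample _meta string are assumptions, and Python's re module is not identical to the regex engine Splunk uses.

```python
import re

# Hypothetical field name and value pattern substituted into the template above.
pattern = re.compile(r"(?m)^(.*)secret_field\:\:\S+(.*)$")

# Assumed layout of _meta: space-separated key::value pairs (made-up fields).
meta = "host_id::42 secret_field::07123456789 flag::N"

# FORMAT = $1$2 keeps everything before and after the matched entry.
rewritten = pattern.sub(r"\g<1>\g<2>", meta)

print(repr(rewritten))
# 'host_id::42  flag::N'
```

The separators on both sides of the removed entry are kept, so two spaces remain where the entry used to be.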

You can probably optimize that regular expression. If you're okay with just replacing the value of your field, this should be faster:

REGEX = .
FORMAT = <your_field_name>::-
WRITE_META = true
SOURCE_KEY = field:<your_field_name>
DEST_KEY = field:<your_field_name>
[accepted_keys]
is_valid = field:<your_field_name>

If that doesn't work, you should probably ask a new question with more details about your settings. Feel free to tag me in it.

Redman11
Explorer

@jeffland Thank you very much for taking the time to answer this, it's really appreciated. The first approach works - I end up with the field missing from both the indexed fields and the _raw which is exactly what I need, so thanks for that. The second method doesn't work though. I get the original value of the field as a single-value. Here is the props.conf for that test:

[MySourceType]
DATETIME_CONFIG = 
INDEXED_EXTRACTIONS = TSV
KV_MODE = none
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = false
TIMESTAMP_FIELDS = DateTime
TIME_FORMAT = %Y-%m-%d %H:%M:%S
category = Structured
disabled = false
pulldown_type = true
TRANSFORMS-dropMaskedCLI = dropMaskedCLI
SEDCMD-maskedCLI = s/^([\S ]+\t)(\S+\t)(\S+\t)(.+)/\1\2\4/

The transforms.conf:

[dropMaskedCLI]
REGEX = .
SOURCE_KEY = field:Masked_CLI
DEST_KEY = field:Masked_CLI
FORMAT = Masked_CLI::-
WRITE_META = true

[accepted_keys]
is_valid = field:Masked_CLI

I'm assuming you meant the hyphen in the FORMAT command as a string literal in your example? The data I'm loading looks like this (tab-separated):

DateTime             DialedNumber  Masked_CLI   WithheldFlag
2018-02-24 00:00:02  4789226712    07123456789  N

If you've any idea why this does not work I'd be interested to hear, but you've given me a working solution. Once again thanks very much for your help!
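
For reference, the effect of the SEDCMD above on the sample row can be reproduced outside Splunk. A minimal sketch, assuming the columns are separated by single tab characters and that Python's re.sub behaves like the sed expression for this particular pattern:

```python
import re

# Sample row from above, with real tab characters between the columns.
event = "2018-02-24 00:00:02\t4789226712\t07123456789\tN"

# Python equivalent of: s/^([\S ]+\t)(\S+\t)(\S+\t)(.+)/\1\2\4/
# Group 3 (the Masked_CLI column) is simply left out of the replacement.
masked = re.sub(r"^([\S ]+\t)(\S+\t)(\S+\t)(.+)", r"\1\2\4", event)

print(masked.split("\t"))
# ['2018-02-24 00:00:02', '4789226712', 'N']
```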

jeffland
Champion

You're right, it doesn't work as I said it would - when you use WRITE_META=true, it doesn't overwrite any existing fields. It just appends to _meta, same as if you used DEST_KEY=_meta with FORMAT=$0<something>. I'm sorry for the confusion.

Redman11
Explorer

@jeffland No probs. Thanks again for your help.

jeffland
Champion

So I took a drastic approach and changed my transforms to this:

[meld]
REGEX = (?m)^(.*Meldender\:\:)(.{0,2}).*?(.?)(\s.*)$
FORMAT = $1$2-X-$3$4
WRITE_META = false
SOURCE_KEY = _meta
DEST_KEY = _meta

This changes _meta instead of adding to it. I don't really know what this does to indexing time, with a SEDCMD plus a very ugly regex applied to the entire metadata. I'm lucky I only have to do this once in a while with small amounts of data... this can't be the solution.
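
A rough way to see what this regex does to a metadata string is to run it in Python. This is a sketch under assumptions: the _meta layout and the fields "Datum" and "Status" are made up, and Python's re module only approximates Splunk's regex engine.

```python
import re

# REGEX from the [meld] stanza above.
pattern = re.compile(r"(?m)^(.*Meldender\:\:)(.{0,2}).*?(.?)(\s.*)$")
replacement = r"\g<1>\g<2>-X-\g<3>\g<4>"  # FORMAT = $1$2-X-$3$4

# Hypothetical _meta contents; only the Meldender field comes from the thread.
meta = "Datum::2018-02-24 Meldender::Mustermann Status::offen"

masked = pattern.sub(replacement, meta)
print(masked)
# Datum::2018-02-24 Meldender::Mu-X-n Status::offen

# If Meldender is the last entry, (\s.*) finds no whitespace and nothing changes.
last = "Datum::2018-02-24 Meldender::Mustermann"
unchanged = pattern.sub(replacement, last)
print(unchanged == last)
# True
```

So, at least in this model, the pattern relies on another entry following Meldender in _meta.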

delink
Communicator

This answer, not the accepted one above, is the correct solution. I was able to get it to work with the following:

[mask_ssn01_cs_uri_query]
SOURCE_KEY = _meta
REGEX = (?i)(.*(?:ssn|SearchValue)=)\d{0,5}(\d{4}.*)
DEST_KEY = _meta
WRITE_META = false
FORMAT = $1XXX-XX-$2
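
As a quick check of what this stanza does, the same REGEX and FORMAT can be tried in Python. A minimal sketch, assuming a _meta entry of the form shown below; the nine-digit value is made-up test data, and Python's re module only approximates Splunk's regex engine.

```python
import re

# REGEX from the [mask_ssn01_cs_uri_query] stanza above.
pattern = re.compile(r"(?i)(.*(?:ssn|SearchValue)=)\d{0,5}(\d{4}.*)")

# Hypothetical _meta entry with a made-up nine-digit value.
meta = "cs_uri_query::ssn=123456789"

# FORMAT = $1XXX-XX-$2 drops up to five leading digits and keeps the last four.
masked = pattern.sub(r"\g<1>XXX-XX-\g<2>", meta)
print(masked)
# cs_uri_query::ssn=XXX-XX-6789
```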