First off, TL;DR: How to best anonymize/hash/encrypt parts of _raw while keeping everything else as-is?
I've got various sources with well-known identifiers that must not be readable in the clear for the average Splunk user, while the sources themselves need to remain searchable and events need to be correlatable per identifier. For example, if I have events like this
source A: something happened for id=1234
source B: id=1234 caused an error
these two identifiers need to be unreadable while the user still needs to be able to link the two events together.
As a result of this linkability requirement, the default Splunk way of anonymization fails. I can't replace the entire identifier with Xs, because then the events can no longer be linked. I also can't keep part of the identifier readable (as with CC numbers, X'ing out everything but, say, four digits), because the identifier is not long and random-ish enough to retain both sort-of uniqueness and anonymity.
Usually in IT this is solved by hashing - make up a secret salt, append the identifier, compute hash, index hash instead of identifier. The uniqueness is retained, and it is hard to link the original identifier to the indexed value. Now, how to splunk this best?
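To make the hashing idea concrete, here is a minimal Python sketch of the salt-plus-identifier approach applied to the two example events above. The salt value, the `id=` pattern, and the function names are all made up for illustration; this is not a Splunk custom command, just the string handling such a command might do.

```python
import hashlib
import re

# Hypothetical secret salt; in a real setup it would live somewhere
# ordinary users cannot read it.
SECRET_SALT = "example-secret-salt"

# Assumed identifier format: "id=" followed by word characters.
ID_PATTERN = re.compile(r"\bid=(\w+)")


def hash_identifier(identifier, salt=SECRET_SALT):
    """Return the SHA-256 hex digest of salt + identifier."""
    data = (salt + identifier).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def anonymize_raw(raw, salt=SECRET_SALT):
    """Replace every id=<value> in raw with id=<hash>, leaving the rest untouched."""
    return ID_PATTERN.sub(
        lambda m: "id=" + hash_identifier(m.group(1), salt), raw
    )


if __name__ == "__main__":
    print(anonymize_raw("something happened for id=1234"))
    print(anonymize_raw("id=1234 caused an error"))
```

Both events end up with the same hashed value after `id=`, so they can still be correlated, while the rest of each event stays as it was. With identifiers this short, everything depends on the salt staying secret.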
I was able to come up with this path:
This works reasonably well, but mangles my host, source, and sourcetype values - and as a result, all extractions and lookups are cut off and the usual search filters stop working. Ideally, I'd only like to modify a small part of _raw and maintain the rest. Hence I've thought of this mildly hacked modification:
That appears to work in a testing stage, but feels a bit (well, a lot) cumbersome and maybe not too robust for a larger production environment. Additionally, this does not circumvent the license counter like the stash sourcetype can. That's a bit annoying, but not my greatest concern - maintaining searchability is. I'm not certain yet whether I'll need the clear-text data beyond creating the modified copy, so a purely index-once solution may work as well.
In case anyone cares about conf details, here's what my testing stage path looks like:
Scheduled collector/anonymizer search replacing the IP from access_combined_wcookie (without actual hashing, building the custom command is not the issue):
index=main host=apache.fake | rex mode=sed "s/^\S+/redacted/g" | eval _raw = _time." host=\"".host."\" source=\"".source."\" sourcetype=\"".sourcetype."\" "._raw | collect index=apache_summary file="anonymized_$timestamp$_$random$.stash_new"
Without any transformations, the collected results look like this for generated access log data:
1369322819 host="apache.fake" source="C:\LogGen\ApacheLog.log" sourcetype="access_combined_wcookie" redacted - - [23/Mai/2013:17:26:59] "GET /store/review?product_id=GK-236 HTTP/1.1" 205 10960 "http://mystore.com/Profile/cat?category_id=PLANT&JSESSIONID=SD7SL1FF9ADFF2" "Opera/9.80 (X11; Linux i686; U; en-GB) Presto/2.8.131 Version/11.11" 1540 2225
props.conf for the source named like the scheduled search:
[source::scheduled_search_name]
TRANSFORMS-z1 = stashed_host
TRANSFORMS-z2 = stashed_source
TRANSFORMS-z3 = stashed_sourcetype
TRANSFORMS-z9 = stashed_raw
transforms.conf:
[stashed_host]
REGEX = host="([^"]+)"
FORMAT = host::$1
DEST_KEY = MetaData:Host

[stashed_source]
REGEX = source="([^"]+)"
FORMAT = source::$1
DEST_KEY = MetaData:Source

[stashed_sourcetype]
REGEX = sourcetype="([^"]+)"
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype

[stashed_raw]
REGEX = (?m)^\d+ host="[^"]+" source="[^"]+" sourcetype="[^"]+" (.*)$
FORMAT = $1
DEST_KEY = _raw
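To illustrate what these four transforms are meant to do, here is a small Python sketch that applies the same regular expressions to a shortened version of the collected sample event. It only mimics the regex logic outside of Splunk; the `unstash` helper and the dictionary layout are invented for the example, and Splunk's actual index-time processing is not reproduced here.

```python
import re

# Same patterns as the stashed_host / stashed_source / stashed_sourcetype stanzas.
META_PATTERNS = {
    "host": re.compile(r'host="([^"]+)"'),
    "source": re.compile(r'source="([^"]+)"'),
    "sourcetype": re.compile(r'sourcetype="([^"]+)"'),
}

# Same pattern as the stashed_raw stanza: capture everything after the prefix.
RAW_PATTERN = re.compile(
    r'(?m)^\d+ host="[^"]+" source="[^"]+" sourcetype="[^"]+" (.*)$'
)


def unstash(event):
    """Split a prefixed event into (metadata dict, original raw text)."""
    meta = {
        key: pattern.search(event).group(1)
        for key, pattern in META_PATTERNS.items()
    }
    raw = RAW_PATTERN.search(event).group(1)
    return meta, raw


# Shortened copy of the collected sample event.
SAMPLE = (
    r'1369322819 host="apache.fake" source="C:\LogGen\ApacheLog.log" '
    r'sourcetype="access_combined_wcookie" redacted - - '
    r'[23/Mai/2013:17:26:59] "GET /store/review?product_id=GK-236 HTTP/1.1" '
    r'205 10960'
)

if __name__ == "__main__":
    meta, raw = unstash(SAMPLE)
    print(meta)
    print(raw)
```

The metadata values come out of the prefix, and the remaining text is the anonymized event without the prefix, which is the state the transforms are supposed to leave in the index.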
There is always the option to hash the fields of interest after initial indexing and then collect those events to another index that the users are allowed to see.
The best solution is always to solve the problem at the source. Short of that, you can index the data as it arrives into an index that is not available to the users, then run a search that hashes the fields of interest and stores the results with collect in another index that the users are allowed to search.