Solved: Modify _raw, collect into second index - how to best retain host, source, sourcetype?

martin_mueller · ‎05-24-2013

First off, TL;DR: How to best anonymize/hash/encrypt parts of _raw while keeping everything else as-is?

I've got various sources containing well-known identifiers that must not be readable in clear text by the average Splunk user, while the sources themselves need to remain searchable and events need to be correlatable per identifier. For example, if I have events like this

source A: something happened for id=1234
source B: id=1234 caused an error

these two identifiers need to be unreadable while the user still needs to be able to link the two events together.

Because of this linkability requirement, Splunk's default anonymization approach fails: I can't replace the entire identifier with Xs, and I can't keep part of the identifier readable (as with credit card numbers, where everything but, say, four digits is X'ed out) because the identifier is not long and random enough to retain both rough uniqueness and anonymity.

Usually in IT this is solved by hashing - make up a secret salt, append the identifier, compute hash, index hash instead of identifier. The uniqueness is retained, and it is hard to link the original identifier to the indexed value. Now, how to splunk this best?
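
The salted-hash idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the salt value, the function name pseudonymize, and the choice of SHA-256 are made up here and are not something Splunk prescribes.

```python
import hashlib

# Hypothetical secret salt; it has to stay hidden from ordinary users.
SECRET_SALT = "not-the-real-salt"

def pseudonymize(identifier, salt=SECRET_SALT):
    """Return the hex digest of salt + identifier (the scheme described above)."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()

# The same identifier always yields the same token, so events stay linkable.
token_a = pseudonymize("1234")
token_b = pseudonymize("1234")
token_other = pseudonymize("1235")
```

For short identifiers the secrecy rests entirely on the salt: without it, anyone could hash every candidate ID and compare. The standard library's hmac module would be a common alternative to plain concatenation.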

I was able to come up with this path:

1. index clear-text identifiers into an invisible-to-users index
2. schedule a search that performs hashing with a custom command for a recent slice of events
3. collect the results into a visible-to-users index
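
What the hashing step does to each event can be sketched outside Splunk as follows. This is not the actual custom command; the id=<digits> pattern comes from the example events above, while the salt and the function names are assumptions for illustration.

```python
import hashlib
import re

SALT = "not-the-real-salt"  # hypothetical secret
ID_PATTERN = re.compile(r"\bid=(\d+)")  # the id=1234 form from the example events

def hash_id(match):
    """Replace the captured identifier with its salted SHA-256 digest."""
    digest = hashlib.sha256((SALT + match.group(1)).encode("utf-8")).hexdigest()
    return "id=" + digest

def anonymize_raw(raw):
    """Rewrite only the identifier; the rest of the event text is untouched."""
    return ID_PATTERN.sub(hash_id, raw)

event_a = anonymize_raw("something happened for id=1234")
event_b = anonymize_raw("id=1234 caused an error")
```

Both events end up carrying the same token in place of 1234, so they can still be joined on it, while everything around the identifier is left as it was.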

This works reasonably well, but collect mangles my host, source, and sourcetype values. As a result, all extractions and lookups tied to them are cut off and the usual search filters stop working. Ideally, I'd like to modify only a small part of _raw and keep the rest. Hence I've thought of this mildly hacky modification:

1. prepend _time, host, source, sourcetype to _raw in the scheduled search
2. collect that
3. transform host, source, sourcetype into their proper meta fields
4. strip the prepended values from _raw again (SEDCMD-style; the test config below uses a transform with DEST_KEY = _raw)

That appears to work in testing, but it feels a bit (well, a lot) cumbersome and maybe not robust enough for a larger production environment. Additionally, it does not bypass the license counter the way the stash sourcetype does. That's a bit annoying, but not my greatest concern; maintaining searchability is. I'm not certain yet whether I'll need the clear-text data beyond creating the modified copy, so a purely index-once solution may work as well.

In case anyone cares about conf details, here's what my testing stage path looks like:

Scheduled collector/anonymizer search replacing the IP from access_combined_wcookie (without actual hashing; building the custom command is not the issue):

index=main host=apache.fake | rex field=_raw mode=sed "s/^\S+/redacted/g" | eval _raw = _time." host=\"".host."\" source=\"".source."\" sourcetype=\"".sourcetype."\" "._raw | collect index=apache_summary file="anonymized_$timestamp$_$random$.stash_new"

Without any transformations, the collected results look like this for generated access log data:

1369322819 host="apache.fake" source="C:\LogGen\ApacheLog.log" sourcetype="access_combined_wcookie" redacted - - [23/Mai/2013:17:26:59] "GET /store/review?product_id=GK-236 HTTP/1.1" 205 10960 "http://mystore.com/Profile/cat?category_id=PLANT&JSESSIONID=SD7SL1FF9ADFF2" "Opera/9.80 (X11; Linux i686; U; en-GB) Presto/2.8.131 Version/11.11" 1540 2225

props.conf for the source named after the scheduled search:

[source::scheduled_search_name]
TRANSFORMS-z1 = stashed_host
TRANSFORMS-z2 = stashed_source
TRANSFORMS-z3 = stashed_sourcetype
TRANSFORMS-z9 = stashed_raw

transforms.conf:

[stashed_host]
REGEX = host="([^"]+)"
FORMAT = host::$1
DEST_KEY = MetaData:Host

[stashed_source]
REGEX = source="([^"]+)"
FORMAT = source::$1
DEST_KEY = MetaData:Source

[stashed_sourcetype]
REGEX = sourcetype="([^"]+)"
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype

[stashed_raw]
REGEX = (?m)^\d+ host="[^"]+" source="[^"]+" sourcetype="[^"]+" (.*)$
FORMAT = $1
DEST_KEY = _raw
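
As a quick sanity check outside Splunk, the four regexes can be run against the sample collected event with Python's re module. This only exercises the patterns; it assumes Python's regex dialect behaves like Splunk's PCRE for these expressions and does not emulate the index-time pipeline. The sample line is shortened from the event shown above.

```python
import re

# Shortened copy of the collected sample event (user agent etc. omitted).
sample = (
    '1369322819 host="apache.fake" source="C:\\LogGen\\ApacheLog.log" '
    'sourcetype="access_combined_wcookie" '
    'redacted - - [23/Mai/2013:17:26:59] '
    '"GET /store/review?product_id=GK-236 HTTP/1.1" 205 10960'
)

# Same patterns as in the transforms.conf stanzas above.
patterns = {
    "host": r'host="([^"]+)"',
    "source": r'source="([^"]+)"',
    "sourcetype": r'sourcetype="([^"]+)"',
    "_raw": r'(?m)^\d+ host="[^"]+" source="[^"]+" sourcetype="[^"]+" (.*)$',
}

# First capture group of each pattern, i.e. what FORMAT = ...$1 would receive.
extracted = {key: re.search(rx, sample).group(1) for key, rx in patterns.items()}

for key, value in extracted.items():
    print(f"{key}: {value}")
```

One thing this makes visible: the host, source, and sourcetype patterns are unanchored, so they pick up the first occurrence anywhere in the event. That is the prepended value here only because the prefix comes first.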

martin_mueller · ‎08-01-2013

I've gone a different route - make the application produce anonymized values in the first place.

landen99 · ‎01-29-2016

There is always the option to hash the fields of interest after initial indexing and then collect those events to another index that the users are allowed to see.

martin_mueller · ‎02-10-2019

Six years later, you can finally use 7.2's INGEST_EVAL to perform hashing right at index time.

landen99 · ‎01-29-2016

The best solution is always to solve the problem at the source. Short of that, you can index the data as-is into an index the users cannot access, run a search that hashes the fields of interest, and use collect to store the results in another index the users are allowed to search.
