Getting Data In

Hashing instead of masking at index time

Lowell
Super Champion

During the Splunk parsing phase, is there any way to hash portions of the event? I know it's possible to discard or mask (trim) portions of the event using SEDCMD or a transformer, but I don't see any options for hashing.

I'm looking for a pure Splunk solution that doesn't require scripted (or modular) inputs. Calling out from Splunk would be acceptable, but I'm unaware of any custom "hooks" in the parsing phase (for performance and stability reasons, I assume).

I'm pretty sure I know the answer to this, but figured I'd ask before sending in a feature request.

For a bit of background:

Personally identifiable information (SSNs, credit card numbers, passwords, ...) ends up in log files in clear text. The core issue is often a software development one, but the Splunk admins often have no way to control it. The safest option is to remove the sensitive info, but then you lose visibility. Sometimes keeping a few characters provides enough detail to compare different events without giving away the entire secret, but of course the risks are: (1) some of the information is still available in clear text, potentially revealing too much, and (2) since the full value isn't known, it's not possible to accurately compare values. Hashing the values instead (with MD5 or SHA, say) would (1) fully protect the original value from being discovered, and (2) still allow accurate grouping and/or transaction operations on the sensitive field.
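To illustrate the idea (this is not a Splunk feature, just a sketch of the kind of preprocessing step being asked for; the regex, salt, and function name are hypothetical):

```python
import hashlib
import re

def hash_ssns(line, salt="example-salt"):
    """Replace anything that looks like an SSN with a salted SHA-256 token.

    The pattern and salt are illustrative only. A real deployment needs a
    secret salt, otherwise known SSNs can be recovered by brute force.
    """
    def repl(m):
        digest = hashlib.sha256((salt + m.group(0)).encode()).hexdigest()
        return "SSN-" + digest[:16]  # a truncated digest still groups reliably
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", repl, line)

# The same input always yields the same token, so events stay comparable
# across the whole index even though the clear-text value is gone:
a = hash_ssns("user=jdoe ssn=123-45-6789 action=login")
b = hash_ssns("user=jdoe ssn=123-45-6789 action=logout")
```

This is exactly the property masking loses: two masked events can no longer be joined on the hidden field, while two hashed events can.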

Lowell
Super Champion

I just want to point out that the ELK stack can do this!

So my answer is:

  1. Deploy LogStash
  2. Configure it to read in the log
  3. Configure the hashing transformation
  4. Dump the output to a new log file
  5. Ingest the new log file with Splunk (UF)
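The steps above could be sketched as a Logstash pipeline along these lines (the paths, field name, and grok pattern are illustrative; the fingerprint filter is what does the hashing):

```
input {
  file { path => "/var/log/myapp/app.log" }        # hypothetical source log
}
filter {
  # hypothetical extraction of the sensitive field from the raw line
  grok { match => { "message" => "ssn=%{NOTSPACE:ssn}" } }
  # replace the field value with a keyed SHA-256 hash
  fingerprint {
    source => "ssn"
    target => "ssn"
    method => "SHA256"
    key    => "change-me-secret-salt"
  }
}
output {
  file { path => "/var/log/myapp/app-hashed.log" } # what the UF then monitors
}
```

Note this hashes the extracted field; you would still need a mutate filter (or similar) to rewrite the raw message itself before Splunk picks up the output file.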

Lowell
Super Champion

I think the invalid_cause is set right. I even tried just reproducing the default gzip-processing setup and can't make that work. Posted as a separate question here:
http://answers.splunk.com/answers/143771/whats-the-trick-to-get-unarchive_cmd-to-work-for-a-custom-a...


martin_mueller
SplunkTrust

Obvious question is obvious: You did set the invalid_cause, right?


Lowell
Super Champion

Wow, that's an interesting approach. Sounds like something I would have dreamt up ;-).

I actually thought about the unarchive_cmd option after posting the question and have played around with it a bit, but so far with no success. After cranking up the DEBUG logs I'm finally seeing DEBUG ArchiveContext - /tmp/blah-debug.test.me is NOT an archive file. I get the same message even if the file IS in gzip format, so I'm puzzled.

Agreed that the incremental indexing thing could be a problem, but I may be able to work around that for the use case in front of me.


martin_mueller
SplunkTrust

I've had this requirement before: http://answers.splunk.com/answers/88926/modify-_raw-collect-into-second-index-how-to-best-retain-hos...

Basically no pretty Splunk-only solution.

I didn't toy around with props.conf's unarchive_cmd to see if you could hook a custom script into the indexing process that way... If I had to guess, I'd say it might break incremental indexing, because that's not available for .gz files either, but it might be worth a shot.
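A sketch of what that props.conf hook might look like, assuming the stanza names and script path (which are hypothetical, and this wiring is exactly the part reported as not working elsewhere in this thread). Per the props.conf spec, unarchive_cmd only takes effect when invalid_cause is set to "archive", and the command reads the raw file on stdin and writes the rewritten stream to stdout, the way gunzip does for .gz files:

```
# props.conf (sketch, unverified)
[source::/var/log/myapp/secure.log]
sourcetype = myapp_preprocess

[myapp_preprocess]
invalid_cause = archive
unarchive_cmd = /opt/splunk/etc/apps/myapp/bin/hashfields.py
```

Here hashfields.py would be a custom filter that hashes the sensitive fields on the way through; as noted, incremental (tail-based) indexing likely wouldn't survive this.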
