During the Splunk parsing phase, is there any way to hash portions of the event? I know it's possible to discard or mask (trim) portions of the event using SEDCMD or a transformer, but I don't see any options for hashing.
I'm looking for a pure Splunk solution that doesn't require scripted (or modular) inputs. Calling out from Splunk would be acceptable, but I'm unaware of any custom "hooks" in the parsing phase (for performance and stability reasons, I assume).
I'm pretty sure I know the answer to this, but figured I'd ask before sending in a feature request.
For a bit of background:
Personally identifiable information (SSNs, credit card numbers, passwords, ...) ends up in log files in clear text. The core issue is usually a software development one, but the Splunk admins often have no way to control this. The safest option is to remove the sensitive info, but then you lose visibility. Sometimes keeping a few characters provides enough detail to compare different events without giving away the entire secret, but of course the risks are: (1) some of the information is still available in clear text, potentially revealing too much, and (2) since the full value isn't known, it's not possible to accurately compare values. Hashing the values with a hash function (like MD5 or SHA) instead would (1) fully protect the original value from being discovered, and (2) still allow accurate grouping and/or transaction operations on the sensitive field.
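To illustrate the point about grouping: a minimal Python sketch (the function name and salt parameter are my own, not part of any Splunk feature) showing that identical secrets hash to identical tokens, so stats/transaction-style comparisons keep working even though the original value is unrecoverable:

```python
import hashlib

def hash_field(value: str, salt: str = "") -> str:
    """Replace a sensitive value with a hex digest; identical inputs
    still produce identical outputs, so grouping keeps working."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Two events carrying the same SSN hash to the same token...
a = hash_field("123-45-6789")
b = hash_field("123-45-6789")
assert a == b
# ...while a different SSN hashes to a different token.
assert a != hash_field("987-65-4321")
```

(A fixed secret salt would also defeat brute-forcing of low-entropy values like SSNs, which plain MD5/SHA alone would not.)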
I just want to point out that the ELK stack can do this!
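For comparison, Logstash handles this with its fingerprint filter. A rough sketch (the field name ssn and the key value are assumptions, not from the thread):

```
filter {
  fingerprint {
    source => "ssn"
    target => "ssn_hash"
    method => "SHA256"
    key    => "some-secret-salt"
  }
  mutate { remove_field => ["ssn"] }
}
```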
So my answer is:
I think invalid_cause is set right. I even tried just reproducing the default gzip processing and can't make that work. Posted as a separate question here:
Wow, that's an interesting approach. Sounds like something I would have dreamt up ;-).
So I actually thought about the unarchive_cmd option after posting the question and have played around a bit, but so far with no success. After I cranked up the DEBUG logs I'm finally seeing DEBUG ArchiveContext - /tmp/blah-debug.test.me is NOT an archive file. I get the same error even if the file IS in gzip format, so I'm puzzled.
Agreed that the incremental indexing thing could be a problem, but I may be able to work around that for the use case in front of me.
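For what it's worth, the script you'd hand to unarchive_cmd is just a stdin-to-stdout filter. A hypothetical sketch (the SSN regex, the truncated digest length, and the script itself are all assumptions, not something Splunk ships):

```python
#!/usr/bin/env python3
"""Hypothetical unarchive_cmd filter: reads raw log data on stdin,
replaces SSN-shaped tokens with a truncated SHA-256 digest, and
writes the result to stdout for Splunk to index."""
import hashlib
import re
import sys

# Naive SSN pattern -- an assumption for illustration only.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def hashed(match: re.Match) -> str:
    """Return a short stable token in place of the matched secret."""
    return hashlib.sha256(match.group(0).encode()).hexdigest()[:16]

def main() -> None:
    for line in sys.stdin:
        sys.stdout.write(SSN.sub(hashed, line))

if __name__ == "__main__":
    main()
```

Because the token is a deterministic digest, the same SSN still groups together across events, which is the whole point over masking.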
I've had this requirement before: http://answers.splunk.com/answers/88926/modify-_raw-collect-into-second-index-how-to-best-retain-hos...
Basically, there's no pretty Splunk-only solution.
I didn't toy around with props.conf's unarchive_cmd to see if you could hook a custom script into the indexing process that way... If I had to guess, I'd say that might break incremental indexing, since that's not available for .gz files either, but it might be worth a shot.
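If anyone wants to experiment with that route, a props.conf stanza might look roughly like this (the source path, sourcetype name, and script path are made up for illustration; invalid_cause = archive is what makes Splunk treat the file as an archive and run it through unarchive_cmd):

```
[source::/var/log/app/sensitive*.log]
invalid_cause = archive
unarchive_cmd = /opt/splunk/bin/scripts/hash_pii.py
sourcetype = app_sensitive
```

The command receives the raw file on stdin and must write the transformed data to stdout; as noted above, expect incremental (tail) indexing to stop working, just as it does for .gz files.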