If I have a field value that is URL encoded then base-64 encoded, is it possible to have Splunk decode this field before indexing (maybe via a custom processor)? Has anyone done this before? Is it recommended? How difficult is it?
This could probably be done with a custom search script at search time, but that is a less desirable approach, since every user would need to know to pipe their searches through the custom search command.
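To be clear about the decoding I'm after: reverse the layers, so base-64 decode first, then URL decode. A rough Python sketch of the idea (names are illustrative):

import base64
from urllib.parse import unquote

def decode_body(value: str) -> str:
    # The value was URL encoded first, then base-64 encoded,
    # so undo the base-64 layer before the URL layer.
    return unquote(base64.b64decode(value).decode("utf-8"))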
Here is a sample event with the body field encoded:
2010-02-26 03:19:29 : LOG: M=Ce3zW5GtsGE= A=anonymous S=48976970336315650 pt=100001 body=T%3d2010-02-26%2003%3a17%3a45%20PST%26L%3di%26M%3d%5bg2mfeedback%5d%26N%3d553%26X%3d%253cG2MFeedback%253e%2520FeedbackTracker%253a%253aupdate()%2520lastUpdateTime%25201267183021171%2520curTime%25201267183051205%2520timeSinceUpdate%252030034%2520currentAttentivenessState%25201%2520_currentSatisfactionState%25202%2520-%2520Tracker%2520025A6658%252c%2520Seconds%2520in%2520great%252039818%253b%2520fair%25200%253b%2520poor%25200%253b%2520attentive%252039818%253b%2520not%25200%0d%0aT%3d
In the ooooold days (1.0-ish), Splunk imagined the processors as a customer-available API, but there were a variety of problems: the binary interfaces were too brittle, and the API's instability was not conducive to plugging in arbitrary code.
While it's technically still possible to plug in your own processor by wiring up the XML and building the code just so, it's not easy, and definitely not recommended.
The more loosely coupled approach of handling this in an input script is probably the way to go. You can be fancy and set up a scripted input, which will end up being responsible for checkpointing and file handling. My preference is to just have a script that preprocesses foo.log into foo.log.processed, or similar, and have Splunk watch the processed version. It's easy to write, easy to debug, and easy to configure Splunk to use.
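For example, a minimal sketch of such a preprocessor (the file names and the body= pattern are illustrative; swap in whatever decoding your data actually needs):

# preprocess.py: decode the body= field from foo.log into foo.log.processed,
# then point a Splunk monitor at the processed file.
import re
from urllib.parse import unquote

def decode_body(match: re.Match) -> str:
    # The sample event looks double URL encoded (note the %25 sequences),
    # so unquote() twice; adjust to match the real encoding.
    return "body=" + unquote(unquote(match.group(1)))

with open("foo.log") as src, open("foo.log.processed", "w") as dst:
    for line in src:
        dst.write(re.sub(r"body=(\S+)", decode_body, line))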
On re-read, I don't see any performance concerns. You can achieve your field decoding transparently via a scripted lookup:

FIELDALIAS-body = body AS body_encoded
LOOKUP-urldecode = urldecode body_encoded OUTPUT body_decoded AS body
This will work, as the order goes: EXTRACT, FIELDALIAS, LOOKUP. You could also just change your extraction to extract body as body_encoded, but that might be a pain if you're just using KV_MODE.
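For what it's worth, the external lookup script follows Splunk's scripted lookup convention: it reads CSV rows (with a header) on stdin and writes them back on stdout with the output field filled in. A minimal sketch, assuming a transforms.conf stanza roughly like this (names illustrative):

[urldecode]
external_cmd = urldecode.py body_encoded body_decoded
fields_list = body_encoded, body_decoded

# urldecode.py: fill body_decoded for each row Splunk hands us.
import csv
import sys
from urllib.parse import unquote

reader = csv.DictReader(sys.stdin)
writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
writer.writeheader()
for row in reader:
    if row.get("body_encoded"):
        # Two unquote() passes, since the sample body looks double URL encoded.
        row["body_decoded"] = unquote(unquote(row["body_encoded"]))
    writer.writerow(row)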
I implemented the external lookup. The encoding turns out to be a double URL encoding, not a URL encoding followed by a base-64 encoding as originally stated. The lookup works, but only partially: it presents a new field, body_decoded, with the decoded value. However, since the decoding is done at search time, searching is awkward: you have to search on body_decoded=coolstuff explicitly, and a plain keyword search does not work, since the value of the body field was never segmented at index time. We will have to pursue the alternative of preprocessing the log file before indexing. I wish this could be done more easily in Splunk.
Thank you, Josh! I think this is the most promising approach. I will give it a try and post results here.