Getting Data In

Can Splunk decode data at index time?

Splunk Employee

If I have a field value that is URL encoded then base-64 encoded, is it possible to have Splunk decode this field before indexing (maybe via a custom processor)? Has anyone done this before? Is it recommended? How difficult is it?

This could probably be done easily with a custom search script at search time, but that approach is less desirable because users would need to know to pipe their searches through the custom search command.

Here is a sample event with the body field encoded:

2010-02-26 03:19:29    : LOG: M=Ce3zW5GtsGE= A=anonymous S=48976970336315650 pt=100001 body=T%3d2010-02-26%2003%3a17%3a45%20PST%26L%3di%26M%3d%5bg2mfeedback%5d%26N%3d553%26X%3d%253cG2MFeedback%253e%2520FeedbackTracker%253a%253aupdate()%2520lastUpdateTime%25201267183021171%2520curTime%25201267183051205%2520timeSinceUpdate%252030034%2520currentAttentivenessState%25201%2520_currentSatisfactionState%25202%2520-%2520Tracker%2520025A6658%252c%2520Seconds%2520in%2520great%252039818%253b%2520fair%25200%253b%2520poor%25200%253b%2520attentive%252039818%253b%2520not%25200%0d%0aT%3d
1 Solution

Splunk Employee

In the ooooold days (1.0-ish), Splunk imagined the processors as a customer-available API, but there were a variety of problems: the binary interfaces were too brittle, and the API's challenges were not conducive to plugging in arbitrary code.

While it's technically still possible to plug in your own processor by wiring up the XML and building the code just so, it's not easy, and definitely not recommended.

The more loosely coupled approach of handling this in an input script is probably the way to go. You can be fancy and set up a scripted input, which will end up being responsible for checkpointing and file handling. My preference is to just have a script that preprocesses foo.log into foo.log.processed, or similar, and have Splunk watch the processed version. It's easy to write, easy to debug, and easy to configure Splunk to use.
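To make that concrete, here is a minimal sketch of such a preprocessing script in Python. It assumes the body field is double URL-encoded (as the thread later determines); the file names and the `body=` pattern are illustrative, not part of the original answer:

```python
import re
from urllib.parse import unquote

# Hypothetical pattern: 'body=' followed by the encoded value (no spaces),
# matching the shape of the sample event in the question.
BODY_RE = re.compile(r'(body=)(\S+)')

def decode_line(line):
    """Return the line with its body= value URL-decoded twice."""
    def repl(m):
        return m.group(1) + unquote(unquote(m.group(2)))
    return BODY_RE.sub(repl, line)

def preprocess(src="foo.log", dst="foo.log.processed"):
    """Rewrite src into dst with decoded body fields."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(decode_line(line))
```

Splunk would then be configured to monitor `foo.log.processed` instead of `foo.log`; the script itself can run from cron or be wrapped as a scripted input.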


Splunk Employee

On re-read, I don't see any performance concerns. You can achieve your field filtering transparently via a scripted lookup:

http://docs.splunk.com/Documentation/Splunk/5.0/Knowledge/Addfieldsfromexternaldatasources#Set_up_a_...
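As a sketch of what that scripted lookup could look like, following the external lookup convention described in those docs (the script receives a CSV of lookup fields on stdin and must write the same CSV back on stdout with the output field filled in). The script and field names (`urldecode.py`, `body_encoded`, `body_decoded`) are assumptions for illustration:

```python
#!/usr/bin/env python
# Sketch of a Splunk external (scripted) lookup that double URL-decodes
# a field. Field names are illustrative assumptions.
import csv
import sys
from urllib.parse import unquote

def decode_row(row, enc="body_encoded", dec="body_decoded"):
    """Fill the decoded column from the encoded one (double URL-decode)."""
    if row.get(enc):
        row[dec] = unquote(unquote(row[enc]))
    return row

def main():
    reader = csv.DictReader(sys.stdin)
    writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow(decode_row(row))

if __name__ == "__main__":
    main()
```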


Splunk Employee

Alright, fine, you get the idea.


Splunk Employee

Hmm, I meant to format that:

 FIELDALIAS-body = body AS body_encoded
 LOOKUP-urldecode = urldecode body_encoded OUTPUT body_decoded AS body

Splunk Employee

You could also do:

FIELDALIAS-body = body AS body_encoded
LOOKUP-urldecode = urldecode body_encoded OUTPUT body_decoded AS body

This will work, as the order goes: EXTRACT, FIELDALIAS, LOOKUP. You could also just change your extraction to extract body as body_encoded, but that might be a pain if you're just using KV_MODE.
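For reference, the `urldecode` lookup used in those props.conf lines would also need a matching definition in transforms.conf, along these lines (the script name and field names here are assumptions):

```
[urldecode]
external_cmd = urldecode.py body_encoded body_decoded
fields_list = body_encoded, body_decoded
```

`external_cmd` points at the lookup script in the app's bin directory, and `fields_list` names the fields Splunk passes to and reads back from it.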


Splunk Employee

I implemented the external lookup. The encoding turns out to be a double URL encoding, not a URL encoding followed by a base-64 encoding as originally stated. The lookup works, but only just: it presents a new field 'body_decoded' with the decoded field value. However, since the decoding happens at search time, searching is awkward. You need to search with 'body_decoded=coolstuff'; a plain keyword search does not work, because the value of the 'body' field was not segmented at index time. We will have to pursue the alternative: preprocessing the log file before indexing. I wish this could be done more easily in Splunk.


Splunk Employee

Thank you, Josh! I think this is the most promising approach. I will give it a try and post results here.


