Getting Data In

Can Splunk decode data at index time?

Splunk Employee

If I have a field value that is URL encoded then base-64 encoded, is it possible to have Splunk decode this field before indexing (maybe via a custom processor)? Has anyone done this before? Is it recommended? How difficult is it?

This could probably be done easily with a custom search script at search time, but that approach is less desirable because users would need to know to pipe their searches through the custom search command.

Here is a sample event with the body field encoded:

2010-02-26 03:19:29    : LOG: M=Ce3zW5GtsGE= A=anonymous S=48976970336315650 pt=100001 body=T%3d2010-02-26%2003%3a17%3a45%20PST%26L%3di%26M%3d%5bg2mfeedback%5d%26N%3d553%26X%3d%253cG2MFeedback%253e%2520FeedbackTracker%253a%253aupdate()%2520lastUpdateTime%25201267183021171%2520curTime%25201267183051205%2520timeSinceUpdate%252030034%2520currentAttentivenessState%25201%2520_currentSatisfactionState%25202%2520-%2520Tracker%2520025A6658%252c%2520Seconds%2520in%2520great%252039818%253b%2520fair%25200%253b%2520poor%25200%253b%2520attentive%252039818%253b%2520not%25200%0d%0aT%3d
1 Solution

Splunk Employee

In the ooooold days (1.0-ish), Splunk imagined the processors as a customer-available API, but there were a variety of problems: the binary interfaces were too brittle, and the API's challenges were not conducive to plugging in arbitrary code.

While it's technically still possible to plug in your own processor by wiring up the XML and building the code just so, it's not easy, and definitely not recommended.

The more loosely coupled approach of handling this in an input script is probably the way to go. You can be fancy and set up a scripted input, which will end up being responsible for checkpointing and file handling. My preference is to just have a script that preprocesses foo.log into foo.log.processed, or similar, and have Splunk watch the processed version. It's easy to write, easy to debug, and easy to configure Splunk to use.
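To make that concrete, here is a minimal sketch of such a preprocessing script in Python. It assumes the body field is double URL-encoded (as the thread later determines); the file names and the `body=` pattern are illustrative, not part of the original answer:

```python
import re
from urllib.parse import unquote

# Hypothetical pattern: 'body=' followed by the encoded value (no spaces),
# matching the shape of the sample event in the question.
BODY_RE = re.compile(r'(body=)(\S+)')

def decode_line(line):
    """Return the line with its body= value URL-decoded twice."""
    def repl(m):
        return m.group(1) + unquote(unquote(m.group(2)))
    return BODY_RE.sub(repl, line)

def preprocess(src="foo.log", dst="foo.log.processed"):
    """Rewrite src into dst with decoded body fields."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(decode_line(line))
```

Splunk would then be configured to monitor `foo.log.processed` instead of `foo.log`; the script itself can run from cron or be wrapped as a scripted input.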


Splunk Employee

On re-read, I don't see any performance concerns. You can achieve your field filtering transparently via a scripted lookup:

http://docs.splunk.com/Documentation/Splunk/5.0/Knowledge/Addfieldsfromexternaldatasources#Set_up_a_...
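As a sketch of what that scripted lookup could look like, following the external lookup convention described in those docs (the script receives a CSV of lookup fields on stdin and must write the same CSV back on stdout with the output field filled in). The script and field names (`urldecode.py`, `body_encoded`, `body_decoded`) are assumptions for illustration:

```python
#!/usr/bin/env python
# Sketch of a Splunk external (scripted) lookup that double URL-decodes
# a field. Field names are illustrative assumptions.
import csv
import sys
from urllib.parse import unquote

def decode_row(row, enc="body_encoded", dec="body_decoded"):
    """Fill the decoded column from the encoded one (double URL-decode)."""
    if row.get(enc):
        row[dec] = unquote(unquote(row[enc]))
    return row

def main():
    reader = csv.DictReader(sys.stdin)
    writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow(decode_row(row))

if __name__ == "__main__":
    main()
```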


Splunk Employee

Alright, fine, you get the idea.


Splunk Employee

Hmm, I meant to format that:

 FIELDALIAS-body = body AS body_encoded
 LOOKUP-urldecode = urldecode body_encoded OUTPUT body_decoded AS body

Splunk Employee

You could also do:

FIELDALIAS-body = body AS body_encoded
LOOKUP-urldecode = urldecode body_encoded OUTPUT body_decoded AS body

This will work, as the order goes: EXTRACT, FIELDALIAS, LOOKUP. You could also just change your extraction to extract body as body_encoded, but that might be a pain if you're just using KV_MODE.
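For reference, the `urldecode` lookup used in those props.conf lines would also need a matching definition in transforms.conf, along these lines (the script name and field names here are assumptions):

```
[urldecode]
external_cmd = urldecode.py body_encoded body_decoded
fields_list = body_encoded, body_decoded
```

`external_cmd` points at the lookup script in the app's bin directory, and `fields_list` names the fields Splunk passes to and reads back from it.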


Splunk Employee

I implemented the external lookup. The encoding turns out to be a double URL encoding, not a URL encoding followed by a base-64 encoding as originally stated. The lookup works, but only just: it presents a new field 'body_decoded' with the decoded field value. However, since the decoding happens at search time, searching is awkward. You need to search with 'body_decoded=coolstuff'; a plain keyword search does not work, because the value of the 'body' field was not segmented at index time. We will have to pursue the alternative: preprocessing the log file before indexing. I wish this could be done more easily in Splunk.


Splunk Employee

Thank you, Josh! I think this is the most promising approach. I will give it a try and post results here.


