Splunk Search

URLDecode and escaping unconvertible characters

aaalexander
Engager

I've just run across an interesting issue with urldecode: if the attempt to decode fails, the function returns an empty string (""). My application logs to Splunk over HTTP from all around the world, so all my logs arrive URL-encoded. What I've been seeing is akin to the following:

I receive a line like "message=URLError%3A+%3Curlopen+error+%5BErrno+10061%5D+%CF%EE%E4%EA%EB%FE%F7%E5%ED%E8%E5+%ED%E5+%F3%F1%F2%E0%ED%EE%E2%EB%E5%ED%EE%2C%3E%0A", which I then usually decode with | eval line=urldecode(message) | table line

This usually prints a table of the logs I'm receiving.

However, the above message (URLError%3A+%3Curlopen+error+%5BErrno+10061%5D+%CF%EE%E4%EA%EB%FE%F7%E5%ED%E8%E5+%ED%E5+%F3%F1%F2%E0%ED%EE%E2%EB%E5%ED%EE%2C%3E%0A) fails to be decoded by the urldecode function.

If you trim the line, you can see that it decodes fine up to 10061%5D, which decodes as 10061]. After that point, however, the decoding fails and the entire string is returned as empty.

If you visit http://www.url-encode-decode.com/urldecode and enter the string above, you will see that it decodes the bulk of the message but fails on some of the values; instead of bombing out completely, it returns them as question marks, similar to how the 'replace' error handler works in Python's Unicode encoding (https://docs.python.org/2/howto/unicode.html?highlight=replace#the-unicode-type):

>>> u = unichr(40960) + u'abcd' + unichr(1972)

>>> u.encode('ascii', 'replace')

'?abcd?'

This is the behavior I expected, since it would let me actually use some of the logged information rather than dropping it entirely.

Does anyone know of a way I can resolve this, or a way to tell the urldecode function either a) to use a different encoding, or b) to use something akin to the 'replace' functionality?
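For reference, the behavior being asked for can be reproduced outside Splunk. Python 3's urllib.parse.unquote_plus defaults to errors='replace', so the %CF%EE... bytes, which are not valid UTF-8, come back as replacement characters instead of an empty string. As an aside, those bytes look like Windows-1251 (Cyrillic); passing that codec is an assumption about the source encoding, but it recovers readable text here:

```python
from urllib.parse import unquote_plus

# The sample message from the question above.
msg = ("URLError%3A+%3Curlopen+error+%5BErrno+10061%5D"
       "+%CF%EE%E4%EA%EB%FE%F7%E5%ED%E8%E5+%ED%E5"
       "+%F3%F1%F2%E0%ED%EE%E2%EB%E5%ED%EE%2C%3E%0A")

# Default UTF-8 decoding: the high bytes are invalid UTF-8, and
# unquote_plus's default errors='replace' substitutes U+FFFD for
# them rather than failing outright.
print(unquote_plus(msg))

# Assuming the bytes are actually windows-1251 (cp1251), naming
# that encoding recovers the original Russian text.
print(unquote_plus(msg, encoding="cp1251"))
```

This is only a sketch of the desired semantics; Splunk's built-in urldecode does not expose an encoding or error-handling argument.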

Thank you for your time.


yannK
Splunk Employee

A workaround is to replace the invalid percent-escapes with a dash ("-") before the urldecode:

mysearch | rex mode=sed "s/%[089A-Fa-f][0-9A-Fa-f]/-/g" | eval decode=urldecode(_raw) | table _raw decode
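Outside Splunk, the same pre-substitution can be sketched in Python, with re.sub standing in for rex mode=sed and the sample message taken from the question above:

```python
import re
from urllib.parse import unquote_plus

msg = ("URLError%3A+%3Curlopen+error+%5BErrno+10061%5D"
       "+%CF%EE%E4%EA%EB%FE%F7%E5%ED%E8%E5+%ED%E5"
       "+%F3%F1%F2%E0%ED%EE%E2%EB%E5%ED%EE%2C%3E%0A")

# Replace every percent-escape whose first hex digit marks a
# non-ASCII byte (%80-%FF) or a %0x control byte with "-",
# so the remaining escapes are safe to urldecode.
safe = re.sub(r"%[089A-Fa-f][0-9A-Fa-f]", "-", msg)
print(unquote_plus(safe))
```

The readable ASCII portion ("URLError: <urlopen error [Errno 10061] ...") survives, with one dash per stripped escape.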

orion44
Communicator

I have the same problem but I have no idea which specific special characters will break the urldecode functionality as the logged user input varies in my case. Is there no way for Splunk to gracefully handle these situations, per the original question?


landen99
Motivator

I used sed with a comma replacement before the urldecode,

| rex mode=sed field=hex_url "s/%[089A-Fa-f][0-9A-Fa-f]/,/g"

but to remove the special characters, I had to follow the urldecode with

| rex mode=sed field=myfield "s/,\W+,/,/g"

In my case, the special characters ended up between two commas, so they were captured with \W+ (one or more non-word characters).
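For completeness, this two-step flow (substitute, decode, then collapse the placeholder runs) can be sketched in Python; the regexes mirror the SPL above, and the sample message comes from the original question:

```python
import re
from urllib.parse import unquote_plus

hex_url = ("URLError%3A+%3Curlopen+error+%5BErrno+10061%5D"
           "+%CF%EE%E4%EA%EB%FE%F7%E5%ED%E8%E5+%ED%E5"
           "+%F3%F1%F2%E0%ED%EE%E2%EB%E5%ED%EE%2C%3E%0A")

# Step 1: swap each undecodable percent-escape for a comma.
step1 = re.sub(r"%[089A-Fa-f][0-9A-Fa-f]", ",", hex_url)

# Step 2: decode what is left.
decoded = unquote_plus(step1)

# Step 3: collapse a comma, one or more non-word characters, and
# another comma down to a single comma.
cleaned = re.sub(r",\W+,", ",", decoded)
print(cleaned)
```

Note that the placeholder comma is ambiguous if the decoded text legitimately contains commas (the sample message does, via %2C), so a rarer placeholder character may be a safer choice.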


sherm77
Path Finder

This answer worked just fine for me. Thanks! 🙂
