Splunk Search

URLDecode and escaping unconvertible characters

aaalexander
Engager

I've just run across an interesting issue with the use of urldecode: if the attempt to decode fails, the function returns an empty string (""). My application logs to Splunk via HTTP from all around the world, so all of my logs arrive URL-encoded. What I've been seeing is akin to the following:

I receive a line like "message=URLError%3A+%3Curlopen+error+%5BErrno+10061%5D+%CF%EE%E4%EA%EB%FE%F7%E5%ED%E8%E5+%ED%E5+%F3%F1%F2%E0%ED%EE%E2%EB%E5%ED%EE%2C%3E%0A", which I then usually decode with | eval line=urldecode(message) | table line

This would usually print me out a table of the logs I'm receiving.

However, the above message (URLError%3A+%3Curlopen+error+%5BErrno+10061%5D+%CF%EE%E4%EA%EB%FE%F7%E5%ED%E8%E5+%ED%E5+%F3%F1%F2%E0%ED%EE%E2%EB%E5%ED%EE%2C%3E%0A) fails to be decoded by the urldecode function.

If you trim the line, you can see that it decodes fine up to and including 10061%5D, which decodes as 10061]. Unfortunately, past that point the decoding fails and the entire result is returned as an empty string.
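For anyone who wants to reproduce this without live data, a minimal sketch along these lines should show the behaviour (the 46-character prefix is just the part of the sample message up to and including %5D; exact results may vary by Splunk version):

| makeresults
| eval message="URLError%3A+%3Curlopen+error+%5BErrno+10061%5D+%CF%EE%E4%EA%EB%FE%F7%E5%ED%E8%E5+%ED%E5+%F3%F1%F2%E0%ED%EE%E2%EB%E5%ED%EE%2C%3E%0A"
| eval prefix=substr(message, 1, 46)
| eval full_decode=urldecode(message), prefix_decode=urldecode(prefix)
| table message prefix full_decode prefix_decode

On an affected instance, full_decode should come back empty while prefix_decode returns the readable "URLError: <urlopen error [Errno 10061]" portion.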

If you visit http://www.url-encode-decode.com/urldecode and enter the string above, you will see that it decodes the bulk of the message but fails on some of the values; instead of bombing out completely, it returns them as question marks, similar to how the 'replace' error handler works in Python's Unicode encoding (https://docs.python.org/2/howto/unicode.html?highlight=replace#the-unicode-type):

>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('ascii', 'replace')
'?abcd?'

This is what I expected to happen, since it means that I can actually use some of the logged information rather than just dropping it.

Does anyone know of a way I can resolve this, or tell the urldecode function to either a) use a different encoding, or b) use something akin to the 'replace' functionality?

Thank you for your time.


yannK
Splunk Employee

A workaround is to replace the invalid characters with a dash ("-") before the urldecode.

mysearch | rex mode=sed "s/%[089A-Fa-f][0-9A-Fa-f]/-/g" | eval decode=urldecode(_raw) | table _raw decode
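If you want to sanity-check this without touching indexed data, something like the following should work (the sample message is the one from the question; the character class targets escapes whose first hex digit is 8-F, i.e. bytes outside the ASCII range, plus 0 for control characters like the trailing %0A):

| makeresults
| eval _raw="message=URLError%3A+%3Curlopen+error+%5BErrno+10061%5D+%CF%EE%E4%EA%EB%FE%F7%E5%ED%E8%E5+%ED%E5+%F3%F1%F2%E0%ED%EE%E2%EB%E5%ED%EE%2C%3E%0A"
| rex mode=sed "s/%[089A-Fa-f][0-9A-Fa-f]/-/g"
| eval decode=urldecode(_raw)
| table _raw decode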

orion44
Communicator

I have the same problem, but I have no idea which specific special characters will break the urldecode functionality, since the logged user input varies in my case. Is there no way for Splunk to handle these situations gracefully, per the original question?


landen99
Motivator

I used sed with a comma replacement before the urldecode:

| rex mode=sed field=hex_url "s/%[089A-Fa-f][0-9A-Fa-f]/,/g"

but to remove the special characters, I had to follow the urldecode with

| rex mode=sed field=myfield "s/,\W+,/,/g"

In my case, the special characters ended up between two commas, so they were captured with \W+ (one or more non-word characters).
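Putting the pieces together, the whole pipeline might look something like this (hex_url and myfield are the field names from this post, and I'm assuming the decoded value lands in myfield; adjust to your own data):

... | rex mode=sed field=hex_url "s/%[089A-Fa-f][0-9A-Fa-f]/,/g"
| eval myfield=urldecode(hex_url)
| rex mode=sed field=myfield "s/,\W+,/,/g"
| table hex_url myfield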


sherm77
Path Finder

This answer worked just fine for me. Thanks! 🙂
