
Website input app: Python error "LookupError: unknown encoding: 3Dutf-8="

moseisleydk
Path Finder

I get the error:

2017-12-13 20:51:38,141 ERROR An exception occurred when attempting to retrieve information from the web-page, stanza=web_input://www_mos_eisley_dk
Traceback (most recent call last):
  File "/splunk/etc/apps/website_input/bin/web_input.py", line 349, in run
    https_only=self.is_on_cloud(input_config.session_key))
  File "/splunk/etc/apps/website_input/bin/website_input_app/web_scraper.py", line 710, in scrape_page
    additional_fields=additional_fields, **kw)
  File "/splunk/etc/apps/website_input/bin/website_input_app/web_scraper.py", line 446, in get_result_single
    content_decoded = content.decode(encoding=encoding, errors='replace')
LookupError: unknown encoding: 3Dutf-8=

LukeMurphey
Champion

This is a confirmed bug. I was able to reproduce it with the unit test framework, which simulates a web server that provides an invalid encoding. See the bug report here: https://lukemurphey.net/issues/2190.

I have updated the app to be forgiving if it sees an encoding it doesn't recognize. The fix is working and will go out in version 4.5.2 (ETA: early next week).
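
For readers who hit the same traceback, here is a minimal sketch of what "forgiving" handling could look like. It is not the app's actual code: `decode_leniently` and its `fallback` parameter are hypothetical names, and it assumes Python 3 and that UTF-8 is an acceptable default. It checks the declared encoding with `codecs.lookup` before decoding, so an unknown name such as `3Dutf-8=` does not raise `LookupError`.

```python
import codecs


def decode_leniently(content, encoding, fallback="utf-8"):
    """Decode bytes, using a fallback when the declared encoding is unknown.

    Hypothetical helper; the name and the UTF-8 default are assumptions.
    """
    try:
        # codecs.lookup raises LookupError for names Python does not know,
        # and TypeError if the name is not a string (for example None).
        codecs.lookup(encoding)
    except (LookupError, TypeError):
        encoding = fallback
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # instead of raising UnicodeDecodeError.
    return content.decode(encoding, errors="replace")


text = decode_leniently(b"caf\xc3\xa9", "3Dutf-8=")
```

The trade-off is that the fallback may be the wrong encoding, in which case the text decodes without an exception but can still contain replacement characters.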


LukeMurphey
Champion

@moseisleydk: thanks for the report.

Incidentally, I was unable to reproduce this on http://www.mos-eisley.dk today. Not sure if something changed.

This was still a valid bug report, though, as I was able to reproduce it by recreating the scenario based on the stack trace you provided.


moseisleydk
Path Finder

Excellent - looking forward to it. I still get the error on 4.5.1:

2018-01-27 07:29:58,776 ERROR An exception occurred when attempting to retrieve information from the web-page, stanza=web_input://www_mos_eisley_dk
Traceback (most recent call last):
  File "/splunk/etc/apps/website_input/bin/web_input.py", line 349, in run
    https_only=self.is_on_cloud(input_config.session_key))
  File "/splunk/etc/apps/website_input/bin/website_input_app/web_scraper.py", line 710, in scrape_page
    additional_fields=additional_fields, **kw)
  File "/splunk/etc/apps/website_input/bin/website_input_app/web_scraper.py", line 446, in get_result_single
    content_decoded = content.decode(encoding=encoding, errors='replace')
LookupError: unknown encoding: 3Dutf-8=


LukeMurphey
Champion

@moseisleydk: Would you mind testing 4.5.2? You can get the app here: https://github.com/LukeMurphey/splunk-web-input/releases/tag/4.5.2-rc.1

I want to make sure that this fixes the issue since I wasn't able to reproduce the issue on 4.5.1 with your website.


moseisleydk
Path Finder

Logs:

02/07/2018 21:14:00.529 INFO    The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/dashboard/\"
02/07/2018 21:14:00.529 INFO    The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/dashboard/\", encoding="cp1252"
02/07/2018 21:12:12.258 INFO    The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/feeds/network.action?username=bnp&max=40&publicFeed=false&os_authType=basic&rssType=atom", encoding="UTF-8"
02/07/2018 21:12:08.922 ERROR   An exception occurred when attempting to retrieve information from the web-page, stanza=web_input://www_mos_eisley_dk
02/07/2018 21:12:08.922 ERROR   A general exception was thrown when executing a web request
02/07/2018 21:12:08.921 ERROR   A general exception was thrown when executing a web request
02/07/2018 21:11:28.858 INFO    The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/plugins/inlinetasks/\"
02/07/2018 21:11:28.858 INFO    The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/plugins/inlinetasks/\", encoding="cp1252"
02/07/2018 21:11:27.651 INFO    The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/users/\"
02/07/2018 21:11:27.651 INFO    The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/users/\", encoding="cp1252"
02/07/2018 21:11:25.590 INFO    The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/spaces/\"
02/07/2018 21:11:25.590 INFO    The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/spaces/\", encoding="cp1252"
02/07/2018 21:11:12.047 WARNING Detected encoding was not recognized and the content will be evaluated (possibly with the wrong encoding), encoding_detected="Shift_JIS"
02/07/2018 21:11:08.724 WARNING Detected encoding was not recognized and the content will be evaluated (possibly with the wrong encoding), encoding_detected="3Dutf-8="
02/07/2018 21:10:59.968 INFO    The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/\"
02/07/2018 21:10:59.968 INFO    The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/\", encoding="cp1252"
02/07/2018 21:10:59.117 INFO    Running web input, url="http://www.mos-eisley.dk"

LukeMurphey
Champion

A little background on what is going on here: the encoding is not being detected properly. I set up the input to deal better with a bad encoding. However, since the input doesn't know the proper encoding, it fails to parse the output.

What is really weird is that I'm not getting the same repro. I have tried several times, but it never reproduces quite the same way.
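
One way an encoding name like `3Dutf-8=` could end up being detected is sketched below. This is a guess, not something confirmed in this thread: in quoted-printable text, `=3D` encodes `=` and a trailing `=` marks a soft line break, so a naive `charset=` pattern applied to such content would capture `3Dutf-8=`. The regex, the sample bytes, and the names `sniff_charset` and `is_known_encoding` are all hypothetical (Python 3), and the app's real detection logic may differ.

```python
import codecs
import re

# Naive pattern: take everything after "charset=" up to whitespace,
# a quote, ">" or ";". Hypothetical; not the app's detection code.
CHARSET_RE = re.compile(rb"charset=([^\s\"'>;]+)", re.IGNORECASE)


def sniff_charset(raw):
    """Return the first charset name found in the raw bytes, or None."""
    match = CHARSET_RE.search(raw)
    if match is None:
        return None
    return match.group(1).decode("ascii", errors="replace")


def is_known_encoding(name):
    """Report whether Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except (LookupError, TypeError):
        return False


# Made-up quoted-printable fragment: "=3D" stands for "=", and the
# final "=" before the line break is a soft line break marker.
sample = b"Content-Type: text/html; charset=3Dutf-8=\r\n"

detected = sniff_charset(sample)
print(detected, is_known_encoding(detected))
```

Validating the sniffed name before using it lets the caller log a warning and choose a fallback instead of failing inside `decode`.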


LukeMurphey
Champion

@moseisleydk: could you provide some more details? I'm sorry for the back-and-forth; I'm just struggling to get a solid repro. I tried today and got only a partial repro.

Here are some questions:

Are results coming through for any of the URLs?
You can try running the following search to check:
source="web_input://www_mos_eisley_dk" | table _time url match*

In my case, I am finding that I get results for everything but "http://www.mos-eisley.dk/dashboard/\\". That URL seems to just do a redirect to "http://www.mos-eisley.dk/dashboard/" which I do get results for.

What platform and version of Splunk is this running on?
I'm wondering if I cannot get an identical repro because I'm not on the same platform.


moseisleydk
Path Finder

Hi,

If needed, I can give you full access, mail me at npn@netic.dk or bnp@mos-eisley.dk

BR,

Normann


LukeMurphey
Champion

Ok, I'll hit you up on email.


LukeMurphey
Champion

Could you share the URL that you are using, if it is a publicly available one? I would like to reproduce this myself. It looks like the website is providing an invalid encoding, and the Website Input app doesn't handle that yet. I want to update the app to handle it more gracefully.


moseisleydk
Path Finder

It's http://www.mos-eisley.dk - feel free 🙂

Splunk 7.0.1

And feel free to ask for further info!


moseisleydk
Path Finder

BTW, it's Confluence from Atlassian.
