Hi folks, I get "Connection Timed Out" (or a null response code) quite a few times a day for a few of the websites that I monitor. I guess it is because the default connection time threshold in the app is too low for my websites. Could you please help me identify how to change the threshold wait time so that it does not throw a red alert too often? The websites might be slow sometimes, but they are usually up and working. Thanks in advance!
It depends on what you mean by threshold. There are two thresholds in play here (see below). If you are not seeing the response code, then your site is likely exceeding the HTTP connection timeout. That timeout is set to 30 seconds, which seems pretty high already; most users wouldn't wait this long for the site to respond.
Below are the two timeouts.
HTTP connection timeout
The modular input has a timeout that indicates when the code should stop waiting for the site to respond. This is hard-coded in the Python modular input code at 30 seconds. It is defined in web_ping.py in the constructor:
def __init__(self, timeout=30):
...
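As a rough illustration of how a timeout like this behaves, here is a minimal Python sketch. It is not the app's actual code: the class name WebPingSketch, the opener parameter, and the ping method are hypothetical, and only the timeout=30 default comes from the constructor shown above.

```python
import socket
import urllib.error
import urllib.request


class WebPingSketch:
    """Hypothetical stand-in for the input class; not the real web_ping.py."""

    def __init__(self, timeout=30, opener=urllib.request.urlopen):
        self.timeout = timeout  # seconds to wait before giving up
        self.opener = opener    # injectable so the sketch can run offline

    def ping(self, url):
        """Request the URL and report a response code or a timeout."""
        result = {"url": url, "response_code": None, "timed_out": False}
        try:
            with self.opener(url, timeout=self.timeout) as response:
                result["response_code"] = response.status
        except urllib.error.HTTPError as err:
            # The server answered, just with an error status (4xx/5xx).
            result["response_code"] = err.code
        except (socket.timeout, TimeoutError):
            # No answer within self.timeout seconds.
            result["timed_out"] = True
        except urllib.error.URLError as err:
            # Connection failed; urllib may wrap a timeout in URLError.
            result["timed_out"] = isinstance(
                err.reason, (socket.timeout, TimeoutError)
            )
        return result
```

With something like this, WebPingSketch(timeout=60).ping(url) would wait up to a minute. A result with no response_code means the site never answered in time, which is different from the site answering slowly.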
Response Time threshold
The app includes a macro that defines when a site's response time is considered too long. If a response time exceeds this value, the user interface considers the site too slow. You can modify this macro in the manager by going to Advanced search » Search macros » response_time_threshold. Note that this value is in milliseconds.
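To show how this threshold differs from the connection timeout, here is a small Python sketch of the comparison. The function name is_too_slow and its arguments are made up for illustration; the real check lives in the response_time_threshold macro. The sketch also assumes that a request which never got a response has no response time to compare.

```python
def is_too_slow(response_time_ms, threshold_ms):
    """Return True when a measured response time exceeds the threshold.

    Both values are in milliseconds. A missing response time (assumed
    here to mean the request never completed) is not counted as "slow";
    that case belongs to the connection timeout instead.
    """
    if response_time_ms is None:
        return False
    return response_time_ms > threshold_ms
```

For example, with a threshold of 1000, a response that took 4500 ms would be flagged as slow, while one that took 300 ms would not.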
I think you may be correct that a network issue or some other issue might be in play here. It is also possible that the website drops requests every once in a while. For example, I noticed that my web server had a very long response time every once in a while. It turned out that this was due to the cache being invalidated and needing to be refreshed.
I wonder if web browsers might hide this sort of problem by retrying the connection. I don't entirely understand when browsers retry connections, though I know they do in some cases.
I considered making the input retry connections if it doesn't get a response (it wouldn't be hard). However, it seems like this may hide actual problems.
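For reference, a retry could look roughly like the Python sketch below. This is only an illustration, not something the app does: ping_with_retries, its retries parameter, and the shape of the result dict are all hypothetical. Recording the number of attempts is one way to keep a flaky site visible instead of hiding the problem.

```python
def ping_with_retries(ping, url, retries=2):
    """Call ping(url) until it yields a response code or attempts run out.

    `ping` is any callable that returns a dict with a "response_code"
    key, set to None when no response was received.
    """
    retries = max(0, retries)  # always make at least one attempt
    attempts = 0
    result = None
    while attempts <= retries:
        attempts += 1
        result = ping(url)
        if result.get("response_code") is not None:
            break  # the site answered, so stop retrying
    # Keep the attempt count so retried requests can still be reported.
    result["attempts"] = attempts
    return result
```

A search could then alert on attempts > 1 separately from outright failures, so intermittent drops are not lost.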
You could filter connection outages from the alert search. That way, at least you wouldn't get emails regarding them. I just updated the docs to indicate how to do this.
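The idea behind that filter, written as a Python sketch rather than as a search: the function name should_alert and the ignore_connection_outages flag are hypothetical, and the field names response_code and timed_out are assumed to match the web_ping events.

```python
def should_alert(event, ignore_connection_outages=True):
    """Decide whether the latest event for a URL should trigger an alert."""
    code = event.get("response_code")
    timed_out = str(event.get("timed_out")) == "True"
    if code is None or timed_out:
        # No response at all: alert only if outages are not being ignored.
        return not ignore_connection_outages
    # Otherwise alert on anything outside the 2xx/3xx range.
    return not (200 <= int(code) < 400)
```

With outages ignored, only real HTTP error responses (for example a 500) would produce an email.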
Thanks a lot, Luke, for your quick and detailed response! I have altered my alert based on your suggestion and that seems to be working just fine.
Just a suggestion for a future release (not sure if it is feasible or appropriate): a user-configurable connection retry feature in the app would be great.
Thanks a lot, Luke. I usually get response_code as "Connection timed out" in the cases I was referring to.
I have set up an email alert for this condition. As a check, I open the website manually right after receiving the alert, only to find it working fine.
HTTP connection timeout: I agree that the 30-second threshold in web_ping.py is quite high and no change is needed there (manually opening the site never took that long, 4-5 seconds at most).
Response time threshold: I guess this does not apply in this case, as there is no integer value in response_code.
So, I am wondering whether a random network or other issue causes this, whether there is some logical error in the alert, or whether I am missing something else here. The alert search is:
sourcetype="web_ping" earliest=-15m@m | fillnull response_code value="Connection failed" | eval response_code=if(timed_out == "True", "Connection timed out", response_code) | stats latest(response_code) as response_code latest(_time) as last_checked latest(title) as title latest(total_time) as response_time by url | table title url response_code last_checked | `timesince(last_checked,last_checked)`
| search NOT (response_code>=200 AND response_code<400)
Thanks in advance!