We seem to be having a very strange issue. We're running 4.0.9 (hopefully upgrading soon; it has to pass validation first), but this also seemed to happen on 4.1.6, which we tried recently.
Our server sits inside our data center. We can reach it either via a direct IP address (which the Checkpoint firewall allows us to access) or via an Intranet address, which is a NAT address. Both use TCP port 80.
If we use the direct IP address (that is, the server's true IP address) via the Checkpoint firewall, all searches work perfectly without issue.
However, if we execute the same search via the NAT'd Intranet address, these same specific searches cause the web session to reset, resulting in a perpetually spinning search icon. The only way to recover is to edit the URL, erasing the flashtimeline# part, and then either hit Enter or refresh the web page.
In a tcpdump capture I can see that Splunk sends a RST just after the search is executed. The very last thing the client sends is this request to the Splunk server:
POST /en-US/api/search/jobs HTTP/1.1
In checking the web_service.log file, the following output is observed when the issue occurs:
2011-04-02 01:26:15,276 ERROR customlogmanager:22 - [02/Apr/2011:01:26:15] HTTP
USER-AGENT: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; EmbeddedWB 14.52 from: http://www.bsalsa.com/ EmbeddedWB 14.52; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.3; InfoPath.2; MS-RTC EA 2)
ACCEPT-ENCODING: gzip, deflate
2011-04-02 01:26:15,292 ERROR customlogmanager:22 - [02/Apr/2011:01:26:15] HTTP Traceback (most recent call last):
File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\_cprequest.py", line 600, in respond
File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\_cprequest.py", line 722, in process_body
File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\_cpcgifs.py", line 8, in __init__
cgi.FieldStorage.__init__(self, *args, **kwds)
File "D:\Program Files\Splunk\Python-2.6\Lib\cgi.py", line 506, in __init__
File "D:\Program Files\Splunk\Python-2.6\Lib\cgi.py", line 607, in read_urlencoded
qs = self.fp.read(self.length)
File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\wsgiserver\__init__.py", line 206, in read
data = self.rfile.read(size)
File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\wsgiserver\__init__.py", line 798, in read
data = self.recv(left)
File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\wsgiserver\__init__.py", line 754, in recv
error: [Errno 10054] An existing connection was forcibly closed by the remote host
The 10.x address you see in the output is the NAT address we use on our Intranet. The server's actual address is something else, and as I mentioned, when we use that one, it works fine. We need the Intranet NAT address to work because not everyone who uses Splunk has access to come in via the firewall.
I've also tried different browsers (IE, FF, Chrome) and the same thing happens, so clearly this is something with the HTTP server itself.
When there is a stateful firewall (especially one with NAT) in the middle, one tcpdump is never enough. You should capture a tcpdump on each device and compare and contrast. For example, you will need to confirm that the RST is coming from the Splunk host itself - and not being generated "helpfully" by the Checkpoint.
It is not uncommon for software defects in NAT devices to either foul up the content of a TCP session, or to get confused by the content of one.
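As a rough illustration of the compare-and-contrast step, here is a minimal Python sketch. It assumes both captures were printed as text with something like `tcpdump -n` in the usual one-line format, and that the web port is 80 as in the question. The function names (`find_rsts`, `rst_origin`) and all addresses in the sample data are made up for the example; they are not taken from your captures. It only checks whether a RST that the client sees "from" the server actually left the server host.

```python
import re

# One-line text format printed by recent tcpdump versions with -n, e.g.
# 01:26:15.276000 IP 10.0.0.5.80 > 192.0.2.10.51234: Flags [R], seq 1, win 0, length 0
LINE_RE = re.compile(
    r"^(?P<ts>\S+) IP (?P<src>\S+) > (?P<dst>\S+): Flags \[(?P<flags>[^\]]*)\]"
)


def find_rsts(lines):
    """Return (timestamp, src, dst) for every packet with the RST flag set."""
    rsts = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and "R" in m.group("flags"):
            rsts.append((m.group("ts"), m.group("src"), m.group("dst")))
    return rsts


def rst_origin(client_side, server_side, server_port="80"):
    """Guess whether a RST seen by the client really left the server host."""

    def from_server(rsts):
        # With -n, tcpdump prints endpoints as address.port
        return [r for r in rsts if r[1].rsplit(".", 1)[-1] == server_port]

    seen_by_client = from_server(find_rsts(client_side))
    sent_by_server = from_server(find_rsts(server_side))
    if not seen_by_client:
        return "no RST from the server port in the client-side capture"
    if sent_by_server:
        return "server host sent a RST"
    return "RST not present on the server host; suspect a device in the path"


# Hypothetical sample data: 10.0.0.5 stands in for the NAT address,
# 172.16.0.20 for the server's real address, 192.0.2.10 for the client.
CLIENT_SIDE = [
    "01:26:15.270000 IP 192.0.2.10.51234 > 10.0.0.5.80: Flags [P.], seq 1:500, ack 1, win 256, length 499",
    "01:26:15.276000 IP 10.0.0.5.80 > 192.0.2.10.51234: Flags [R], seq 1, win 0, length 0",
]
SERVER_SIDE = [
    "01:26:15.271000 IP 192.0.2.10.51234 > 172.16.0.20.80: Flags [P.], seq 1:500, ack 1, win 256, length 499",
    "01:26:15.277000 IP 192.0.2.10.51234 > 172.16.0.20.80: Flags [R], seq 500, win 0, length 0",
]

print(rst_origin(CLIENT_SIDE, SERVER_SIDE))
```

In the made-up sample, each side sees a RST that appears to come from the other end, which is what you would expect if a device in the middle reset both halves of the session. Note also that errno 10054 in your web_service.log is the Windows "connection reset by peer" error, raised while the server was still reading the POST body, so it is at least worth checking whether the Splunk host received a reset rather than sent one.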