We seem to be having a very strange issue. We're running 4.0.9 (hopefully upgrading soon... it has to pass validation first), but the same thing happened on 4.1.6, which we tried recently.
Our server sits inside our data center. We can reach it either via a direct IP address (which the Checkpoint firewall allows us to access) or via an Intranet address, which is a NAT address. Both use TCP port 80.
If we use the direct IP address (that is, the server's true IP address) via the Checkpoint firewall, all searches work perfectly without issue.
However, if we execute the same search via the NAT'd Intranet address, these same specific searches will cause the web session to reset, resulting in a perpetually spinning search icon. The only way to recover is to backspace the URL, erasing the flashtimeline# part and just hitting enter or refreshing the web page.
In a tcpdump capture I can see that Splunk sends a RST just after the search is executed. The very last thing the client sends is the request to the Splunk server:
POST /en-US/api/search/jobs HTTP/1.1
In checking the web_service.log file, the following output is observed when the issue occurs:
2011-04-02 01:26:15,276 ERROR customlogmanager:22 - [02/Apr/2011:01:26:15] HTTP
Request Headers:
ACCEPT: application/json, text/javascript, */*, text/javascript, text/html, application/xml, text/xml, */*
Content-Type: application/x-www-form-urlencoded
REFERER: http://10.1.17.150/en-US/app/Explorer_Support/flashtimeline
HOST: 10.1.17.150
CACHE-CONTROL: no-cache
X-REQUESTED-WITH: XMLHttpRequest
Content-Length: 810
USER-AGENT: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; EmbeddedWB 14.52 from: http://www.bsalsa.com/ EmbeddedWB 14.52; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.3; InfoPath.2; MS-RTC EA 2)
CONNECTION: Keep-Alive
COOKIE: session_id_80=e4b300e88879490b0fb97ac7553d183f736ba109
Remote-Addr: 10.65.146.117
ACCEPT-LANGUAGE: en-us
X-SPLUNK-SESSION: e4b300e88879490b0fb97ac7553d183f736ba109
ACCEPT-ENCODING: gzip, deflate
2011-04-02 01:26:15,292 ERROR customlogmanager:22 - [02/Apr/2011:01:26:15] HTTP Traceback (most recent call last):
File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\_cprequest.py", line 600, in respond
self.process_body()
File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\_cprequest.py", line 722, in process_body
keep_blank_values=1)
File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\_cpcgifs.py", line 8, in __init__
cgi.FieldStorage.__init__(self, *args, **kwds)
File "D:\Program Files\Splunk\Python-2.6\Lib\cgi.py", line 506, in __init__
self.read_urlencoded()
File "D:\Program Files\Splunk\Python-2.6\Lib\cgi.py", line 607, in read_urlencoded
qs = self.fp.read(self.length)
File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\wsgiserver\__init__.py", line 206, in read
data = self.rfile.read(size)
File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\wsgiserver\__init__.py", line 798, in read
data = self.recv(left)
File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\wsgiserver\__init__.py", line 754, in recv
return self._sock.recv(size)
error: [Errno 10054] An existing connection was forcibly closed by the remote host
The 10.x address you see in the output is the NAT address we use on our Intranet. The server's actual address is something else, and as I mentioned, when we use that one, it works fine. We need the Intranet NAT address to work because not everyone who uses Splunk has access to come in via the firewall.
I've also tried different browsers (IE, FF, Chrome) and the same thing happens, so clearly this is something with the HTTP server itself.
Any ideas?
When there is a stateful firewall (especially one with NAT) in the middle, one tcpdump is never enough. You should capture a tcpdump on each device and compare and contrast. For example, you will need to confirm that the RST is coming from the Splunk host itself - and not being generated "helpfully" by the Checkpoint.
It is not uncommon for software defects in NAT devices to either foul up the content of a TCP session, or to get confused by the content of one.
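To make the "compare and contrast" step concrete, here is a rough Python sketch. It assumes you have already exported the relevant TCP segments from each capture into simple records; the `Segment` type, the `injected_resets` helper, the 192.0.2.x addresses, and the sequence numbers are all hypothetical, made up for illustration. It also assumes the NAT device leaves TCP sequence numbers untouched, which may not hold for every firewall configuration. The idea: a RST the client received that never appears in the server-side capture was probably generated somewhere in between.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """One TCP segment as exported from a capture (hypothetical record layout)."""
    src: str    # source IP as seen at the capture point
    dst: str    # destination IP as seen at the capture point
    flags: str  # TCP flags, e.g. "S", "PA", "R", "RA"
    seq: int    # raw (absolute) TCP sequence number


def resets(capture: List[Segment]) -> List[Segment]:
    """Return every segment in the capture that has the RST flag set."""
    return [s for s in capture if "R" in s.flags]


def injected_resets(client_capture: List[Segment],
                    server_capture: List[Segment]) -> List[Segment]:
    """Return RSTs seen on the client side with no matching RST on the server side.

    Segments are matched by sequence number only, because NAT rewrites the
    addresses. This assumes the device in the middle does not also rewrite
    sequence numbers.
    """
    server_rst_seqs = {s.seq for s in resets(server_capture)}
    return [s for s in resets(client_capture) if s.seq not in server_rst_seqs]


# Made-up example data: the client sees a RST that appears to come from the
# NAT address, but the server capture only shows a different RST arriving
# "from the client". Neither end actually sent one.
client_side = [
    Segment("192.0.2.50", "10.1.17.150", "PA", 1000),
    Segment("10.1.17.150", "192.0.2.50", "R", 5001),
]
server_side = [
    Segment("192.0.2.50", "192.0.2.10", "PA", 1000),
    Segment("192.0.2.50", "192.0.2.10", "R", 1810),
]

suspects = injected_resets(client_side, server_side)
for s in suspects:
    print("RST not found in server capture:", s)
```

If the list comes back empty, the RSTs the client saw really did leave the Splunk host, and the server is the place to keep digging. If it is not empty, the firewall becomes the prime suspect.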
Do you have/can you get a pcap dump from tcpdump of traffic to/from port 80 from the machine Splunk is running on?