NAT IP causes certain searches to crash CherryPy application server

TR_Splunker · ‎04-02-2011

Very strange issue we seem to be having. We're running 4.0.9 (hopefully upgrading soon... has to pass validation first) but this seemed to happen on 4.1.6 we tried recently also.

Our server sits inside our data center. We can reach it either via a direct IP address (which the Checkpoint firewall allows us to access) or via an Intranet address, which is a NAT address. Both use TCP port 80.

If we use the direct IP address (that is, the server's true IP address) via the Checkpoint firewall, all searches work perfectly without issue.

However, if we execute the same search via the NAT'd Intranet address, these same specific searches cause the web session to reset, resulting in a perpetually spinning search icon. The only way to recover is to edit the URL, erase the flashtimeline# part, and either hit enter or refresh the web page.

In a tcpdump capture I can see Splunk send a RST just after the search is executed. The very last thing the client sends is this request to the Splunk server:

POST /en-US/api/search/jobs HTTP/1.1

In checking the web_service.log file, the following output is observed when the issue occurs:

2011-04-02 01:26:15,276 ERROR customlogmanager:22 - [02/Apr/2011:01:26:15] HTTP

Request Headers:

ACCEPT: application/json, text/javascript, */*, text/javascript, text/html, application/xml, text/xml, */*

Content-Type: application/x-www-form-urlencoded

REFERER: http://10.1.17.150/en-US/app/Explorer_Support/flashtimeline

HOST: 10.1.17.150

CACHE-CONTROL: no-cache

X-REQUESTED-WITH: XMLHttpRequest

Content-Length: 810

USER-AGENT: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; EmbeddedWB 14.52 from: http://www.bsalsa.com/ EmbeddedWB 14.52; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.3; InfoPath.2; MS-RTC EA 2)

CONNECTION: Keep-Alive

COOKIE: session_id_80=e4b300e88879490b0fb97ac7553d183f736ba109

Remote-Addr: 10.65.146.117

ACCEPT-LANGUAGE: en-us

X-SPLUNK-SESSION: e4b300e88879490b0fb97ac7553d183f736ba109

ACCEPT-ENCODING: gzip, deflate

2011-04-02 01:26:15,292 ERROR customlogmanager:22 - [02/Apr/2011:01:26:15] HTTP Traceback (most recent call last):
  File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\_cprequest.py", line 600, in respond
    self.process_body()
  File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\_cprequest.py", line 722, in process_body
    keep_blank_values=1)
  File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\_cpcgifs.py", line 8, in __init__
    cgi.FieldStorage.__init__(self, *args, **kwds)
  File "D:\Program Files\Splunk\Python-2.6\Lib\cgi.py", line 506, in __init__
    self.read_urlencoded()
  File "D:\Program Files\Splunk\Python-2.6\Lib\cgi.py", line 607, in read_urlencoded
    qs = self.fp.read(self.length)
  File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\wsgiserver\__init__.py", line 206, in read
    data = self.rfile.read(size)
  File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\wsgiserver\__init__.py", line 798, in read
    data = self.recv(left)
  File "D:\Program Files\Splunk\Python-2.6\Lib\site-packages\cherrypy\wsgiserver\__init__.py", line 754, in recv
    return self._sock.recv(size)

error: [Errno 10054] An existing connection was forcibly closed by the remote host
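
For anyone reading the traceback: it ends in self._sock.recv(size) underneath read_urlencoded, which suggests the server was still reading the 810-byte POST body when the reset arrived. Errno 10054 is the Winsock code for a connection reset by the peer. Below is a minimal Python 3 sketch of that failure mode on loopback. It is not Splunk's or CherryPy's code; the function names, the 100-byte partial send, and the use of SO_LINGER to force a RST are all assumptions made for illustration.

```python
import socket
import struct
import sys


def read_body(conn, length):
    """Read exactly `length` bytes, the way a server reads a POST body."""
    chunks = []
    remaining = length
    while remaining > 0:
        data = conn.recv(remaining)
        if not data:
            raise EOFError("peer closed before the full body arrived")
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)


def simulate_reset_mid_body(content_length=810, bytes_sent=100):
    """Hypothetical demo: the peer resets after sending part of the body."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = listener.accept()
    conn.settimeout(5)  # never hang forever if no reset shows up

    # Send only part of the promised body.
    client.sendall(b"x" * bytes_sent)

    # l_onoff=1, l_linger=0 makes close() send a RST instead of a FIN.
    # The linger struct uses 16-bit fields on Windows, ints elsewhere.
    fmt = "hh" if sys.platform == "win32" else "ii"
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                      struct.pack(fmt, 1, 0))
    client.close()

    try:
        read_body(conn, content_length)
        outcome = "complete"
    except ConnectionResetError:
        # Surfaces as errno 10054 on Windows, ECONNRESET elsewhere.
        outcome = "reset"
    except EOFError:
        outcome = "eof"
    except socket.timeout:
        outcome = "timeout"
    finally:
        conn.close()
        listener.close()
    return outcome


if __name__ == "__main__":
    print(simulate_reset_mid_body())
```

The sketch only shows that a reset received while the body is still being read produces this kind of error in the reader. It says nothing about which device actually generated the RST in the capture above.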

The 10.x address you see in the output is the NAT address we use on our Intranet. The server's actual address is something else, and as I mentioned, everything works fine when we use that one. We need the Intranet NAT address to work because not everyone who uses Splunk has access through the firewall.

I've also tried different browsers (IE, FF, Chrome) and the same thing happens, so clearly this is something with the HTTP server itself.

Any ideas?

dwaddle · ‎04-03-2011

When there is a stateful firewall (especially one with NAT) in the middle, one tcpdump is never enough. You should capture a tcpdump on each device and compare and contrast. For example, you will need to confirm that the RST is coming from the Splunk host itself - and not being generated "helpfully" by the Checkpoint.

It is not uncommon for software defects in NAT devices to either foul up the content of a TCP session, or to get confused by the content of one.
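
A rough sketch of how the per-device comparison could be started, assuming each capture has been saved as tcpdump's text output (run with -nn) so that every packet line contains "IP src > dst: Flags [...]". The regular expression, the function name, and the sample lines are hypothetical; 192.0.2.10 is a documentation address standing in for the server's true IP, which the post does not give.

```python
import re

# Matches tcpdump text output such as:
# "01:26:15.290000 IP 10.65.146.117.3312 > 192.0.2.10.80: Flags [R], seq 401, win 0, length 0"
PACKET = re.compile(
    r"IP (?P<src>\S+) > (?P<dst>\S+): Flags \[(?P<flags>[^\]]*)\]"
)


def find_resets(lines):
    """Return (src, dst) for every packet whose TCP flags include RST."""
    resets = []
    for line in lines:
        match = PACKET.search(line)
        if match and "R" in match.group("flags"):
            resets.append((match.group("src"), match.group("dst")))
    return resets


# Hypothetical lines from a capture taken on the Splunk host itself.
server_side = [
    "01:26:15.276000 IP 10.65.146.117.3312 > 192.0.2.10.80: "
    "Flags [P.], seq 1:401, ack 1, win 65535, length 400",
    "01:26:15.290000 IP 10.65.146.117.3312 > 192.0.2.10.80: "
    "Flags [R], seq 401, win 0, length 0",
]

for src, dst in find_resets(server_side):
    print("RST from", src, "to", dst)
```

Running the same function over the capture from each device shows which side each one believes sent the RST. If the Splunk host's capture only shows resets arriving and none leaving, the reset was generated somewhere else along the path.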

gareth · ‎04-02-2011

Do you have/can you get a pcap dump from tcpdump of traffic to/from port 80 from the machine Splunk is running on?
