Hi,
We have a Splunk app that exposes a REST endpoint for other applications to request metrics.
The main piece of Python code inside the handler method is:
# module-level imports used by the handler
import json
import time
from splunklib import results

# inside the handler method:
service = self.getService()
searchjob = service.jobs.create(searchquery)

# poll until the search job completes
while not searchjob.is_done():
    time.sleep(5)

reader = results.ResultsReader(searchjob.results(count=0))
response_data = {}
response_data["results"] = []
for result in reader:
    if isinstance(result, dict):
        response_data["results"].append(result)
    elif isinstance(result, results.Message):
        # log the message itself, not the Message class
        mylogger.info("action=runSearch, search = %s, msg = %s" % (searchquery, result))

search_dict["searchjob"] = searchjob
search_dict["searchresults"] = json.dumps(response_data)
The dependent application invokes the REST API at scheduled intervals; there are close to 150 calls spread across various time intervals.
Note: at any point in time there are at most 6 concurrent search requests.
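For illustration, the remote side (a Tornado-based Python service, more on that under Additional Info below) polls the endpoint roughly like this; the URL, credentials and interval here are placeholders, not the real values:

import logging
from tornado import ioloop
from tornado.httpclient import AsyncHTTPClient

METRICS_URL = "https://splunk-host:8000/custom/my_app/my_endpoint/metrics"  # placeholder

async def poll_metrics():
    response = await AsyncHTTPClient().fetch(
        METRICS_URL,
        auth_username="svc_user", auth_password="***",  # placeholder credentials
        validate_cert=False)  # self-signed Splunk cert assumed
    logging.info("metrics response: %d bytes", len(response.body))

ioloop.PeriodicCallback(poll_metrics, 60 * 1000).start()  # one of ~150 scheduled calls
ioloop.IOLoop.current().start()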
Normal scenarios:
Remote application and my Splunk app are both up and running - everything is fine.
If I have to restart the remote application for some reason, then after the restart both are up and running - everything is fine.
If I have to restart my Splunk process for some reason, then after the restart both applications are up and running - everything is fine.
Problematic scenario:
The problem starts when the system running the remote application is rebooted. After the reboot, the remote application starts making calls to the Splunk app, and within about 60 minutes the number of CLOSE_WAIT connections reaches 700+. Eventually the Splunk system starts throwing socket errors, and Splunk Web becomes inaccessible.
Additional Info:
The remote application is a Python application written with the Tornado framework. It runs inside a Docker container managed by Kubernetes.
ulimit -n on the Splunk system shows 1024. (I know this is lower than Splunk's recommendation, but I would like to understand why the issue occurs only after the remote system reboots.)
During normal operation, searches take on average 7s to complete. While the remote machine is rebooting, searches take on average 14s. (It may not make sense to relate a remote system reboot to search performance on the Splunk system, but that is the trend.)
The CLOSE_WAIT connections are all internal TCP connections to splunkd's management port (8089):
tcp 1 0 127.0.0.1:8089 127.0.0.1:37421 CLOSE_WAIT 0 167495826 28720/splunkd
tcp 1 0 127.0.0.1:8089 127.0.0.1:32869 CLOSE_WAIT 0 167449474 28720/splunkd
tcp 1 0 127.0.0.1:8089 127.0.0.1:37567 CLOSE_WAIT 0 167497280 28720/splunkd
tcp 1 0 127.0.0.1:8089 127.0.0.1:33086 CLOSE_WAIT 0 167451533 28720/splunkd
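A small script like the one below can be used to watch the build-up over time (a sketch assuming psutil is available on the Splunk host; it is not part of the app itself):

import time
import psutil  # assumption: psutil is installed on the Splunk host

def count_close_wait(port=8089):
    # count splunkd-side sockets stuck in CLOSE_WAIT on the management port
    return sum(1 for c in psutil.net_connections(kind="tcp")
               if c.status == psutil.CONN_CLOSE_WAIT and c.laddr and c.laddr.port == port)

while True:
    print(time.strftime("%H:%M:%S"), "CLOSE_WAIT on 8089:", count_close_wait())
    time.sleep(60)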
Any help or pointers are highly appreciated.
Thanks,
Strive