Regards
Robert
Hi
Thanks for your comments.
The peer works when we open one screen, but when we increase it to 5 screens for a load test we get the message. All machines have 56 cores and plenty of RAM.
A heavy screen is one that runs 40 searches when it is opened. Of those, 20 finish in less than 1 second and about 5 normally take 5 seconds to complete.
Regards
Robert
Hi - Thanks for the question.
We have 11 TB of SSD on each node, with no separate storage subsystem.
I am assuming this is ok and should give me the performance I need?
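To measure rather than assume, I could run a quick random I/O benchmark on the hot/warm volume with fio and compare the result against the reference-hardware IOPS figure in the Splunk docs. A minimal sketch (the test-file path is just an example location on the volume being tested):

# 60-second 4k random read/write test with direct I/O, reporting aggregate IOPS
fio --name=splunk-io-test --filename=/opt/splunk/var/fio.test --size=1G \
    --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=64 \
    --runtime=60 --time_based --group_reporting
rm /opt/splunk/var/fio.test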
Rob
This doesn't happen on standalone instances because they don't use distributed search.
Have you checked the indexer at 10.25.57.21? Is it up and listening on port 8089? Is a firewall blocking communications to that address/port? Does the indexer run low on resources when processing 5 heavy screens? What is a "heavy screen"?
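A quick way to verify the first two points from the search head is something like this (the address and port are taken from your setup; curl answers 401 without credentials, which is still enough to prove splunkd is listening):

# is the management port reachable at all?
nc -vz 10.25.57.21 8089
# does splunkd answer on it? (-k because the management cert is usually self-signed)
curl -k https://10.25.57.21:8089/services/server/info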
Hi
Thanks for your questions.
Yes, the peer is up on 10.25.57.21 and works when we load a screen on its own, and it is fast - just like production. A heavy screen can launch 40 searches in parallel. 20 finish within 1 second and about 5 take 20 seconds when the screen is loaded on its own and not as part of a load test (load test = 5 screens in parallel).
The CPU and RAM on the INDEXER do not move much, nor does the network.
When we open up other screens we get "Waiting for queued job to start" - but we have given this user a lot of capacity (a rough way to check the quotas actually in effect is sketched below)...
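For completeness, the quotas actually in effect can be checked with btool, roughly like this (the role name "power" is only an example - substitute the user's real role):

# per-role concurrent search quotas from authorize.conf
$SPLUNK_HOME/bin/splunk btool authorize list role_power --debug | grep -i srch
# system-wide search concurrency limits from limits.conf
$SPLUNK_HOME/bin/splunk btool limits list search --debug | grep -iE 'max_searches|base_max'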
Below is the network activity before and after the test. I am not sure if this is ok or not
Hi, did you ever get this fixed, or any response?
We see the same problem with "Unknown error" and also "Broken pipe" errors etc. on several Splunk installations that are on different versions.
I thought this was related to a networking issue, but I see the same pattern on several installations that are not part of the same infrastructure.
The servers are highly specced in terms of CPU, memory, network and disk. All OS and Splunk parameters are set by the book.
Just wondering if you have found a fix or a source for the problems.
Hi
There are many reasons why this can happen. Some are based on Splunk architecture and some could be errors. Can you describe your environment and what you see in your MC (Monitoring Console) at these times?
r. Ismo
No errors in the monitoring console. I see this only in distributed setups, so no errors when searching locally on an all-in-one box.
The worst installation has 1500+ broken pipe errors like these:
09-09-2021 13:32:18.574 +0200 WARN HttpListener - Socket error from "search-head-ip":42800 while accessing /services/streams/search: Broken pipe
and this type of ERROR:
09-09-2021 13:40:33.742 +0200 ERROR HttpListener - Exception while processing request from "search-head-ip":45930 for /services/search/jobs/remote_splunksh03."domainname"_subsearch_nested_8294df4a2b86339a_1631187558.3/search.log: Broken pipe
Other installations have between 20 and 50 of those per 24 hours.
These warnings and errors are logged a lot when searches start failing with errors like this (from search.log):
09-09-2021 14:00:45.634 ERROR SearchResultTransaction - Got status 502 from https://"indexer-ip":8089/services/streams/search?sh_sid=1631188834.20962_7D6DF087-C582-4B67-A82D-BD1F18B5BEA5
09-09-2021 14:00:45.634 INFO XmlParser - Entity: line 1: parser: Document is empty
09-09-2021 14:00:45.635 ERROR SearchResultParser - HTTP error status message from https://"indexer-ip":8089/services/streams/search?sh_sid=1631188834.20962_7D6DF087-C582-4B67-A82D-BD1F18B5BEA5: Error connecting: Connect Timeout
09-09-2021 14:00:45.635 WARN SearchResultParserExecutor - Error connecting: Connect Timeout for collector=splunkidx01.domainname
09-09-2021 14:00:45.635 ERROR DispatchThread - sid:1631188834.20962_7D6DF087-C582-4B67-A82D-BD1F18B5BEA5 Unknown error for indexer: splunkidx01.domainname. Search Results might be incomplete! If this occurs frequently, check on the peer.
The peers are always up, and the ulimit, thread limit and socket limit are all OK when viewed in the splunkd log while Splunk is starting. (It is systemd-managed on Ubuntu 20.04 LTS, so the ulimits are set there, and the open-files limit is 65535; a quick way to double-check the effective limits is sketched below.)
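For what it's worth, this is roughly how the limits applied to the running process can be double-checked, rather than relying on the shell's ulimit (Splunkd.service is the default unit name for a systemd-managed install and may differ):

# limits actually applied to the running splunkd process
cat /proc/$(pgrep -o -x splunkd)/limits | grep -Ei 'open files|processes'
# what the systemd unit grants
systemctl show Splunkd -p LimitNOFILE,LimitNPROC
# any explicit overrides of the REST server's thread/socket caps in server.conf
$SPLUNK_HOME/bin/splunk btool server list httpServer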
The servers (search heads and indexers) have 80 cores (40 hyperthreaded, dual socket), 386 GB of RAM, and pure SSD storage.
Search head cluster with 3 nodes.
Indexer cluster with 4 nodes.
We see no other network errors. No problems with replication data over port 9887 in the indexer cluster, and no problems ingesting data.