Regards
Robert
Hi
Thanks for your comments.
The peer works when we open one screen, but when we increase it to 5 screens for a load test we get the message. All machines have 56 cores and plenty of RAM.
A heavy screen is one that runs 40 searches when it is opened. Of those, 20 finish in less than 1 second and about 5 normally take 5 seconds to complete.
Regards
Robert
Hi - Thanks for the question.
We have 11 TB of SSD on each node, with no separate storage subsystem.
I am assuming this is ok and should give me the performance I need?
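To measure rather than assume, I could run a quick random I/O benchmark on the hot/warm volume with fio and compare the result against the reference-hardware IOPS figure in the Splunk docs. A minimal sketch (the test-file path is just an example location on the volume being tested):

# 60-second 4k random read/write test with direct I/O, reporting aggregate IOPS
fio --name=splunk-io-test --filename=/opt/splunk/var/fio.test --size=1G \
    --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=64 \
    --runtime=60 --time_based --group_reporting
rm /opt/splunk/var/fio.test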
Rob
This doesn't happen on standalone instances because they don't use distributed search.
Have you checked the indexer at 10.25.57.21? Is it up and listening on port 8089? Is a firewall blocking communications to that address/port? Does the indexer run low on resources when processing 5 heavy screens? What is a "heavy screen"?
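A quick way to verify the first two points from the search head is something like this (the address and port are taken from your setup; curl answers 401 without credentials, which is still enough to prove splunkd is listening):

# is the management port reachable at all?
nc -vz 10.25.57.21 8089
# does splunkd answer on it? (-k because the management cert is usually self-signed)
curl -k https://10.25.57.21:8089/services/server/info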
Hi
Thanks for your questions.
Yes, the peer is up on 10.25.57.21 and works when we load a screen on its own, and it is fast - just like production. A heavy screen can launch 40 searches in parallel. 20 finish within 1 second and about 5 take 20 seconds when the screen is loaded on its own and not as part of a load test (load test = 5 screens in parallel).
The CPU and RAM on the INDEXER do not move much, nor does the network.
When we open up other screens we get "Waiting for queued job to start" - but we have given this user a lot of capacity (a rough way to check the quotas actually in effect is sketched below)...
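For completeness, the quotas actually in effect can be checked with btool, roughly like this (the role name "power" is only an example - substitute the user's real role):

# per-role concurrent search quotas from authorize.conf
$SPLUNK_HOME/bin/splunk btool authorize list role_power --debug | grep -i srch
# system-wide search concurrency limits from limits.conf
$SPLUNK_HOME/bin/splunk btool limits list search --debug | grep -iE 'max_searches|base_max'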
Below is the network activity before and after the test. I am not sure if this is ok or not
Hi, did you ever get this fixed, or any response?
We see the same problem with "Unknown error" and also "Broken pipe" errors etc. on several Splunk installations that are on different versions.
I thought this was related to a networking issue, but I see the same pattern on several installations that are not part of the same infrastructure.
The servers are highly specced in terms of CPU, memory, network and disk. All OS and Splunk parameters are set by the book.
Just wondering if you have found a fix or a source for the problems.
Hi
There are many reasons why this can happen. Some are based on Splunk architecture and some could be errors. Can you describe your environment and what you see in your MC (Monitoring Console) at these times?
r. Ismo
No errors in the monitoring console. I see this only in distributed setups, so no errors when searching locally on an all-in-one box.
The worst installation has 1500+ broken pipe errors like these:
09-09-2021 13:32:18.574 +0200 WARN HttpListener - Socket error from "search-head-ip":42800 while accessing /services/streams/search: Broken pipe
and this type of ERROR:
09-09-2021 13:40:33.742 +0200 ERROR HttpListener - Exception while processing request from "search-head-ip":45930 for /services/search/jobs/remote_splunksh03."domainname"_subsearch_nested_8294df4a2b86339a_1631187558.3/search.log: Broken pipe
Other installations have between 20 and 50 of those per 24 hours.
These warnings and errors are logged a lot when searches start failing with errors like this (from search.log):
09-09-2021 14:00:45.634 ERROR SearchResultTransaction - Got status 502 from https://"indexer-ip":8089/services/streams/search?sh_sid=1631188834.20962_7D6DF087-C582-4B67-A82D-BD1F18B5BEA5
09-09-2021 14:00:45.634 INFO XmlParser - Entity: line 1: parser: Document is empty
09-09-2021 14:00:45.635 ERROR SearchResultParser - HTTP error status message from https://"indexer-ip":8089/services/streams/search?sh_sid=1631188834.20962_7D6DF087-C582-4B67-A82D-BD1F18B5BEA5: Error connecting: Connect Timeout
09-09-2021 14:00:45.635 WARN SearchResultParserExecutor - Error connecting: Connect Timeout for collector=splunkidx01.domainname
09-09-2021 14:00:45.635 ERROR DispatchThread - sid:1631188834.20962_7D6DF087-C582-4B67-A82D-BD1F18B5BEA5 Unknown error for indexer: splunkidx01.domainname. Search Results might be incomplete! If this occurs frequently, check on the peer.
The peers are always up, and the ulimit, thread limit and socket limit are all OK when viewed in the splunkd log while Splunk is starting. (It is systemd-managed on Ubuntu 20.04 LTS, so the ulimits are set there, and the open-files limit is 65535; a quick way to double-check the effective limits is sketched below.)
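For what it's worth, this is roughly how the limits applied to the running process can be double-checked, rather than relying on the shell's ulimit (Splunkd.service is the default unit name for a systemd-managed install and may differ):

# limits actually applied to the running splunkd process
cat /proc/$(pgrep -o -x splunkd)/limits | grep -Ei 'open files|processes'
# what the systemd unit grants
systemctl show Splunkd -p LimitNOFILE,LimitNPROC
# any explicit overrides of the REST server's thread/socket caps in server.conf
$SPLUNK_HOME/bin/splunk btool server list httpServer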
The servers (search heads and indexers) have 80 cores (40 hyperthreaded, dual socket), 386 GB of RAM, and pure SSD storage.
Search head cluster with 3 nodes.
Indexer cluster with 4 nodes.
We see no other network errors. No problems with replication data over port 9887 in the indexer cluster, and no problems ingesting data.