Splunk Search

Unknown error for peer xxx. Search Results might be incomplete

cwl
Contributor

I am using Splunk 6.4.2 in a distributed environment with 2 search heads and 12 indexers. When I run long-running searches, the following errors sometimes appear in the UI. What causes them, and how can I fix them?

Unknown error for peer indexer1. Search Results might be incomplete. If this occurs frequently, please check on the peer.
Unknown error for peer indexer2. Search Results might be incomplete. If this occurs frequently, please check on the peer.
Unknown error for peer indexer3. Search Results might be incomplete. If this occurs frequently, please check on the peer.

cwl
Contributor

If search.log, with DEBUG output enabled, contains a message like the following, you may be hitting the issue tracked as SPL-124350.

DEBUG DistributedSearchResultCollector - Pausing transaction on write to uri: http://xx.xx.xx.xx:8089/services/streams/search?sh_sid=yyyyyyyyy.zzzz
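
To see which peers are affected, it can help to count these DEBUG messages per peer. The following is only a minimal sketch: the function name count_pauses_by_peer and the sample addresses are made up for illustration, and the regular expression assumes the message format is exactly as in the line above.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the DEBUG message quoted above and captures the peer URI.
PAUSE_RE = re.compile(r"Pausing transaction on write to uri: (\S+)")


def count_pauses_by_peer(lines):
    """Count 'Pausing transaction' messages per peer host (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            # urlsplit().hostname drops the scheme, port, path and query string.
            host = urlsplit(match.group(1)).hostname or "unknown"
            counts[host] += 1
    return counts


# Sample lines with made-up peer addresses and search IDs.
sample = [
    "DEBUG DistributedSearchResultCollector - Pausing transaction on write to uri: "
    "http://10.0.0.1:8089/services/streams/search?sh_sid=1.2",
    "INFO  SomethingElse - unrelated message",
    "DEBUG DistributedSearchResultCollector - Pausing transaction on write to uri: "
    "http://10.0.0.2:8089/services/streams/search?sh_sid=1.2",
    "DEBUG DistributedSearchResultCollector - Pausing transaction on write to uri: "
    "http://10.0.0.1:8089/services/streams/search?sh_sid=1.2",
]

pauses = count_pauses_by_peer(sample)
for host, n in pauses.most_common():
    print(host, n)
```

For a real search you would pass an open file object for that search's search.log instead of the sample list.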

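The most likely cause is that the search head cannot keep up with processing the search results returned by the indexers, so the queue on the search head side pauses. If the search head then sends nothing to an indexer for more than 12 seconds, the indexer closes its socket to the search head, and the error is displayed.
As a workaround, increase max_chunk_queue_size in limits.conf on the search head.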

[search]
max_chunk_queue_size = 30000000 

In addition, starting with 6.4.5 the following parameters were added to server.conf, so you can resolve the problem by increasing them on the indexers.

[httpServer]
keepAliveIdleTimeout = 7200
busyKeepAliveIdleTimeout = 12 

There are two queues on the search head side: the application-layer chunk queue (sized by max_chunk_queue_size) and the TCP-layer receive queue (TCP recv queue).
When a user runs a search on the search head, the search head dispatches it to the indexers, and each indexer searches the events in its local buckets.
The indexers write their search results to their TCP send queue, from which the results travel to the search head's TCP recv queue.
If the indexers send a large volume of results and the search head cannot read them fast enough, the chunk queue fills up and the search head pauses reading further results.
If the indexer has by then already handed all of its search results to the TCP send queue, the connection sits idle from the indexer's point of view.
The indexer waits 12 seconds in this idle state and then closes the socket. Because the socket is closed abruptly, the search results can be incomplete.
SPL-124350 fixed this by adding a new parameter called busyKeepAliveIdleTimeout to server.conf, so that this 12-second value can be tuned.
The parameter was added in versions 6.3.9, 6.4.5, and 6.5.1.
Please refer to the documentation below for details about the two parameters (keepAliveIdleTimeout and busyKeepAliveIdleTimeout).
http://docs.splunk.com/Documentation/Splunk/6.4.5/Admin/Serverconf
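
The timing described above can be illustrated with a toy model. This is not Splunk code and does not reflect how splunkd implements the timer. It assumes, purely for illustration, that the indexer's idle clock starts once it has handed all results to its TCP send queue and restarts every time the search head reads from the connection. The function name socket_close_time and the timestamps are made up.

```python
from typing import Iterable, Optional


def socket_close_time(sent_done_at: float,
                      read_times: Iterable[float],
                      busy_keepalive_idle_timeout: float = 12.0) -> Optional[float]:
    """Return the time at which the indexer would close the socket, or None.

    Toy model (assumption): the idle clock starts at sent_done_at and is
    reset by every later read on the search head side.
    """
    last_activity = sent_done_at
    for t in sorted(read_times):
        if t <= sent_done_at:
            continue  # reads before the indexer went idle are not relevant here
        if t - last_activity > busy_keepalive_idle_timeout:
            # The search head stayed paused longer than the timeout allows.
            return last_activity + busy_keepalive_idle_timeout
        last_activity = t
    return None


# The search head reads at t=5 and t=10, then pauses until t=30.
reads = [5.0, 10.0, 30.0]
closed_default = socket_close_time(0.0, reads)        # 12-second timeout
closed_tuned = socket_close_time(0.0, reads, 60.0)    # larger timeout
print(closed_default, closed_tuned)
```

In this model the 20-second pause between the reads at 10 and 30 exceeds the 12-second timeout, so the socket is closed at 22. With a larger timeout the same pause is tolerated, which is the effect of raising busyKeepAliveIdleTimeout. A larger max_chunk_queue_size attacks the other side of the problem by making long pauses less likely in the first place.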

If you cannot upgrade your indexers, you can work around the issue by increasing max_chunk_queue_size in limits.conf on the search head:

 [search]
 max_chunk_queue_size = 30000000 
