Solved: Indexing stops and duplicate events occur when using UF

cipherjake · ‎09-05-2014

I am doing the initial data migration for our Splunk deployment. While ingesting about 40 GB of logs, indexing stopped partway through. I suspect the forwarder could no longer forward data (see the error message below). After a while, ingestion resumed on its own, but performance was worse than before.

Settings at the time of the problem

outputs.conf:
autoLB (load balancing across the 30 indexers)
useACK = true
autoLBFrequency = 20

limits.conf:
maxKBps = 256

[What has been done]
I checked the contents of splunkd.log.
-> A large number of WARN messages had been logged:

WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for xxxxxx seconds.
(xxxxxx ranged from 100 to 24800)

This message was logged continuously, and during that time the forwarder could not send any data. After some time, forwarding resumed, but performance had dropped significantly and duplicate events started to appear.

[Splunk architecture]
Splunk Enterprise 6.1.3
Search Head: 4 servers
Indexer: 30 servers (cluster; SF: 2, RF: 3)
Master/deployment server: 1 server (mixed role)
Universal Forwarder: 2 servers

I looked into the error message but could not find any useful information. If anyone knows anything about this, I would appreciate your input.

Additional information (2014/09/12)
Here are some things I noticed about the state of the indexers when the error occurred:
・Some indexers were using a large amount of memory (almost all of the 8 GB; logging in was also very slow)
・BucketsReplicator errors (the cluster temporarily could not meet SF and RF)
・Searches also feel somewhat slow (both during and after data input)
It could look like a network connectivity problem, but I suspect the cause is simply that the Windows servers are not responding properly. telnet and ping both get through, so this is only a guess.

Suda · ‎09-16-2014

I suspect that the disk (HDD) performance of your indexers is very likely insufficient for your environment.
The queue information in your indexers' metrics.log may show evidence of the root cause. Please check it.
The Splunk on Splunk (S.o.S.) app has a dashboard for checking queue status.

How is the storage for your index volumes configured?

Are your 30 indexers creating their indexes on a shared NAS?
Are your indexers deployed on VMs?

If so, I think you need to review the sizing and design of your Splunk environment, and possibly rebuild it.

I hope this helps.

Suda · ‎09-17-2014

One more thing...

To help isolate the problem, I recommend turning off the useACK feature.

useACK requires more resources on both the indexers and the forwarders to track delivery status.

I hope this helps.
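
A sketch of what disabling acknowledgment for the test might look like in the forwarder's outputs.conf. The group name default-autolb-group is taken from the warning message in the original post; the server names and port are placeholders, not the poster's actual hosts.

```ini
# outputs.conf on each Universal Forwarder (isolation test only)
[tcpout]
defaultGroup = default-autolb-group

[tcpout:default-autolb-group]
# Placeholder hosts; list your actual indexers here
server = idx01.example.com:9997, idx02.example.com:9997
autoLBFrequency = 20
# Turn acknowledgment off while isolating the problem
useACK = false
```

With useACK off, data in flight can be lost if an indexer goes down, so this is best treated as a temporary diagnostic setting.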

cipherjake · ‎10-30-2014

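Thank you for your comments.
In the end, we concluded that disk performance was insufficient.
I understand that indexing falls behind when the disks are too slow. However, while data was still only partially written, the forwarder decided the indexer had timed out and sent the same data to another indexer, which produced duplicate events. Tuning the timeout value would probably help, but I am looking for a way to avoid resending data that has already been written.

Note: useACK is turned on to ensure reliable data transfer.

Thank you for all your advice.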
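
One setting that may be relevant here, offered as an assumption rather than a confirmed fix: with useACK enabled, the forwarder resends data for which it has not received an acknowledgment, and outputs.conf has a readTimeout setting (default 300 seconds) that governs how long the forwarder waits on the indexer connection. Raising it could reduce resends to slow indexers, at the cost of longer stalls. The value 600 below is purely illustrative.

```ini
# outputs.conf on each Universal Forwarder
[tcpout:default-autolb-group]
useACK = true
# Seconds to wait for a read from the indexer socket before closing it
# (default is 300); 600 is an illustrative value, not a recommendation
readTimeout = 600
```

This does not remove the possibility of duplicates: whenever an acknowledgment is lost or late, resending is how useACK avoids data loss.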

cipherjake · ‎11-20-2014

We moved to new hardware that meets the disk performance requirements, increased the forwarder's "maxKBps" value, and tested again.
However, performance did not seem much better than on the old hardware.

Do other parameters (queue sizes, etc.) need to be adjusted?
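
For reference, the queue-related settings this question likely refers to live in two places. The sketch below only shows where they are set; the sizes are illustrative assumptions, and whether changing them helps depends on which queue is actually filling up.

```ini
# outputs.conf on the forwarder: size of the output queue
[tcpout]
maxQueueSize = 10MB

# server.conf on an indexer: size of a named pipeline queue
[queue=parsingQueue]
maxSize = 10MB
```

Larger queues only absorb bursts; if the indexers cannot write to disk fast enough, the queues will still fill up.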

Suda · ‎12-01-2014

Thank you for the update.
So changing the hardware did not improve performance... that is strange. There is probably still a bottleneck somewhere.
First, could you share the details of your hardware and configuration (indexers, indexer disks, forwarders)?
If I were you, I would check the queue status during indexing to locate the bottleneck.

As I and others have already mentioned, the S.o.S. app should help with that.

s2_splunk · ‎09-11-2014

The first thing I would do is to remove your throughput limit set in limits.conf:
maxKBps = 256 means you are not getting more than 256KBps out the door. Try removing the limit by setting
maxKBps = 0

If you are still seeing messages about queues being blocked, your indexers simply cannot keep up. There can be various reasons for this, but most likely Splunk either has trouble parsing your events (timestamp extraction, etc.) or cannot write to disk fast enough.

Can you share the hardware specs (including disk) for the machines running your indexers?
Do you have the Splunk on Splunk (S.o.S.) app installed in your environment? If not, I recommend you do that as it gives you valuable information about the indexing performance of your cluster.
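
As a minimal sketch of the change suggested above, assuming the limit is set in the forwarder's limits.conf under the [thruput] stanza (the usual location for maxKBps), the setting might look like this. The file path assumes a default Universal Forwarder install and may differ in your environment.

```ini
# $SPLUNK_HOME/etc/system/local/limits.conf on each Universal Forwarder
[thruput]
# 0 removes the throughput cap; the original post had this set to 256
maxKBps = 0
```

A forwarder restart is typically needed for the change to take effect.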
