On my 3 indexers (which are in a cluster), the typing queue and indexing queue sometimes fill up almost completely (>90% or even 100%),
and the indexing rate on those indexers drops (e.g. to 300KB/sec, whereas it is normally >3MB/sec).
After I restart the Splunk service on all my indexers, everything goes back to normal: the indexing rate recovers, the queues are cleared, etc.
How does restarting the Splunk service actually restore performance in this case?
Is it recommended to restart the indexers (rolling restart) when the queues/pipelines are full?
Probably there are some scheduled searches that are very heavy to execute. Remember that every search takes a CPU and releases it only when it finishes, so if you have many searches with many subsearches, all your CPUs are taken by these searches and you don't have enough CPUs left for indexing. When you restart the indexers, these searches are stopped, so once the services are back up your indexers are less loaded and able to index correctly.
I suggest using the Monitoring Console to see the CPU load on your servers.
I have no scheduled searches in my environment, only ad-hoc searches run by users (and very few of those).
I would just like to know: does Splunk clear/wipe the queues during the restart? That would mean data loss.
Or how does the restart actually improve the indexing rate?
Simply put, what happens to the queued data during the restart?
No, I said that after restarting there are no pending searches, so your system can be used mainly for indexing.
Anyway, use the Monitoring Console to monitor system load and queues.
In Alerts for Splunk Admins, I have an alert, AllSplunkLevel - Data Loss on shutdown (available on GitHub), that detects this issue on shutdown of a forwarder.
I found that the message "Forcing TcpOutputGroups to shutdown after timeout"
results in data loss from a forwarder to an indexer. If you see something similar on an indexer with the keyword "forcing", I'd suspect you might have an issue.
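If you want to check this outside Splunk, a minimal sketch is to scan the indexer's splunkd.log for the keyword "forcing". This is only a case-insensitive keyword match, not the logic of the alert mentioned above; the function name is hypothetical, and the log path in the usage note assumes a default install layout.

```python
import re
from typing import Iterable, List

# Case-insensitive match on the keyword "forcing"; the forwarder message
# "Forcing TcpOutputGroups to shutdown after timeout" is one known example.
FORCING_PATTERN = re.compile(r"\bforcing\b", re.IGNORECASE)


def find_forced_shutdown_lines(lines: Iterable[str]) -> List[str]:
    """Return log lines that mention forcing (e.g. a forced shutdown), without trailing newlines."""
    return [line.rstrip("\n") for line in lines if FORCING_PATTERN.search(line)]
```

For example, open $SPLUNK_HOME/var/log/splunk/splunkd.log and pass the file object to find_forced_shutdown_lines(). Any hits around the time of a restart are worth a closer look.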
Do you see anything about a forced shutdown? The other way to check this would be metrics.log just before shutdown: do the queues look approximately empty (zero or close to it) before shutdown?
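The metrics.log check can be sketched as follows, assuming the usual comma-separated key=value layout of the group=queue lines (name, max_size_kb, current_size_kb). The function name is hypothetical. It keeps the last sample seen for each queue, so feed it only the lines written just before shutdown.

```python
import re
from typing import Dict, Iterable

# Assumed layout of a metrics.log queue line, e.g.
#   ... Metrics - group=queue, name=typingqueue, max_size_kb=500, current_size_kb=480, ...
KV_PATTERN = re.compile(r"(\w+)=([^,\s]+)")


def last_queue_fill(lines: Iterable[str]) -> Dict[str, float]:
    """Return the most recent fill percentage (0-100) for each queue name."""
    fill: Dict[str, float] = {}
    for line in lines:
        fields = dict(KV_PATTERN.findall(line))
        name = fields.get("name")
        if fields.get("group") != "queue" or name is None:
            continue  # only queue metrics are relevant here
        try:
            max_kb = float(fields["max_size_kb"])
            cur_kb = float(fields["current_size_kb"])
        except (KeyError, ValueError):
            continue  # skip lines missing or garbling the size fields
        if max_kb > 0:
            # Later lines overwrite earlier ones, so the last sample wins.
            fill[name] = round(100.0 * cur_kb / max_kb, 1)
    return fill
```

Values near 0 for every queue suggest the queues drained before splunkd stopped. Values still near 100 suggest queued data may not have been processed.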
To answer your question: no, you should fix the root cause rather than restarting.