We have a farm that is going to be retired in a couple of months.
The cluster master hasn't been doing well at all; see my earlier question, "Why is the indexer cluster master being marked as down consistently?"
Support just told us:
"The Cluster Master is desperately in need of additional resources. 2 cores and 8 GB of memory are not going to be sufficient."
Since there is no chance for us to get approval for additional resources on this VM, I wonder what can be done to alleviate the load on this cluster master?
2 cores and 8 GB is not going to cut it... but there are some configs we can try to tinker with (no promises):
indexers server.conf:
heartbeat_period = 1 -> 10
cxn_timeout = 60 -> 300
send_timeout = 60 -> 300
rcv_timeout = 60 -> 300
CM server.conf:
heartbeat_timeout = 60 -> 300
max_fixup_time_ms = 5000
cxn_timeout = 60 -> 300
send_timeout = 60 -> 300
rcv_timeout = 60 -> 300
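Written out as config, the changes listed above might look roughly like the sketch below. It assumes all of these settings live in the [clustering] stanza of server.conf on both the indexers and the CM, which is worth checking against the server.conf spec for your version before applying anything. The two parts go in separate files on separate hosts.

```ini
# On each indexer (peer): server.conf
# Assumption: these settings belong under [clustering].
[clustering]
heartbeat_period = 10
cxn_timeout = 300
send_timeout = 300
rcv_timeout = 300

# On the cluster master: server.conf
# Assumption: same stanza name on the CM.
[clustering]
heartbeat_timeout = 300
max_fixup_time_ms = 5000
cxn_timeout = 300
send_timeout = 300
rcv_timeout = 300
```

The idea is that peers check in less often and everyone waits longer before declaring a timeout, so a slow CM gets marked down less readily. It does not reduce the work the CM has to do.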
Much appreciated @dxu_splunk !!
My understanding of the issue is that the Cluster Master is having trouble coordinating your search and replication factors among the peers. So, even if you disable indexing _internal (which I promise you WILL regret), you will eventually see this happen as bucket load increases with data volume.
Are your search factor and replication factor wildly high? Did you mess with the size of buckets? Either of those tunings could be causing you more issues.
At the end of the day, the software was designed for minimum specifications that are not being provided. If it helps sell your need for more power: a car can't really drive well on one wheel if it requires four.
Makes perfect sense @SloshBurch - thank you.
For the sake of completeness from the CM -
$ grep CMPeer < splunkd.log.5.instability | wc -l
71999
It covers this time frame -
10-03-2018 01:11:11.213 -0500
10-03-2018 01:11:59.532 -0500
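To put that count in perspective, here is a small sketch that turns the line count and the two timestamps into a message rate. The function name `events_per_second` is just illustrative, and it assumes the first and last timestamps above bound all 71999 lines.

```python
from datetime import datetime

# Timestamp layout used in splunkd.log, e.g. "10-03-2018 01:11:11.213 -0500"
FMT = "%m-%d-%Y %H:%M:%S.%f %z"


def events_per_second(count, first_ts, last_ts):
    """Average number of log lines per second between two timestamps."""
    first = datetime.strptime(first_ts, FMT)
    last = datetime.strptime(last_ts, FMT)
    span = (last - first).total_seconds()
    if span <= 0:
        raise ValueError("last timestamp must be later than the first")
    return count / span


rate = events_per_second(
    71999,
    "10-03-2018 01:11:11.213 -0500",
    "10-03-2018 01:11:59.532 -0500",
)
print(f"{rate:.0f} CMPeer messages per second")
```

That works out to roughly 1,490 CMPeer messages per second over about 48 seconds.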
The messages look like -
10-03-2018 01:11:59.532 -0500 INFO CMPeer - peer=12E6ED7C-9765-46F1-8883-5F34834E82F4 peer_name=<indexer> bid=<index name>~4485~3CA07398-A043-4E1E-BA20-233C66372471 transitioning from=Searchable to=SearchablePendingMask oldmask=0x4 newmask=0x5 reason="swap primaries"
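If it helps to see which transitions dominate, a sketch like the following could tally them by from-state, to-state, and reason. The regular expression is based only on the one sample line above, so other CMPeer message shapes may not match, and `tally_transitions` is a made-up helper name.

```python
import re
from collections import Counter

# Pattern derived from the single sample line; an assumption, not a full grammar.
CMPEER_RE = re.compile(
    r'CMPeer - peer=(?P<peer>\S+) .*?'
    r'transitioning from=(?P<src>\S+) to=(?P<dst>\S+) .*?'
    r'reason="(?P<reason>[^"]*)"'
)


def tally_transitions(lines):
    """Count CMPeer bucket transitions by (from, to, reason)."""
    counts = Counter()
    for line in lines:
        m = CMPEER_RE.search(line)
        if m:
            counts[(m["src"], m["dst"], m["reason"])] += 1
    return counts


sample = (
    '10-03-2018 01:11:59.532 -0500 INFO CMPeer - '
    'peer=12E6ED7C-9765-46F1-8883-5F34834E82F4 peer_name=<indexer> '
    'bid=<index name>~4485~3CA07398-A043-4E1E-BA20-233C66372471 '
    'transitioning from=Searchable to=SearchablePendingMask '
    'oldmask=0x4 newmask=0x5 reason="swap primaries"'
)

for key, n in tally_transitions([sample]).most_common():
    print(n, key)
```

To run it against the real file, pass an open file handle instead of the one-line list, for example `tally_transitions(open("splunkd.log.5.instability"))`.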