After several years, the replication factor on my 6.6.3 indexer cluster recently changed to 'Not Met'. It has been fine in the past, and I can see buckets replicating among the 4 members of the cluster. I can see open connections on :9887 among the members, they show up in each other's splunkd.log as successful replications, and nothing has changed configuration-wise or even version-wise. Two members of the cluster each have about 200G less free space than the other two, and I cannot find anything that helps me figure out what the problem is.
The monitoring console says 'Not Met', 'show cluster-config' says 'not met', and the --verbose output on the master looks like this for every entry, with the Replicated copies and Searchable copies trackers showing the same numbers across the board:
network_wireless_aps
  Number of non-site aware buckets=0
  Number of buckets=28
  Size=366897499
  Searchable YES
  Replicated copies tracker 28/28 28/28
  Searchable copies tracker 28/28 28/28
network_wireless_controllers
  Number of non-site aware buckets=0
  Number of buckets=35
  Size=19069528111
  Searchable YES
  Replicated copies tracker 35/35 35/35
  Searchable copies tracker 35/35 35/35
(the same is true for the search factor, but I figure if one gets better, the other will too)
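For reference, the verbose status above was pulled on the master with the cluster CLI; a minimal sketch, assuming $SPLUNK_HOME/bin is on the PATH and you are logged in as an admin:

```shell
# Run on the cluster master; prints per-index replication and
# search factor status, including the copies trackers shown above.
splunk show cluster-status --verbose
```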
On your cluster master GUI, click the Indexes tab and then Bucket Status, with "select index" set to "all".
Any "in progress" or "pending" jobs?
If so, have a look at the time in fixup and the current status. This will give a hint as to why your replication/search factors are not met.
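If the master's web UI isn't available, the same fixup queue can be queried over REST; a sketch assuming the default management port 8089 and placeholder admin credentials (swap in your own, and drop -k if you have valid certs):

```shell
# List pending fixup tasks on the cluster master, per fixup level.
# Other levels include search_factor and generation.
curl -k -u admin:changeme \
  "https://localhost:8089/services/cluster/master/fixup?level=replication_factor&output_mode=json"
```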
I've been checking by comparing the byte sizes of the various directories in $SPLUNK_HOME/var/lib/splunk, and they are wildly different across the four machines.
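The comparison was along these lines, run on each peer in turn (a sketch; /opt/splunk is an assumed install path, adjust for your environment):

```shell
# Per-index on-disk size on one indexer, largest first.
# Repeat on every cluster member and compare the lists.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
du -sh "$SPLUNK_HOME"/var/lib/splunk/* | sort -rh
```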
After that rebalance command, however, they are more out of balance than before.
I'm really stumped, and the disk volume on two of the four boxes is approaching 95% full, enough to trigger alarms. The other two are holding pretty steady at 88% full.
Something must have worked its way out eventually with that rebalance command, because an hour after it finished, all 4 indexers showed lower disk utilization. I think it's fixed, even though I don't know exactly what happened.
The bit I forgot to say is that it's been like this for a week now. That's a bit long, even for 200G worth of buckets, yes?
I haven't ever enabled the web console on the master, so I found a couple of CLI commands to look for pending jobs, and there weren't any that I could find.
I did just kick off a 'rebalance cluster-data', though; I hadn't tried that yet. We'll see what happens.
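For anyone landing here later, the rebalance was started from the master; a minimal sketch of the commands (names per the 6.x CLI):

```shell
# Start a data rebalance across the peers; run on the cluster master.
splunk rebalance cluster-data -action start

# Check on it later, or stop it if needed:
splunk rebalance cluster-data -action status
splunk rebalance cluster-data -action stop
```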