We had some communication issues over the past couple of days, and now my index master node is telling me that my replication factor is not met for both indexing and searching, yet all nodes are up and running. I thought this might take time, so I let it bake overnight, and it is still in this state. Any help is much appreciated as I need to get this thing healthy and happy!
You've probably reached a point where this is no longer a problem. However, for posterity, I'll try to explain what happens:
Ordinarily, when new data arrives, we create "streaming copies" of buckets as data flows into a hot bucket. These are live copies of the data, copied block by block to other indexers (up to the count set by our "replication factor" (RF)) as the data arrives.
The communication issues meant that when new data arrived for an index, an attempt to create a new bucket couldn't be communicated to either the cluster master or another indexer. This produces an orphaned bucket.
This bucket will remain in this state until it rolls to warm. When it's warm, it's no longer written to, and can easily be copied 1:1 to its peers to satisfy replication factor.
How to correct / rectify:
When an indexer joins the cluster (e.g. starting up), it provides a list of all of its data buckets to the cluster master.
If the CM sees a bucket that is new (to it) or doesn't yet meet RF, it will then kick off "non-streaming" copies to meet replication factor.
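To make the decision above concrete, here is a toy model in Python. It is not Splunk code: the function name, the shape of the peer reports, and the bucket IDs are all made up for illustration. It only shows the idea of counting copies per bucket and flagging the ones that fall short of RF.

```python
from collections import defaultdict

def buckets_needing_fixup(peer_reports, replication_factor):
    """Toy model: return {bucket_id: missing_copies} for buckets below RF.

    peer_reports maps a peer name to the bucket IDs it reported when it
    joined the cluster (hypothetical structure, for illustration only).
    """
    copies = defaultdict(set)
    for peer, bucket_ids in peer_reports.items():
        for bucket_id in bucket_ids:
            copies[bucket_id].add(peer)
    # A bucket held by fewer peers than RF needs additional copies.
    return {
        bucket_id: replication_factor - len(peers)
        for bucket_id, peers in copies.items()
        if len(peers) < replication_factor
    }

# Made-up example: one bucket exists only on idx1 (the "orphaned" case).
reports = {
    "idx1": ["main~5~A", "main~6~A"],
    "idx2": ["main~5~A"],
    "idx3": ["main~5~A"],
}
shortfall = buckets_needing_fixup(reports, replication_factor=3)
```

In this example the bucket that only idx1 reported would need two more copies, which is the work the "non-streaming" copies do.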
If the hot bucket has moved to warm already (before restarting the indexer with "orphaned" buckets), then triggering a 're-add' of the indexer may fix the situation. If the bucket hasn't been moved to warm, then forcing it to warm with the "roll-hot-buckets" trick will roll it from hot to warm, allowing it to be fixed up as a "non-streaming" copy.
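For reference, the "roll-hot-buckets" trick is a POST to the indexer's management port. Below is a minimal Python sketch, assuming the default management port 8089, a session key you have already obtained, and the data/indexes/<index>/roll-hot-buckets REST endpoint; the helper names and the hostname are made up for illustration. Check the endpoint against the REST API reference for your Splunk version, and try it on a test index first.

```python
import ssl
import urllib.request
from urllib.parse import quote

def roll_hot_buckets_url(host, index, port=8089):
    """Build the management-port URL for rolling an index's hot buckets."""
    return (
        f"https://{host}:{port}/services/data/indexes/"
        f"{quote(index, safe='')}/roll-hot-buckets"
    )

def roll_hot_buckets(host, index, session_key, port=8089, verify_tls=True):
    """POST to the endpoint on one indexer and return the HTTP status code."""
    ctx = ssl.create_default_context()
    if not verify_tls:
        # Assumption: only needed if the indexer still uses a self-signed cert.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    req = urllib.request.Request(
        roll_hot_buckets_url(host, index, port),
        data=b"",
        method="POST",
        headers={"Authorization": f"Splunk {session_key}"},
    )
    with urllib.request.urlopen(req, context=ctx) as resp:
        return resp.status

# Hypothetical host and index names, for illustration only.
url = roll_hot_buckets_url("idx1.example.com", "main")
```

You would run roll_hot_buckets against the indexer that holds the orphaned hot bucket, once per affected index.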
Leaving the indexers alone will eventually let those hot buckets roll to warm (a number of parameters in indexes.conf govern this behavior), and once warm, they can be fixed up as "non-streaming" copies.