Deployment Architecture

How do I figure out what the real problem is with my index cluster?

duke_splunk_adm
Engager

After several years of stable operation, the replication factor on my 6.6.3 index cluster recently changed to 'Not Met'. I can see buckets replicating among the 4 members of the cluster: connections are open on :9887 among the members, they show up in each other's splunkd.log as successful replications, and nothing has changed configuration-wise or even version-wise. Two members of the cluster each have about 200G less free space than the other two, and I cannot find anything that helps me figure out what the problem is.
The monitoring console says 'Not Met', `splunk show cluster-status` says 'not met', and the --verbose output on the master looks like this for every index, with the Replicated copies and Searchable copies trackers showing matching numbers across the board:

network_wireless_aps
         Number of non-site aware buckets=0
         Number of buckets=28
         Size=366897499
         Searchable  YES
         Replicated copies tracker
                28/28                   28/28
         Searchable copies tracker
                28/28                   28/28

 network_wireless_controllers
         Number of non-site aware buckets=0
         Number of buckets=35
         Size=19069528111
         Searchable  YES
         Replicated copies tracker
                35/35                   35/35
         Searchable copies tracker
                35/35                   35/35

(The same is true for the search factor, but I figure if one recovers the other will too.)
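For reference, the per-index listing above can be pulled on the cluster master with the CLI; something like the following (a sketch, assuming the master is local and you are authenticated) shows both the one-line summary and the verbose per-index trackers:

```shell
# On the cluster master: one-line replication/search factor summary.
splunk show cluster-status

# Full per-index listing with the "Replicated copies tracker" and
# "Searchable copies tracker" sections shown in the question.
splunk show cluster-status --verbose
```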

0 Karma

Lucas_K
Motivator

On your cluster master's GUI, click the Indexes tab and then Bucket Status. Set "select index" to "All".

Any "in progress" or "pending" jobs?

If so, have a look at the time in fixup and the current status. That will give a hint as to why your replication/search factors are not met.
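If Splunk Web isn't running on the master, the same fixup information is exposed through the master's REST API; a sketch (the hostname and credentials below are placeholders) would be:

```shell
# Query the cluster master for outstanding fixup tasks blocking the
# replication factor; swap the level for search_factor to check that side.
# Placeholder host/credentials -- substitute your own.
curl -sk -u admin:changeme \
  "https://cluster-master.example.com:8089/services/cluster/master/fixup?level=replication_factor"
```

An empty task list here, with the factor still 'Not Met', suggests the master's view of the buckets is the place to dig next.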

duke_splunk_adm
Engager

I've been checking by comparing the byte sizes of the various directories in $SPLUNK_HOME/var/lib/splunk, and they are wildly different across the four machines.
After that rebalance command, however, they are more out of balance than before.
I'm really stumped, and the disk volume on two of the four boxes is approaching 95% full, enough to trigger alarms; the other two are holding steady at about 88% full.
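For that kind of per-index comparison, a small sketch like this (assuming the default $SPLUNK_HOME of /opt/splunk when the variable is unset) run on each peer makes it easy to diff which indexes diverge:

```shell
# Per-index on-disk usage under the Splunk data directory, largest first.
# Run the same command on each peer and compare the output side by side.
SPLUNK_DB="${SPLUNK_HOME:-/opt/splunk}/var/lib/splunk"
du -s "$SPLUNK_DB"/* 2>/dev/null | sort -rn | head -20
```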

0 Karma

duke_splunk_adm
Engager

Something must have worked itself out eventually with that rebalance command, because an hour after it finished all 4 indexers showed lower disk utilization. I think it's fixed, even though I don't know exactly what happened.

0 Karma

duke_splunk_adm
Engager

The bit I forgot to mention is that it's been like this for a week now. That's a long time, even for 200G worth of buckets, yes?
I've never enabled Splunk Web on the master, so I found a couple of CLI commands to look for pending jobs, and there weren't any that I could find.

I did just kick off a 'rebalance cluster-data', though; I hadn't tried that yet. We'll see what happens.
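For anyone following along, the invocation I used was along these lines (a sketch, run on the cluster master; data rebalance has been available since 6.4):

```shell
# Start a cluster-wide data rebalance from the master.
splunk rebalance cluster-data -action start

# Check on its progress while it runs.
splunk rebalance cluster-data -action status
```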

0 Karma