Deployment Architecture

Splunk indexer: status "pending", fully searchable "no". How to fix?

Contributor

After updating a bucket replication policy and doing a rolling restart of cluster indexers, one of the indexers seems stuck in this state:

[Screenshot: Indexer Clustering: Status]

Question: where do I go, what do I do, to figure out what's the root cause and how to fix it?

Cluster status in plaintext:
- Search Factor Not Met
- Replication Factor Not Met
- One of three indexers: Fully Searchable: No, Status: Pending.
- One out of 12 indexes shows incomplete Searchable and Replicated Data Copies (the rest seem fine)

Under "Indexer Clustering: Service Activity" > "Snapshots", a number of "Pending" fixup tasks appear stuck, never moving to "In Progress" status:
- "Fixup Tasks - In Progress (0)"
- "Fixup Tasks - Pending":
-- Tasks to Meet Search Factor (4)
-- Tasks to Meet Replication Factor (6)
-- Tasks to Meet Generation (6)

Tasks to Meet Search Factor (4)
Bucket  Index   Trigger Condition   Trigger Time    Current State
_metrics~34~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85    _metrics    does not meet: sf & rf      Waiting 'target_wait_time' before search factor fixup
_metrics~34~64AE7236-EE5E-4EEE-AEBF-203F149FCB61    _metrics    does not meet: primality & sf & rf      Waiting 'target_wait_time' before search factor fixup
_metrics~35~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85    _metrics    does not meet: sf & rf      Waiting 'target_wait_time' before search factor fixup
_metrics~35~64AE7236-EE5E-4EEE-AEBF-203F149FCB61    _metrics    does not meet: sf & rf      Waiting 'target_wait_time' before search factor fixup

Tasks to Meet Replication Factor (6)
Bucket  Index   Trigger Condition   Trigger Time    Current State
_metrics~34~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85    _metrics    does not meet: sf & rf      Waiting 'target_wait_time' before replicating bucket
_metrics~34~64AE7236-EE5E-4EEE-AEBF-203F149FCB61    _metrics    does not meet: primality & sf & rf      Waiting 'target_wait_time' before replicating bucket
_metrics~35~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85    _metrics    does not meet: sf & rf      Waiting 'target_wait_time' before replicating bucket
_metrics~35~64AE7236-EE5E-4EEE-AEBF-203F149FCB61    _metrics    does not meet: sf & rf      Waiting 'target_wait_time' before replicating bucket
_metrics~36~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85    _metrics    non-streaming failure - src=64AE7236-EE5E-4EEE-AEBF-203F149FCB61 tgt=4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 failing=tgt
_metrics~37~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85    _metrics    non-streaming failure - src=9B5D3504-81B2-4DCC-BF4D-F7ED811A3571 tgt=4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 failing=tgt

... etc.
[Screenshot: Indexer Clustering: Service Activity]
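The same pending-fixup queues shown in the UI can also be pulled from the cluster master's REST API (the `cluster/master/fixup` endpoint). A minimal sketch; the hostname and credentials below are placeholders, not values from this thread:

```shell
# Hypothetical cluster master hostname; substitute your own.
CM_HOST="cm.example.com"

fixup_url() {
  # Build the manager REST URL for one fixup level:
  # search_factor, replication_factor, or generation
  echo "https://${CM_HOST}:8089/services/cluster/master/fixup?level=$1"
}

# Example (placeholder credentials):
# curl -k -u admin:changeme "$(fixup_url search_factor)"
```

Querying with `level=replication_factor` or `level=generation` shows the other two queues from the screenshot.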

Some of the errors on the indexer(s):

04-03-2020 07:55:10.100 -0700 ERROR TcpInputProc - event=replicationData status=failed err="Close failed"
host = bvl-mit-splkin2 | source = /opt/splunk/var/log/splunk/splunkd.log | sourcetype = splunkd

04-03-2020 07:55:10.100 -0700 WARN  BucketReplicator - Failed to replicate warm bucket bid=_metrics~37~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 to guid=4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 host=10.101.128.89 s2sport=9887. Connection closed.
host = bvl-mit-splkin1 | source = /opt/splunk/var/log/splunk/splunkd.log | sourcetype = splunkd

04-03-2020 07:55:10.097 -0700 WARN  S2SFileReceiver - event=processFileSlice bid=_metrics~37~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 msg='aborting on local error'
host = bvl-mit-splkin2 | source = /opt/splunk/var/log/splunk/splunkd.log | sourcetype = splunkd

04-03-2020 07:55:10.097 -0700 ERROR S2SFileReceiver - event=onFileClosed replicationType=eJournalReplication bid=_metrics~37~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 state=eComplete src=64AE7236-EE5E-4EEE-AEBF-203F149FCB61 bucketType=warm status=failed err="bucket is already registered, registered not as a streaming hot target (SPL-90606)"
host = bvl-mit-splkin2 | source = /opt/splunk/var/log/splunk/splunkd.log | sourcetype = splunkd

04-03-2020 07:55:10.089 -0700 WARN  BucketReplicator - Failed to replicate warm bucket bid=_metrics~36~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 to guid=4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 host=10.101.128.89 s2sport=9887. Connection closed.
host = bvl-mit-splkin3 | source = /opt/splunk/var/log/splunk/splunkd.log | sourcetype = splunkd

Additional notes:

Output of /opt/splunk/bin/splunk list peer-info on the peer:

slave
    base_generation_id:651
    is_registered:1
    last_heartbeat_attempt:0
    maintenance_mode:0
    registered_summary_state:3
    restart_state:NoRestart
    site:default
    status:Up
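For the master-side view of the same peer, `splunk show cluster-status` run on the cluster master summarizes each peer's status and whether SF/RF are met per index. A sketch assuming a default install path (adjust `SPLUNK_HOME` if yours differs); the `-x` guard just skips the call where Splunk isn't installed:

```shell
# Assumes the default install path; override SPLUNK_HOME if needed.
SPLUNK="${SPLUNK_HOME:-/opt/splunk}/bin/splunk"

# Run on the cluster master: per-peer status plus SF/RF state per index
[ -x "$SPLUNK" ] && "$SPLUNK" show cluster-status --verbose
```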

/opt/splunk/etc/master-apps/_cluster/local/indexes.conf on the CM (successfully replicated to peers via /opt/splunk/bin/splunk apply cluster-bundle):

[default]
repFactor = auto

[_introspection]
repFactor = 0

[windows]
frozenTimePeriodInSecs = 31536000
coldToFrozenDir = $SPLUNK_DB/$_index_name/frozendb

[wineventlog]
frozenTimePeriodInSecs = 31536000
coldToFrozenDir = $SPLUNK_DB/$_index_name/frozendb

Details:

  • Splunk Enterprise 8.0.2
  • mostly default settings

Thank you!

1 Solution

Contributor

Restarting all indexers (not just the one with errors) was what resolved the issue for me.

The root cause is still a mystery. Suspecting a bug or a misconfiguration:

  • The degraded _metrics index in the screenshot above is not supposed to exist: we aren't collecting any metrics to my knowledge, and searching for them via e.g. | mcatalog values(_dims) WHERE index=* produces no results.
  • The _metrics index did seem to exist prior to this issue, but it no longer exists in our environment; it's unclear why it was created in the first place.
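For anyone hitting the same mystery index: `btool` can show exactly which .conf file (and app) defines it, which may help trace where `_metrics` came from. A sketch assuming the default install path:

```shell
# Assumes the default install path; override SPLUNK_HOME if needed.
SPLUNK="${SPLUNK_HOME:-/opt/splunk}/bin/splunk"

# Print every setting contributing to the _metrics index definition,
# annotated with the file each value comes from
[ -x "$SPLUNK" ] && "$SPLUNK" btool indexes list _metrics --debug
```

If the index turns out to be genuinely unused, the same `repFactor = 0` pattern already applied to `[_introspection]` in the indexes.conf above would keep it out of replication; that is an assumption about your environment, not something verified in this thread.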



Communicator

Was it a rolling restart of all indexers, a reboot, or just a service restart that did the trick for you? We're in the same situation on Splunk 7.3, post-upgrade.

Contributor

Put the master in maintenance mode, then did a rolling reboot of all indexers, checking bucket status between reboots to ensure integrity. That said, I can't remember now whether I also tried manual rolling service restarts. (Automatic rolling restart from the master wasn't available due to that degraded index.)
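When the master's automatic rolling restart is available, the equivalent CLI sequence on the cluster master looks roughly like this. A sketch assuming the default install path; `--answer-yes` skips the confirmation prompt, and the `-x` guard skips the calls where Splunk isn't installed:

```shell
# Assumes the default install path; override SPLUNK_HOME if needed.
SPLUNK="${SPLUNK_HOME:-/opt/splunk}/bin/splunk"

# 1. Halt fixup activity while peers bounce
[ -x "$SPLUNK" ] && "$SPLUNK" enable maintenance-mode --answer-yes
# 2. Restart peers one at a time under the master's control
[ -x "$SPLUNK" ] && "$SPLUNK" rolling-restart cluster-peers
# 3. Once all peers report Up and buckets look healthy, resume fixups
[ -x "$SPLUNK" ] && "$SPLUNK" disable maintenance-mode --answer-yes
```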


Communicator

Oh, got you. I solved it via a rolling restart of the indexers and then waiting. This is a strange bug, though.

Esteemed Legend

Restart that indexer and give it time. After a while, try restarting the Cluster Master. Both these actions should be safe at any time and not result in a search or indexing outage, so long as only one indexer is wonky.


Contributor

Done it before, done it again just now - didn't help.
