Deployment Architecture

splunk indexer: status: "pending", fully searchable: "no". How to fix?

mitag
Contributor

After updating a bucket replication policy and doing a rolling restart of cluster indexers, one of the indexers seems stuck in this state:

[Screenshot: Indexer Clustering: Status]

Question: where do I go, what do I do, to figure out what's the root cause and how to fix it?

Cluster status in plaintext:
- Search Factor Not Met
- Replication Factor Not Met
- One of three indexers: Fully Searchable: No, Status: Pending.
- One out of 12 indexes shows problems with Searchable and Replicated Data Copies (the rest seem fine)
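To dig into the root cause, a few standard places to look (a sketch; paths assume a default /opt/splunk install, and the CLI commands below are run against a live cluster, so adjust hosts and auth to your environment):

```
# On the cluster master: overall SF/RF state plus per-peer and per-bucket detail
/opt/splunk/bin/splunk show cluster-status --verbose

# On the cluster master: list registered peers and their status
/opt/splunk/bin/splunk list cluster-peers

# On the affected peer: its own view of its registration state
/opt/splunk/bin/splunk list peer-info

# On any indexer: recent replication errors in splunkd.log
grep -E "BucketReplicator|S2SFileReceiver|CMSlave" /opt/splunk/var/log/splunk/splunkd.log | tail -50
```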

Under "Indexer Clustering: Service Activity" > "Snapshots", a number of "pending" tasks seem to be stuck and never move to "in progress" status:
- "Fixup Tasks - In Progress (0)"
- "Fixup Tasks - Pending":
-- Tasks to Meet Search Factor (4)
-- Tasks to Meet Replication Factor (6)
-- Tasks to Meet Generation (6)

Tasks to Meet Search Factor (4)
Bucket  Index   Trigger Condition   Trigger Time    Current State
_metrics~34~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85    _metrics    does not meet: sf & rf      Waiting 'target_wait_time' before search factor fixup
_metrics~34~64AE7236-EE5E-4EEE-AEBF-203F149FCB61    _metrics    does not meet: primality & sf & rf      Waiting 'target_wait_time' before search factor fixup
_metrics~35~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85    _metrics    does not meet: sf & rf      Waiting 'target_wait_time' before search factor fixup
_metrics~35~64AE7236-EE5E-4EEE-AEBF-203F149FCB61    _metrics    does not meet: sf & rf      Waiting 'target_wait_time' before search factor fixup

Tasks to Meet Replication Factor (6)
Bucket  Index   Trigger Condition   Trigger Time    Current State
_metrics~34~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85    _metrics    does not meet: sf & rf      Waiting 'target_wait_time' before replicating bucket
_metrics~34~64AE7236-EE5E-4EEE-AEBF-203F149FCB61    _metrics    does not meet: primality & sf & rf      Waiting 'target_wait_time' before replicating bucket
_metrics~35~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85    _metrics    does not meet: sf & rf      Waiting 'target_wait_time' before replicating bucket
_metrics~35~64AE7236-EE5E-4EEE-AEBF-203F149FCB61    _metrics    does not meet: sf & rf      Waiting 'target_wait_time' before replicating bucket
_metrics~36~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85    _metrics    non-streaming failure - src=64AE7236-EE5E-4EEE-AEBF-203F149FCB61 tgt=4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 failing=tgt
_metrics~37~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85    _metrics    non-streaming failure - src=9B5D3504-81B2-4DCC-BF4D-F7ED811A3571 tgt=4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 failing=tgt

... etc.
[Screenshot: Indexer Clustering: Service Activity]
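For reference when reading these fixup tables: the bucket names follow Splunk's clustered bucket ID convention, `<index>~<local bucket id>~<origin peer GUID>`, where the GUID identifies the peer that originally created the bucket. A minimal sketch of pulling the pieces apart (the helper name is mine, not a Splunk API):

```python
def parse_bucket_id(bid: str) -> dict:
    """Split a clustered bucket ID of the form <index>~<id>~<origin-guid>."""
    index, local_id, origin_guid = bid.split("~")
    return {"index": index, "local_id": int(local_id), "origin_guid": origin_guid}

# Example from the fixup table above:
print(parse_bucket_id("_metrics~34~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85"))
```

Comparing the origin GUIDs against the peer list (e.g. from `splunk show cluster-status --verbose`) shows which indexer the stuck buckets belong to.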

Some of the errors on the indexer(s):

04-03-2020 07:55:10.100 -0700 ERROR TcpInputProc - event=replicationData status=failed err="Close failed"
host = bvl-mit-splkin2 | source = /opt/splunk/var/log/splunk/splunkd.log | sourcetype = splunkd

04-03-2020 07:55:10.100 -0700 WARN  BucketReplicator - Failed to replicate warm bucket bid=_metrics~37~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 to guid=4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 host=10.101.128.89 s2sport=9887. Connection closed.
host = bvl-mit-splkin1 | source = /opt/splunk/var/log/splunk/splunkd.log | sourcetype = splunkd

04-03-2020 07:55:10.097 -0700 WARN  S2SFileReceiver - event=processFileSlice bid=_metrics~37~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 msg='aborting on local error'
host = bvl-mit-splkin2 | source = /opt/splunk/var/log/splunk/splunkd.log | sourcetype = splunkd

04-03-2020 07:55:10.097 -0700 ERROR S2SFileReceiver - event=onFileClosed replicationType=eJournalReplication bid=_metrics~37~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 state=eComplete src=64AE7236-EE5E-4EEE-AEBF-203F149FCB61 bucketType=warm status=failed err="bucket is already registered, registered not as a streaming hot target (SPL-90606)"
host = bvl-mit-splkin2 | source = /opt/splunk/var/log/splunk/splunkd.log | sourcetype = splunkd

04-03-2020 07:55:10.089 -0700 WARN  BucketReplicator - Failed to replicate warm bucket bid=_metrics~36~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 to guid=4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 host=10.101.128.89 s2sport=9887. Connection closed.
host = bvl-mit-splkin3 | source = /opt/splunk/var/log/splunk/splunkd.log | sourcetype = splunkd
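When many of these warnings pile up, it can help to tally which target GUID/host they point at (here they all blame the same target, 4C2AF0DE…). A small sketch of extracting the interesting fields from a BucketReplicator warning line (a hypothetical helper of mine, not a Splunk tool):

```python
import re

# Matches the "bid=... to guid=... host=..." portion of a BucketReplicator warning.
PATTERN = re.compile(r"bid=(?P<bid>\S+) to guid=(?P<guid>\S+) host=(?P<host>\S+)")

def parse_replication_warning(line: str):
    """Return bid, target guid, and host from a replication warning, or None."""
    m = PATTERN.search(line)
    return m.groupdict() if m else None

line = ("04-03-2020 07:55:10.100 -0700 WARN  BucketReplicator - Failed to replicate "
        "warm bucket bid=_metrics~37~4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 "
        "to guid=4C2AF0DE-E42F-489B-92FB-2CA3FC68AC85 host=10.101.128.89 "
        "s2sport=9887. Connection closed.")
print(parse_replication_warning(line))
```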

Additional notes:

Output of /opt/splunk/bin/splunk list peer-info on the peer:

slave
    base_generation_id:651
    is_registered:1
    last_heartbeat_attempt:0
    maintenance_mode:0
    registered_summary_state:3
    restart_state:NoRestart
    site:default
    status:Up

/opt/splunk/etc/master-apps/_cluster/local/indexes.conf on CM (successfully replicated to peers via /opt/splunk/bin/splunk apply cluster-bundle):

[default]
repFactor = auto

[_introspection]
repFactor = 0

[windows]
frozenTimePeriodInSecs = 31536000
coldToFrozenDir = $SPLUNK_DB/$_index_name/frozendb

[wineventlog]
frozenTimePeriodInSecs = 31536000
coldToFrozenDir = $SPLUNK_DB/$_index_name/frozendb
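If the _metrics index really is unused, one possible mitigation (an assumption on my part, not a confirmed fix; verify the impact before applying) would be to exclude it from replication the same way the existing _introspection stanza does, then redeploy the bundle:

```
[_metrics]
repFactor = 0
```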

Details:

  • Splunk Enterprise 8.0.2
  • mostly default settings

Thank you!

1 Solution

mitag
Contributor

Restarting all indexers (not just the one with errors) was what resolved the issue for me.

The root cause is still a mystery; I suspect a bug or a misconfiguration:

  • That degraded _metrics index you see in the screenshot above is not supposed to exist: we aren't collecting any metrics to my knowledge, and searching for them via e.g. | mcatalog values(_dims) WHERE index=* produces no results.
  • The _metrics index did seem to exist prior to this issue but does not exist now in our environment; it's unclear why it was created originally.


PranaySompalli
Explorer

We are continuing to observe this issue in version 8.2.5. Did Splunk ever fix this in later versions? We started observing it after moving to SmartStore and had not observed it before. Restarting the Cluster Manager fixes the issue, but it happens again at a later time.

We also do not collect metrics in our Splunk environment, and I can see a _metrics index on the indexer cluster for some reason.


KaraD
Community Manager
Community Manager

Hi @PranaySompalli! Thank you for your follow-up question. Can you please post your question as a new thread to help it gain more visibility and up-to-date answers? Thanks!

-Kara, Splunk Community Manager


JeremyHodgson
Explorer

Same issue here: upgraded from 7.3.x to 8.0.x, which added the _metrics index. Some indexers were replicating it even though repFactor = 0, while other indexers were not, which made the monitoring console upset.

Rolling restart resolved this issue for us, thank you!

season88481
Contributor

Thanks, the rolling restart fixed my issue as well.

tariq_mohammad
Engager

Thanks, it helped me. I did a rolling restart through the cluster master GUI.

shiv1593
Communicator

Was it a rolling restart of all indexers, a reboot, or just a service restart that did the trick for you? We're in the same situation on Splunk 7.3 post-upgrade.

mitag
Contributor

Put the master in maintenance mode, then did a rolling reboot of all indexers, checking bucket status between reboots to ensure integrity. That said, I can't remember now whether I also tried manual rolling service restarts. (Automatic rolling restart from the master wasn't available due to that degraded index.)
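For anyone following along, that procedure maps onto standard cluster master CLI commands (a sketch; run the maintenance-mode steps on the cluster master, restart peers one at a time, and check cluster status between restarts):

```
# On the cluster master: suppress bucket fixup while peers go down
/opt/splunk/bin/splunk enable maintenance-mode

# On each indexer in turn (verify cluster health before moving to the next):
# /opt/splunk/bin/splunk restart

# On the cluster master: confirm state, then resume normal fixup
/opt/splunk/bin/splunk show maintenance-mode
/opt/splunk/bin/splunk disable maintenance-mode
```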


shiv1593
Communicator

Oh got you. I solved it via rolling restart of the Indexers and then waiting. This is a strange bug, though.

woodcock
Esteemed Legend

Restart that indexer and give it time. After a while, try restarting the Cluster Master. Both actions should be safe at any time and should not cause a search or indexing outage, so long as only one indexer is wonky.

mitag
Contributor

didn't help
