We have a single-site indexer cluster with two indexers and one cluster master.
After a reboot of the search head, buckets on one of the indexers are no longer getting replicated.
Splunk settings and conditions:
Splunk Version: 6.3.1
SF/RF are not met
Clustering: single-site
Each indexer and the master has 12 cores and 1 TB of memory
Ulimit = 102400
THP (Transparent Huge Pages) is disabled
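For completeness, the OS-level settings above can be confirmed on each Linux host with something like the following (illustrative commands, assuming a standard Linux install):

ulimit -n                                          # open file descriptor limit (expect 102400 here)
cat /sys/kernel/mm/transparent_hugepage/enabled    # THP state; [never] means disabled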
/opt/splunk/bin/splunk show cluster-status
Replication factor not met
Search factor not met
All data is searchable
Indexing Ready = YES
reporting2.com-2-slave 05035B22-ECA4-4514-96AC-BE3BDF626D84 default
Searchable YES
Status Up
Bucket Count=713
reporting1.com-1-slave E7B1F3CE-FE08-454D-B41D-ED0346DE3671 default
Searchable YES
Status Up
Bucket Count=686
Telnet to both indexers connects successfully:
Connected to 10.XXX.XX
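For reference, the connectivity checks were roughly along these lines (7887 is the replication port and 9997 the forwarding port, both taken from the logs further down; IPs are masked as in the rest of this post):

telnet 10.196.XXX.23 9997    # search head -> indexer splunktcp/forwarding port
telnet 169.254.0.4 7887      # indexer -> indexer replication port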
Things I tried:
Tuning some parameters in server.conf, such as:
heartbeat_timeout = 600 on the CM
heartbeat_period = 10 on the peers
(example stanzas are sketched below)
We haven't restored any reporting data; the data was freshly indexed.
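For reference, this is roughly where those heartbeat settings were applied in server.conf (an illustrative sketch of what we tried, not a recommendation):

On the cluster master:
[clustering]
mode = master
heartbeat_timeout = 600

On each peer:
[clustering]
mode = slave
heartbeat_period = 10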
Cluster master logs:
id=ib_threatdb_a~33~05035B22-ECA4-4514-96AC-BE3BDF626D84 tgtGuid=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 tgtHP=169.XXX.0.4:7089 tgtRP=7887 useSSL=false
04-29-2020 11:02:18.094 +0000 INFO CMMaster - replication error src=05035B22-ECA4-4514-96AC-BE3BDF626D84 tgt=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 failing=tgt bid=ib_threatdb_a~33~05035B22-ECA4-4514-96AC-BE3BDF626D84
04-29-2020 11:02:18.094 +0000 INFO CMReplicationRegistry - Finished replication: bid=ib_threatdb_a~33~05035B22-ECA4-4514-96AC-BE3BDF626D84 src=05035B22-ECA4-4514-96AC-BE3BDF626D84 target=E7B1F3CE-FE08-454D-B41D-ED0346DE3671
04-29-2020 11:02:18.094 +0000 INFO CMPeer - peer=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 peer_name=reporting1.com-1-slave transitioning from=Up to=Pending reason="non-streaming failure"
04-29-2020 11:02:18.094 +0000 INFO CMMaster - event=handleReplicationError bid=ib_threatdb_a~33~05035B22-ECA4-4514-96AC-BE3BDF626D84 tgt=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 peer_name=reporting1.com-1-slave msg='target doesn't have bucket now. ignoring'
After some time, the cluster master logs start flooding with messages like this:
04-28-2020 15:43:08.302 +0000 INFO CMPeer - peer=05035B22-ECA4-4514-96AC-BE3BDF626D84 peer_name=reporting2.com-2-slave transitioning from=Pending to=Up reason="heartbeat received."
Reporting 1 indexer logs:
04-29-2020 14:03:17.588 +0000 ERROR RawdataHashMarkReader - Error parsing rawdata inside bucket path="/opt/splunk/var/lib/splunk/ib_threatdb_a/db/rb_1588024289_1588024289_1848_E7B1F3CE-FE08-454D-B41D-ED0346DE3671": msg="Bad opcode: 2B"
04-29-2020 14:03:17.588 +0000 INFO BucketReplicator - Created asyncReplication task to replicate bucket ib_threatdb_a~1848~E7B1F3CE-FE08-454D-B41D-ED0346DE3671 to guid=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 host=169.254.0.4 s2sport=7887 bid=ib_threatdb_a~1848~E7B1F3CE-FE08-454D-B41D-ED0346DE3671
04-29-2020 14:03:17.588 +0000 INFO BucketReplicator - event=startBucketReplication bid=ib_threatdb_a~1848~E7B1F3CE-FE08-454D-B41D-ED0346DE3671
04-29-2020 14:03:17.588 +0000 WARN BucketReplicator - Failed to replicate warm bucket bid=ib_threatdb_a~1850~E7B1F3CE-FE08-454D-B41D-ED0346DE3671 to guid=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 host=169.254.0.4 s2sport=7887. Connection closed.
04-29-2020 14:03:17.588 +0000 INFO CMReplicationRegistry - Finished replication: bid=ib_threatdb_a~1850~E7B1F3CE-FE08-454D-B41D-ED0346DE3671 src=05035B22-ECA4-4514-96AC-BE3BDF626D84 target=E7B1F3CE-FE08-454D-B41D-ED0346DE3671
04-29-2020 14:03:17.588 +0000 INFO CMSlave - bid=ib_threatdb_a~1850~E7B1F3CE-FE08-454D-B41D-ED0346DE3671 src=05035B22-ECA4-4514-96AC-BE3BDF626D84 tgt=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 failing=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 queued replication error job
Reporting 1 search head logs:
04-29-2020 11:13:09.778 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.XXX.23:9997 timed out
04-29-2020 11:13:39.941 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.XXX.23:9997 timed out
04-29-2020 11:14:39.859 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.XXX.23:9997 timed out
We see these warnings continuously.
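Since the timeouts are about forwarding from the search heads to the indexers' receiving port, here is a minimal sketch of the config involved (the group name primary_indexers is made up for illustration; port 9997 and the IPs are the values seen in the warnings):

outputs.conf on the search heads:
[tcpout:primary_indexers]
server = 10.196.XXX.23:9997, 10.196.107.28:9997

inputs.conf on the indexers:
[splunktcp://9997]
disabled = 0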
Reporting 2 indexer logs:
04-29-2020 11:29:36.254 +0000 ERROR TcpInputProc - event=replicationData status=failed err="Could not open file for bid=ib_threatdb_a~29~05035B22-ECA4-4514-96AC-BE3BDF626D84 err="Cannot find config for idx=ib_threatdb_a" (No such file or directory)"
04-29-2020 11:29:36.257 +0000 ERROR TcpInputProc - event=replicationData status=failed err="Could not open file for bid=ib_threatdb_a~30~05035B22-ECA4-4514-96AC-BE3BDF626D84 err="Cannot find config for idx=ib_threatdb_a" (No such file or directory)"
04-29-2020 11:29:38.089 +0000 INFO ClusterMasterPeerHandler - master is not enabled on this node
04-29-2020 11:29:40.612 +0000 INFO ClusterMasterPeerHandler - master is not enabled on this node
04-29-2020 11:29:44.279 +0000 ERROR TcpInputProc - event=replicationData status=failed err="Could not open file for bid=ib_threatdb_a~31~05035B22-ECA4-4514-96AC-BE3BDF626D84 err="Cannot find config for idx=ib_threatdb_a" (Success)"
04-29-2020 11:29:44.938 +0000 INFO ClusterMasterPeerHandler - master is not enabled on this node
04-29-2020 11:29:45.488 +0000 INFO ClusterMasterPeerHandler - master is not enabled on this node
04-29-2020 11:29:51.481 +0000 INFO ClusterMasterPeerHandler - master is not enabled on this node
04-29-2020 11:29:52.300 +0000 ERROR TcpInputProc - event=replicationData status=failed err="Could not open file for bid=ib_threatdb_a~32~05035B22-ECA4-4514-96AC-BE3BDF626D84 err="Cannot find config for idx=ib_threatdb_a" (Success)"
04-29-2020 11:29:53.124 +0000 INFO ClusterMasterPeerHandler - master is not enabled on this node
Reporting 2 search head logs:
04-29-2020 10:53:19.465 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.107.28:9997 timed out
04-29-2020 10:53:49.467 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.107.28:9997 timed out
04-29-2020 10:54:19.245 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.107.28:9997 timed out
04-29-2020 10:54:49.245 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.107.28:9997 timed out
04-29-2020 10:56:49.243 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.107.28:9997 timed out
04-29-2020 10:57:19.244 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.107.28:9997 timed out
04-29-2020 10:57:29.528 +0000 FATAL ProcessRunner - Unexpected EOF from process runner child!
04-29-2020 10:57:29.528 +0000 ERROR ProcessRunner - helper process seems to have died (child killed by signal 15: Terminated)!
Please let me know your views, guys; any suggestions would be of great help. I tried several things from googling and other relevant Splunk Answers posts, but still couldn't find the issue. This is kind of a deadlock for us.

Update:
We ended up contacting support for this, and after analysis we found it had nothing to do with Splunk itself: after an upgrade of a node, certain stanzas of the indexes.conf configuration were lost. This can happen when an app is restored from a previous version and the newer stanzas are dropped. The bottom line is that configuration restores during an upgrade need to be handled judiciously in your deployment, otherwise SF/RF can go for a toss.
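For anyone hitting the same "Cannot find config for idx=..." errors: a quick way to confirm whether an index definition survived the upgrade is btool, and the clustered stanza needs to be present on both peers (normally pushed from the master's master-apps). The stanza below is only an illustrative sketch; the paths and settings are assumptions, not our actual configuration.

/opt/splunk/bin/splunk btool indexes list ib_threatdb_a --debug

# roughly what the restored stanza should look like in indexes.conf on both peers
[ib_threatdb_a]
homePath   = $SPLUNK_DB/ib_threatdb_a/db
coldPath   = $SPLUNK_DB/ib_threatdb_a/colddb
thawedPath = $SPLUNK_DB/ib_threatdb_a/thaweddb
repFactor  = auto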