We have a single-site indexer cluster with two indexers and one cluster master.
After a reboot of the search head, buckets on one of the indexers are no longer getting replicated.
Splunk settings and conditions:
Splunk Version: 6.3.1
SF/RF are not met
Clustering: single-site
Each indexer and the master has 12 cores and 1 TB of memory
Ulimit = 102400
THP (Transparent Huge Pages) is disabled
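For completeness, the OS-level settings above can be confirmed on each Linux host with something like the following (illustrative commands, assuming a standard Linux install):

ulimit -n                                          # open file descriptor limit (expect 102400 here)
cat /sys/kernel/mm/transparent_hugepage/enabled    # THP state; [never] means disabled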
/opt/splunk/bin/splunk show cluster-status
Replication factor not met
Search factor not met
All data is searchable
Indexing Ready = YES
reporting2.com-2-slave 05035B22-ECA4-4514-96AC-BE3BDF626D84 default
Searchable YES
Status Up
Bucket Count=713
reporting1.com-1-slave E7B1F3CE-FE08-454D-B41D-ED0346DE3671 default
Searchable YES
Status Up
Bucket Count=686
Telnet to both indexers connects successfully:
Connected to 10.XXX.XX
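For reference, the connectivity checks were roughly along these lines (7887 is the replication port and 9997 the forwarding port, both taken from the logs further down; IPs are masked as in the rest of this post):

telnet 10.196.XXX.23 9997    # search head -> indexer splunktcp/forwarding port
telnet 169.254.0.4 7887      # indexer -> indexer replication port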
Things I tried:
Tuning some parameters in server.conf, such as:
heartbeat_timeout = 600 on the CM
heartbeat_period = 10 on the peers
(example stanzas are sketched below)
We haven't restored any reporting data; the data was freshly indexed.
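For reference, this is roughly where those heartbeat settings were applied in server.conf (an illustrative sketch of what we tried, not a recommendation):

On the cluster master:
[clustering]
mode = master
heartbeat_timeout = 600

On each peer:
[clustering]
mode = slave
heartbeat_period = 10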
Cluster master logs:
id=ib_threatdb_a~33~05035B22-ECA4-4514-96AC-BE3BDF626D84 tgtGuid=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 tgtHP=169.XXX.0.4:7089 tgtRP=7887 useSSL=false
04-29-2020 11:02:18.094 +0000 INFO CMMaster - replication error src=05035B22-ECA4-4514-96AC-BE3BDF626D84 tgt=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 failing=tgt bid=ib_threatdb_a~33~05035B22-ECA4-4514-96AC-BE3BDF626D84
04-29-2020 11:02:18.094 +0000 INFO CMReplicationRegistry - Finished replication: bid=ib_threatdb_a~33~05035B22-ECA4-4514-96AC-BE3BDF626D84 src=05035B22-ECA4-4514-96AC-BE3BDF626D84 target=E7B1F3CE-FE08-454D-B41D-ED0346DE3671
04-29-2020 11:02:18.094 +0000 INFO CMPeer - peer=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 peer_name=reporting1.com-1-slave transitioning from=Up to=Pending reason="non-streaming failure"
04-29-2020 11:02:18.094 +0000 INFO CMMaster - event=handleReplicationError bid=ib_threatdb_a~33~05035B22-ECA4-4514-96AC-BE3BDF626D84 tgt=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 peer_name=reporting1.com-1-slave msg='target doesn't have bucket now. ignoring'
After some time, the cluster master logs start flooding with messages like this:
04-28-2020 15:43:08.302 +0000 INFO CMPeer - peer=05035B22-ECA4-4514-96AC-BE3BDF626D84 peer_name=reporting2.com-2-slave transitioning from=Pending to=Up reason="heartbeat received."
Reporting 1 indexer logs:
04-29-2020 14:03:17.588 +0000 ERROR RawdataHashMarkReader - Error parsing rawdata inside bucket path="/opt/splunk/var/lib/splunk/ib_threatdb_a/db/rb_1588024289_1588024289_1848_E7B1F3CE-FE08-454D-B41D-ED0346DE3671": msg="Bad opcode: 2B"
04-29-2020 14:03:17.588 +0000 INFO BucketReplicator - Created asyncReplication task to replicate bucket ib_threatdb_a~1848~E7B1F3CE-FE08-454D-B41D-ED0346DE3671 to guid=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 host=169.254.0.4 s2sport=7887 bid=ib_threatdb_a~1848~E7B1F3CE-FE08-454D-B41D-ED0346DE3671
04-29-2020 14:03:17.588 +0000 INFO BucketReplicator - event=startBucketReplication bid=ib_threatdb_a~1848~E7B1F3CE-FE08-454D-B41D-ED0346DE3671
04-29-2020 14:03:17.588 +0000 WARN BucketReplicator - Failed to replicate warm bucket bid=ib_threatdb_a~1850~E7B1F3CE-FE08-454D-B41D-ED0346DE3671 to guid=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 host=169.254.0.4 s2sport=7887. Connection closed.
04-29-2020 14:03:17.588 +0000 INFO CMReplicationRegistry - Finished replication: bid=ib_threatdb_a~1850~E7B1F3CE-FE08-454D-B41D-ED0346DE3671 src=05035B22-ECA4-4514-96AC-BE3BDF626D84 target=E7B1F3CE-FE08-454D-B41D-ED0346DE3671
04-29-2020 14:03:17.588 +0000 INFO CMSlave - bid=ib_threatdb_a~1850~E7B1F3CE-FE08-454D-B41D-ED0346DE3671 src=05035B22-ECA4-4514-96AC-BE3BDF626D84 tgt=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 failing=E7B1F3CE-FE08-454D-B41D-ED0346DE3671 queued replication error job
Reporting 1 search head logs:
04-29-2020 11:13:09.778 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.XXX.23:9997 timed out
04-29-2020 11:13:39.941 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.XXX.23:9997 timed out
04-29-2020 11:14:39.859 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.XXX.23:9997 timed out
We see these warnings continuously.
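Since the timeouts are about forwarding from the search heads to the indexers' receiving port, here is a minimal sketch of the config involved (the group name primary_indexers is made up for illustration; port 9997 and the IPs are the values seen in the warnings):

outputs.conf on the search heads:
[tcpout:primary_indexers]
server = 10.196.XXX.23:9997, 10.196.107.28:9997

inputs.conf on the indexers:
[splunktcp://9997]
disabled = 0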
Reporting 2 indexer logs:
04-29-2020 11:29:36.254 +0000 ERROR TcpInputProc - event=replicationData status=failed err="Could not open file for bid=ib_threatdb_a~29~05035B22-ECA4-4514-96AC-BE3BDF626D84 err="Cannot find config for idx=ib_threatdb_a" (No such file or directory)"
04-29-2020 11:29:36.257 +0000 ERROR TcpInputProc - event=replicationData status=failed err="Could not open file for bid=ib_threatdb_a~30~05035B22-ECA4-4514-96AC-BE3BDF626D84 err="Cannot find config for idx=ib_threatdb_a" (No such file or directory)"
04-29-2020 11:29:38.089 +0000 INFO ClusterMasterPeerHandler - master is not enabled on this node
04-29-2020 11:29:40.612 +0000 INFO ClusterMasterPeerHandler - master is not enabled on this node
04-29-2020 11:29:44.279 +0000 ERROR TcpInputProc - event=replicationData status=failed err="Could not open file for bid=ib_threatdb_a~31~05035B22-ECA4-4514-96AC-BE3BDF626D84 err="Cannot find config for idx=ib_threatdb_a" (Success)"
04-29-2020 11:29:44.938 +0000 INFO ClusterMasterPeerHandler - master is not enabled on this node
04-29-2020 11:29:45.488 +0000 INFO ClusterMasterPeerHandler - master is not enabled on this node
04-29-2020 11:29:51.481 +0000 INFO ClusterMasterPeerHandler - master is not enabled on this node
04-29-2020 11:29:52.300 +0000 ERROR TcpInputProc - event=replicationData status=failed err="Could not open file for bid=ib_threatdb_a~32~05035B22-ECA4-4514-96AC-BE3BDF626D84 err="Cannot find config for idx=ib_threatdb_a" (Success)"
04-29-2020 11:29:53.124 +0000 INFO ClusterMasterPeerHandler - master is not enabled on this node
Reporting 2 search head logs:
04-29-2020 10:53:19.465 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.107.28:9997 timed out
04-29-2020 10:53:49.467 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.107.28:9997 timed out
04-29-2020 10:54:19.245 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.107.28:9997 timed out
04-29-2020 10:54:49.245 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.107.28:9997 timed out
04-29-2020 10:56:49.243 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.107.28:9997 timed out
04-29-2020 10:57:19.244 +0000 WARN TcpOutputProc - Cooked connection to ip=10.196.107.28:9997 timed out
04-29-2020 10:57:29.528 +0000 FATAL ProcessRunner - Unexpected EOF from process runner child!
04-29-2020 10:57:29.528 +0000 ERROR ProcessRunner - helper process seems to have died (child killed by signal 15: Terminated)!
Please let me know your views, guys; any suggestions would be of great help. I tried several things from googling and other relevant Splunk Answers posts, but still couldn't find the issue. This is kind of a deadlock for us.

Update:
We ended up contacting support for this, and after analysis we found it had nothing to do with Splunk itself: after an upgrade of a node, certain stanzas of the indexes.conf configuration were lost. This can happen when an app is restored from a previous version and the newer stanzas are dropped. The bottom line is that configuration restores during an upgrade need to be handled judiciously in your deployment, otherwise SF/RF can go for a toss.
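For anyone hitting the same "Cannot find config for idx=..." errors: a quick way to confirm whether an index definition survived the upgrade is btool, and the clustered stanza needs to be present on both peers (normally pushed from the master's master-apps). The stanza below is only an illustrative sketch; the paths and settings are assumptions, not our actual configuration.

/opt/splunk/bin/splunk btool indexes list ib_threatdb_a --debug

# roughly what the restored stanza should look like in indexes.conf on both peers
[ib_threatdb_a]
homePath   = $SPLUNK_DB/ib_threatdb_a/db
coldPath   = $SPLUNK_DB/ib_threatdb_a/colddb
thawedPath = $SPLUNK_DB/ib_threatdb_a/thaweddb
repFactor  = auto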