Background:
There are two types of ACKs in play here.
Scenario:
Right now I have a bunch of indexers split between two sites, none of which are clustered together. I would like to set up a multi-site indexing cluster with SF=2, RF=4 (origin:2, site1: 2). I'll be turning on useACK on the forwarders so I don't lose any data. My team does Disaster Recovery testing, and I want to make sure that Splunk will still work (forwarders will get ACKs, indexers will index data, etc.) during the DR test. The DR test itself will last 48 hours and consists of severing the links connecting the two data centers (siteA, siteB).
Splunk in both sites must continue to function properly during the test & when the links are brought back online.
The concern with this is that the indexer receiving the chunk of data from the forwarder MUST successfully complete RF - 1 replications of the raw data to other peers before the ACK is sent back to the forwarder. With the WAN links disabled between the two sites, the best an indexer will ever be able to muster is 1 replica, so it will never reach the 2 replicas required to return the ACK to the forwarder. We would then lose all visibility into what's going on in the environment, which would be a show stopper for rolling out multi-site indexer clustering.
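For reference, the setup described in the scenario could be sketched roughly as follows. This is only an illustration under several assumptions: Splunk requires site names of the form site<N>, so siteA and siteB are mapped here to site1 and site2; the search factor line is one possible reading of "SF=2"; and the output group name and indexer host names are hypothetical. Newer Splunk versions use `mode = manager` in place of `mode = master`.

```ini
# server.conf on the cluster manager (CM)
# Assumption: siteA -> site1, siteB -> site2
[general]
site = site1

[clustering]
mode = master
multisite = true
available_sites = site1,site2
# RF=4 overall, with 2 copies kept on the site where the data originates
site_replication_factor = origin:2,total:4
# Assumed interpretation of SF=2
site_search_factor = origin:1,total:2

# outputs.conf on each forwarder
# Group name and host names below are hypothetical
[tcpout:clustered_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
useACK = true
```

The peers would additionally need their own `site` value and a pointer to the manager, which are omitted here.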
What's going to happen?
As with most things in Splunk, there are timeouts governing how long the indexer will wait for an ACK from another peer. From server.conf:
rep_max_send_timeout = <seconds> : Maximum send timeout for sending replication slice data between cluster nodes.
rep_max_rcv_timeout = <seconds> : Maximum cumulative receive timeout for receiving acknowledgement data from peers.
The default for both of these is 600 seconds.
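As a rough sketch, these two settings live in the [clustering] stanza of server.conf on the peers. The values shown simply restate the 600-second default mentioned above; defaults may differ between Splunk versions, so check the server.conf spec for your release before relying on them.

```ini
# server.conf on each peer (indexer)
[clustering]
# Upper bound, in seconds, on sending replication slice data to a target peer
rep_max_send_timeout = 600
# Upper bound, in seconds, on waiting for acknowledgement data from target peers
rep_max_rcv_timeout = 600
```

Lowering these values should, in principle, shorten how long an indexer waits on an unreachable peer during the DR test, but that is an assumption worth validating in a test environment first.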
The receiving node sends the ACK to the forwarder when it gets notification from each of the target peers of either
1) successful write
2) unsuccessful write
Or, to put it slightly differently, replication success or replication failure. Either way, the forwarder will receive an ACK and keep sending data to the indexers.
When the link is terminated, hot buckets will time out after the period(s) defined above and roll to warm. The buckets are still searchable, and Splunk continues to hum along.
Additional information:
When the WAN links are re-established, newly created buckets will attempt to replicate to the indexers on the other site. The behavior doesn't change while the connection is down; the indexer just fails to make a new connection for streaming_replication.
When buckets roll to warm due to replication timeouts during their hot life, fixup activity will continue even when there is a partition between sites. When the connection is re-established, fixup activity will resume, as the CM can now see both sites again and work towards fully satisfying RF & SF.
If the CM is on one of the sites and unable to communicate with the other site during this partitioning, the CM will consider the other site "down" and will not attempt to fix up to the other site. Indexing in the other site will continue as if the CM has disappeared. Replications for new hot buckets will use the last set of targets.
I will add to this answer if further information about this scenario is needed.
Please be aware that some WAN optimizer products (like RiverBed) do some magic with ACKs. So you have to exclude Splunk from WAN optimization to prevent any weird behavior (also for Indexer <--> Forwarder)!
Just for information.