I need to understand in detail how indexer acknowledgement interacts with cluster replication, specifically at what point the chain of acknowledgement ends and the forwarder is able to release the data from memory. The point is to get a guarantee that no data (event) will be lost. At which point does the indexer consider the data to be indexed (and send an acknowledgement back to the forwarder):
A) When the first indexer persists it to disk?
B) When the Cluster Master has finished replicating the data throughout the cluster?
The scenario is this, with indexer acknowledgement (useACK=true) set in every outputs.conf down the chain:
Via forwarding/outputs process: Universal Forwarder -> Heavy Forwarder -> Indexer
Via replication process: Indexer -> replication peer indexers
If the event has been persisted by the first Indexer (and thus an acknowledgement has gone back to the forwarder, which then forgets the event), but this Indexer hard crashes (e.g. unrecoverable disk corruption) before the event is replicated to a peer, do you now have a missing event?
If, once the indexer has acknowledged, data integrity depends on Splunk clustering (not indexer replication) to ensure that the above crash situation does not lead to data loss, is cluster replication of every single written event guaranteed?
I've read the following and it does not cover this case:
I believe this section of the docs does answer your question:
If all goes well, the receiving peer:
1. receives the block of data, parses and indexes it, and writes the data (raw data and index data) to the file system.
2. streams copies of the raw data to each of its target peers.
3. sends an acknowledgment back to the forwarder.
The acknowledgment assures the forwarder that the data was successfully written to the cluster. Upon receiving the acknowledgment, the forwarder releases the block from memory.
In other words, the ack does not get sent back to the forwarder until the source peer (i.e., the one that receives the data from the forwarder) has replicated the data to its target peers. So, if the source indexer crashes before it replicates the data to the other peers, the forwarder will not get an acknowledgement.
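To make the ordering concrete, here is a toy model of the sequence described above. It is only a sketch of the documented order of operations (write, stream copies, then acknowledge), not how Splunk is implemented; the class names `Forwarder`, `SourcePeer`, and `TargetPeer`, the `crash_before_replication` flag, and the peer names are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TargetPeer:
    """Hypothetical replication target that stores copies of raw data."""
    name: str
    copies: list = field(default_factory=list)

    def receive(self, block: str) -> None:
        self.copies.append(block)


@dataclass
class SourcePeer:
    """Hypothetical peer that receives data directly from the forwarder."""
    targets: list
    crash_before_replication: bool = False  # illustrative failure switch
    indexed: list = field(default_factory=list)

    def handle(self, block: str) -> bool:
        # Step 1: parse, index, and write the data locally.
        self.indexed.append(block)
        if self.crash_before_replication:
            # Simulated hard crash after the local write, before replication.
            raise ConnectionError("source peer crashed before replication")
        # Step 2: stream a copy of the data to each target peer.
        for target in self.targets:
            target.receive(block)
        # Step 3: only now acknowledge to the forwarder.
        return True


@dataclass
class Forwarder:
    """Hypothetical forwarder that holds blocks until they are acknowledged."""
    wait_queue: list = field(default_factory=list)

    def send(self, block: str, peer: SourcePeer) -> bool:
        self.wait_queue.append(block)  # keep the block until it is acked
        try:
            acked = peer.handle(block)
        except ConnectionError:
            acked = False
        if acked:
            self.wait_queue.remove(block)  # release from memory on ack
        return acked


targets = [TargetPeer("peer-b"), TargetPeer("peer-c")]
fwd = Forwarder()
ok = fwd.send("event-1", SourcePeer(targets))
failed = fwd.send("event-2", SourcePeer(targets, crash_before_replication=True))
print(fwd.wait_queue)  # ['event-2']
```

In this model, event-2 never leaves the forwarder's queue because the crash happens between steps 1 and 2, so the forwarder still holds the block and can resend it to another peer.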
You are quite right, that part of the doc does appear to be fairly clear about my case. Thanks for your answer.
What is sufficient to complete step 2 (streaming copies to the target peers)? Will the Cluster Master wait for the replication factor to be met before sending an ACK to the forwarder? What about the search factor?
Consider a multi-site replication environment and the scenario where one site is down. The replication factor won't be met during the outage, and thus the forwarder will not receive an ACK and will stop forwarding data. So in a DR scenario, you won't get real-time data. Is that correct?
Does this still hold true for version 6.6.4 and later?