We have a question about a specific data forwarding use case, and we would like to know whether it carries any risk.
Let's say we have two Splunk platforms: one has a set of indexers (indexer1) that stores data for a specific use case, and the other has its own indexers (indexer2) for another use case. At some point, these two platforms will have to collect data from the same machines; not necessarily the same data, but it could be.
So there is a forwarder on a remote machine, and let's say we have to collect the same file and forward it to both platforms. We are using the data cloning technique, for example in inputs.conf:
[monitor://file_path]
index = …
sourcetype = ...
_TCP_ROUTING = indexer1,indexer2
Or with outputs.conf
Or by using props/transforms to alter the routing of events on an intermediate heavy forwarder. The exact method doesn't matter to us (but please tell us if any of these methods makes a significant difference in this situation).
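For illustration, the two other variants could look like the sketch below. The group names indexer1 and indexer2 are the ones used above; the server addresses, port 9997, the sourcetype name and the transform name are hypothetical placeholders, not values from our deployment.

```ini
# outputs.conf on the forwarder (sketch; server addresses are hypothetical)
[tcpout]
# Listing two target groups clones every event to both of them
defaultGroup = indexer1,indexer2

[tcpout:indexer1]
server = idx1-a.example.com:9997,idx1-b.example.com:9997

[tcpout:indexer2]
server = idx2-a.example.com:9997

# props.conf on an intermediate heavy forwarder (hypothetical sourcetype name)
[my_sourcetype]
TRANSFORMS-routing = route_to_both_platforms

# transforms.conf on the same heavy forwarder (hypothetical transform name)
[route_to_both_platforms]
# Match every event and overwrite its routing key with both groups
REGEX = .
DEST_KEY = _TCP_ROUTING
FORMAT = indexer1,indexer2
```

In every variant the events end up routed to the same two tcpout groups, so the queueing behavior in question is that of the output groups defined in outputs.conf.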
Now the question is: in general, is there a risk that one of the two platforms will stop receiving events if the other is down, depending on the configuration of the indexers/forwarders?
We have heard of the concept of indexer acknowledgment, but we are not sure whether it has any impact on this situation. For example, if the group indexer1 is configured with acknowledgment enabled, is there any risk that the group indexer2 won't receive data when indexer1 is not acknowledging receipt?
This topic is a little confusing for us. We have heard claims that data forwarding can be blocked when another platform also needs to receive the data and one of the indexers is down, but that doesn't seem right. We just want to clarify whether the setup described above would cause any issue.
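To make the acknowledgment scenario concrete, the setup we have in mind would look roughly like this in outputs.conf. This is only a sketch: the server addresses are hypothetical, and useACK is the setting we understand to control indexer acknowledgment per output group.

```ini
# outputs.conf (sketch; server addresses are hypothetical)
[tcpout:indexer1]
server = idx1-a.example.com:9997
# Indexer acknowledgment enabled for this group only
useACK = true

[tcpout:indexer2]
server = idx2-a.example.com:9997
# useACK is not set here, so this group keeps the default (false, as far as we know)
```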
Thank you very much for your help
We learned about the parameters in outputs.conf to consider when configuring the behavior of the queues:
- dropEventsOnQueueFull =
- dropClonedEventsOnQueueFull =
- blockOnCloning =
It is indeed possible to block data collection on a Splunk instance (in a data cloning configuration) if you don't pay attention to these parameters. In the small test we ran, the default value of dropClonedEventsOnQueueFull kept data collection from blocking. However, we also have to watch out for dropEventsOnQueueFull, which with its default value can cause data forwarding issues when a Splunk instance is unavailable. It also depends on whether or not you can accept data loss in your deployment.
Very interesting parameters to know about.
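As a sketch of where these settings go, here is an outputs.conf fragment with each parameter set to what we understand to be its default. The values and the comments reflect our reading of the outputs.conf specification and should be checked against the documentation for your Splunk version.

```ini
# outputs.conf (sketch; values are the defaults as we understand them)
[tcpout]
defaultGroup = indexer1,indexer2

# Seconds to wait on a full queue for one cloned group before dropping
# events for that group; -1 means never drop, so one unavailable group
# blocks all of them
dropClonedEventsOnQueueFull = 5

# -1 means never drop new events: the output queue blocks when it is full
dropEventsOnQueueFull = -1

# true means block instead of dropping events when all cloned groups are down
blockOnCloning = true
```

Setting dropClonedEventsOnQueueFull to -1 would favor completeness for both platforms at the cost of blocking, while a positive value favors availability at the cost of possible data loss for the unavailable group.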
If a forwarder cannot send data to both indexers then it will not send the data at all. It will be queued until all destinations are available. It doesn't seem right, but it is.
Your comment is interesting. It might not answer my specific question, but it may raise another issue that we didn't anticipate.
So wait, is this true even if the group indexer1 has acknowledgment enabled and the group indexer2 has the default setting (acknowledgment disabled)?
And do you mean that if both groups are down, data forwarding will only restart when both groups are up (at least one indexer available in each group)?
I would have thought that if the group indexer2 doesn't care about acknowledgment, data would still be forwarded to it anyway, even if the group indexer1 is unavailable.
Giving a forwarder two outputs means the forwarder must send its data to two destinations. If either destination is blocked for any reason (no ACK, indexer is down, etc.) then the other destination is treated as though it is also blocked.
Are you sure about this? Is this behavior documented somewhere?
I just tested the following:
- Configured a tcpout group (dummy_output_group) that refers to a non-existent indexer
- Configured an input sending the data to a group of existing indexers (indexer) and to dummy_output_group:
index = main
sourcetype = test_data_forwarding
_TCP_ROUTING = indexer, dummy_output_group
I refreshed the configuration of the Splunk server, and it took the new dummy output group into account. The group indexer correctly indexed all the events from the test file, even past the point where the dummy output group reached its maximum queue size and started to drop events.
Doesn't this test represent a case where one output group is unavailable? Yet the data is still received by the other group.
Am I missing something?
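The outputs.conf side of such a test could be sketched as follows. The group names are the ones used in the input above; the server addresses are placeholders made up for illustration, since the actual values are not shown here.

```ini
# outputs.conf (sketch of the test; both addresses are hypothetical)
[tcpout:indexer]
# Group of existing, reachable indexers
server = existing-indexer.example.com:9997

[tcpout:dummy_output_group]
# Address where no indexer is listening, so this group never connects
server = 192.0.2.10:9997
```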
I have not seen this behavior documented anywhere. It was passed on to me by other Splunk admins. Your test does seem to contradict it, however.