How long are we talking for the outages? Is this a flaky network connection from your HFs to Splunk Cloud? Is it a few minutes, or are we talking hours? How many events would need to be queued to ride this out (HF queues deal in event counts)?
If it's not too long, I think you are on the right path with your queues, but they need to be on the other side of parsing: the output queue. maxQueueSize in outputs.conf on the HF can be raised to hold a large number of events. This is, of course, resource intensive, but why else have HFs as the funnel if they aren't there to be used?
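A minimal sketch of what that looks like on the HF, assuming an output group named splunkcloud (the group name, server, and size here are illustrative, not your real values):

```
# outputs.conf on the HF -- sketch; group name, server, and size are illustrative
[tcpout]
defaultGroup = splunkcloud

[tcpout:splunkcloud]
# hypothetical endpoint; normally set by the Splunk Cloud forwarder app
server = inputs.example.splunkcloud.com:9997
# maxQueueSize takes a bare integer (event count) or a size like 512KB/10MB/1GB.
# This queue lives in memory, so size it against the RAM available on the HF.
maxQueueSize = 1GB
```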
Now, theoretically, if your HF's output queues all fill, then its parsing queues fill, and the back-pressure reaches your forwarders as the HF refuses to accept the data.
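That blocking behavior is controlled by blockOnQueueFull in outputs.conf, which defaults to true (block upstream rather than drop events); a sketch, reusing the assumed group name from above:

```
# outputs.conf on the HF -- sketch
[tcpout:splunkcloud]
# Default is true: when the output queue fills, back-pressure propagates
# upstream instead of events being dropped. Leave it true for this use case.
blockOnQueueFull = true
```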
All this said, what is happening with your connection to Splunk Cloud that makes this such a big concern? I'd be looking into fixing that (if possible).
This is just part of the risk planning for cloud migration.
We have HFs acting as intermediate forwarders that collect data from on-prem forwarders and send it on to Splunk Cloud.
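For reference, the UF side of that funnel looks roughly like this (host names are hypothetical):

```
# outputs.conf on each on-prem UF -- sketch; host names are hypothetical
[tcpout]
defaultGroup = onprem_hfs

[tcpout:onprem_hfs]
# UFs load-balance across the intermediate HFs
server = hf1.example.com:9997, hf2.example.com:9997
```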
So essentially, after the queues fill, the HF will stop accepting data. Does that mean the UFs also won't send any data? Assuming that once the link between the HF and Splunk Cloud is back up, the UFs will resume sending from their last read location and the HF will start accepting data again, I believe this won't result in any data loss?
Update: to answer your first query, we are looking to support at least an 8-24 hour downtime period. In that case, what sort of solution should we look at?
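As a back-of-envelope example (the ingest rate is assumed, not our measured figure): at 50 GB/day flowing through the HFs, a 24-hour outage means roughly 50 GB of buffered data, which is far beyond what an in-memory maxQueueSize can reasonably hold. For inputs that land directly on the HF over raw tcp/udp, scripted inputs, or HEC, Splunk supports disk-backed persistent queues that can cover a window like that; they are not supported for splunktcp traffic arriving from UFs. A sketch assuming a syslog-style TCP input on port 514:

```
# inputs.conf on the HF -- sketch; port and sizes are illustrative.
# Persistent queues apply to tcp/udp/scripted/HEC inputs, NOT to
# splunktcp data coming from UFs.
[tcp://:514]
# small in-memory queue in front of the persistent queue
queueSize = 10MB
# disk-backed queue that survives the outage; size it to the outage window
persistentQueueSize = 50GB
```

For the UF-to-HF path itself, the safer lever is the back-pressure behavior described above: let the HF block, let the UFs pause, and make sure log retention/rotation on the source hosts comfortably exceeds the 24-hour window so the UFs can still resume from their last read position once the link is back.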