Hello, we had a multiday outage affecting connectivity between our UFs and IDXs. This prevented all of our UFs (roughly 5k) on our Windows servers from sending logs to Splunk. Once connectivity was restored, for reasons yet to be determined, the UFs did not backfill; they simply resumed sending current data. In other words, the forwarders apparently never registered that they were unable to send, and never paused their transmission.

As a result, we have roughly a 22-hour gap in our Windows logs, and we are trying to figure out how to get Splunk to re-ingest that data. Everything I have found on re-ingesting Windows logs talks about deleting the checkpoint file for the time period and restarting Splunk. That would work for one or a few servers, but we need to do it at scale.
It seems the options for re-ingesting past data at scale are limited to:
1. Use something like SCCM to script stopping the Splunk UF, deleting the checkpoint files, and restarting the UF (see the sketch after this list).
2. Use something like SCCM to completely uninstall the Splunk UF and reinstall it with an inputs.conf that covers the missing timeframe, accepting that we will duplicate everything after that point (see the second sketch below).
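For option 1, here is a minimal sketch of the kind of script SCCM could push. It assumes the default install path, the standard SplunkForwarder service name, and that your UF version keeps its Windows Event Log checkpoints under var\lib\splunk\modinputs\WinEventLog; verify all of that on one host before rolling it out to 5k.

```python
import shutil
import subprocess
from pathlib import Path

# Assumptions -- adjust to your deployment. On recent UF versions the
# Windows Event Log checkpoints are files under modinputs\WinEventLog.
SPLUNK_HOME = Path(r"C:\Program Files\SplunkUniversalForwarder")
CHECKPOINT_DIR = SPLUNK_HOME / "var" / "lib" / "splunk" / "modinputs" / "WinEventLog"
SERVICE_NAME = "SplunkForwarder"


def run(cmd: list[str]) -> None:
    """Run a command and fail loudly if it returns non-zero."""
    subprocess.run(cmd, check=True)


def reset_wineventlog_checkpoints() -> None:
    # Must run elevated: stopping/starting services requires admin rights.
    # 1. Stop the forwarder so it cannot rewrite checkpoints on exit.
    run(["net", "stop", SERVICE_NAME])

    # 2. Remove the per-channel checkpoint files; on next start the UF
    #    re-reads the event logs from the position dictated by inputs.conf
    #    (start_from / current_only).
    if CHECKPOINT_DIR.exists():
        shutil.rmtree(CHECKPOINT_DIR)

    # 3. Start the forwarder again.
    run(["net", "start", SERVICE_NAME])


if __name__ == "__main__":
    reset_wineventlog_checkpoints()
```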
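For option 2, note that a WinEventLog input can only start from the oldest or newest event; there is no way to target just the missing 22-hour window, which is where the duplication comes from. A rough sketch of the relevant stanza (the Security channel is just an example):

```
# inputs.conf deployed with the reinstall
[WinEventLog://Security]
disabled = 0
# Re-read from the beginning of the log rather than the current position.
start_from = oldest
# 0 = index historical events as well, not just newly arriving ones.
current_only = 0
```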
Is there another option?
Thanks
What I have found so far, though it seems like it would only work for a few servers, not 5k:
1. Good point. I should have been clearer: I was hoping someone else had gone through this and could describe, in general terms, what they had done.
3. Noted.
4. Noted.
5. Noted.
Appreciate the feedback.
It's not that easy.
Firstly, we have no idea what your configuration is (mostly your inputs and outputs are of interest here).
Secondly, there is a very good question as to why the forwarders didn't stop and wait (see the outputs.conf sketch after these points).
Thirdly, there is generally no native way to manipulate a forwarder's internal state remotely. There are some ugly hacks to do it, but I won't promote them here since it's very easy to shoot yourself in the foot that way.
Fourthly, on a normal desktop edition of Windows, 22 hours should not produce too many logs, but on a busy server, depending on your configuration, that data could already have been overwritten if you hit the event log size limit.
Fifthly, if you do still have the data, removing checkpoints would mean rereading all available events from scratch, which could overload your license and/or infrastructure (one way to soften that is sketched below).
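On that last point: if you do go the checkpoint-deletion route, you can take some pressure off the indexing tier by temporarily capping forwarder throughput in limits.conf. A rough sketch; the value is illustrative (the default cap on a UF is 256 KBps), and you would revert it once the backfill finishes:

```
# $SPLUNK_HOME/etc/system/local/limits.conf on the forwarder
[thruput]
# Cap outbound throughput while the backfill runs; the UF default is 256.
maxKBps = 512
```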
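And coming back to the second point: whether a forwarder blocks and waits or keeps moving on is largely governed by its outputs.conf, so that is where I would start looking. A sketch of the kind of stanza worth reviewing; the group name and server addresses are placeholders:

```
# $SPLUNK_HOME/etc/system/local/outputs.conf
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# With indexer acknowledgement enabled, the forwarder holds events until
# an indexer confirms receipt instead of assuming delivery succeeded.
useACK = true
```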