How to avoid data loss at indexer restart?

kaurinko
Communicator

Hi!

I have an environment with a single combined indexer and search head and approximately 100 UFs sending logs to it. Some of them are quite active log producers. I have realized that running splunk restart to activate a config change usually causes data loss for a period of 1-2 minutes for some of the sources, which is something I would rather avoid. It should be doable by stopping the UFs first, then restarting the indexer and finally starting the UFs again.

Is there another way?

I figure the data originating from log files monitored by UFs gets sent to the indexer, but the poor UFs don't know the indexer is unable to process the last bits, so the UF thinks the data was properly received and processed.

I read about Persistent Queues, but that didn't quite seem to solve this issue either.

Any suggestions?


richgalloway
SplunkTrust

The UF should buffer events internally until the indexer is back up, but the data could exceed that buffer if the volume is high enough. Persistent queues would help. How are they not working for you?

Consider turning on indexer acknowledgment so the UF knows the data has been received.

A better option would be to set up a second indexer so the UFs always have an indexer to send to.
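
As a rough sketch of the second-indexer option, the forwarders' outputs.conf could list both indexers in one target group. The group name and host names below are hypothetical, and 9997 is only the conventional receiving port, so adjust everything to the actual environment. With more than one server in the list, the forwarder load-balances between them, so each event goes to one indexer or the other instead of being duplicated on both, and the search head needs both indexers as search peers.

```
# outputs.conf on each universal forwarder (sketch, names are assumptions)
[tcpout]
defaultGroup = lb_indexers

[tcpout:lb_indexers]
# Hypothetical hosts; the forwarder switches between the listed servers
server = idx1.example.com:9997, idx2.example.com:9997
```

While one indexer restarts, the forwarders should be able to keep sending to the other one.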


kaurinko
Communicator

Hi,

We just started using useACK=true on all the essential UFs, and it seems the problem is solved. In any case, possibly receiving some events twice is a lesser problem than not receiving them at all.
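
For anyone landing here later, a minimal sketch of where the useACK=true setting goes: it belongs in outputs.conf on the forwarder, inside the tcpout target group. The group name and the indexer host below are hypothetical placeholders, and 9997 is just the conventional receiving port, not something taken from this environment.

```
# outputs.conf on each universal forwarder (sketch, names are assumptions)
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
# Hypothetical indexer host and port
server = indexer.example.com:9997
# Hold sent data in the forwarder's wait queue until the indexer acknowledges it
useACK = true
```

The forwarder needs a restart to pick up the change. If an acknowledgment does not arrive, for example because the indexer restarted, the forwarder resends the data, which is where the occasional duplicate events come from.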

Thanks everybody for your help!

kaurinko
Communicator

Hi @richgalloway ,

Thanks for your reply!

I didn't try persistent queues. I just read what the linked page says: persistent queues are not available for monitor inputs, which tends to rule this solution out, as all my inputs are monitor inputs.

I also thought about acknowledgments, but then I found a thread discussing this option, and somebody mentioned that the problem with this approach is that some data gets sent twice. I can't seem to find that thread right now to give a link. Anyway, receiving the same data twice is not exactly what I would like either.

The option of a second indexer may be the way to go, even though I am sure that will give us a lot of new headaches. I have no experience with such an environment, and I guess we would need to have all the data duplicated on both of the indexers. The current system is really an all-in-one, and the performance has been sufficient. It is just this problem with maintenance reboots giving us a hard time.

I will have to collect some ideas and then decide what to do.

Best regards,

Petri

gcusello
SplunkTrust

Hi @kaurinko,

When the Indexer is unavailable for some reason, Universal Forwarders keep a local cache of the data, so you don't lose any data while the Indexer is down.

When the Indexer is available again, the Forwarders send the cached data to it.

It's different if you have syslog inputs: in that case you have to implement a redundant architecture with two HFs and a Load Balancer to handle any faults.

Ciao.

Giuseppe

kaurinko
Communicator

Hi @gcusello ,

Thanks for your reply!

The behaviour you described for UFs during the sudden disappearance of an indexer is just what I had expected, but for some reason it doesn't seem to work that way. I haven't worked with syslog inputs that much, mostly monitoring log files. One would assume the UF would realize when the indexer stops listening and continue when it is back up again, but that is not happening. As @richgalloway suggested, I might try acks to increase the reliability of the data transfer, but I was hoping there was a neat way of signaling to the UFs that the indexer is going down, please wait for a short while, or something. With monitored file inputs, that should be easy.

Best regards,

Petri

gcusello
SplunkTrust

Hi @kaurinko,

In my experience (11 years), the local cache has always worked; I haven't experienced data loss during a short Indexer outage.

It's a different matter if your Indexer is unavailable for many days, but not for a few minutes or even a few hours.

Open a case with Splunk Support if you have encountered a condition where the local cache doesn't work; it could be a bug.

Ciao.

Giuseppe

kaurinko
Communicator

Hi @gcusello ,

Our experiences are similar to yours. We have been running our Splunk installation for 9-10 years now, and only recently have we noticed these problems. In fact, it used to work so reliably that we never had to bother about the survival of the UFs. Of course, in the very beginning the installation was very small with only a few UFs, but the current volume has been around for something like 5-6 years now. Yet only recently have we discovered this loss of data, even from low-volume data streams.

I could take a look at the history to see whether I can pinpoint a restart sequence without loss of data. The problem is that I most probably do not have splunkd.log stored for a sufficiently long time. My gut feeling is we didn't have these problems with Splunk v. 8, but we upgraded to v. 9 in June. In any case, my confidence in Splunk support actually doing anything useful about this is zero, even if I could come to the conclusion that something really has changed in the behaviour.

I'll see what I can find in the logs, and just maybe I'll file a ticket. Right now I don't have a case Splunk support would bother to spend any time on.

Best regards,

Petri
