Deployment Architecture

Splunk indexer cluster + search head cluster upgrade 6.4.1 to 6.5: holding forwarder data in a queue

bryanwiggins
Path Finder

Hi

Environment (all Linux-based):
3x index cluster peers
1x cluster master
1x deployer/license master
3x search head cluster peers
2x heavy forwarders

Question:
I have been reading the following documentation for upgrading Splunk - http://docs.splunk.com/Documentation/Splunk/6.5.0/Indexer/Upgradeacluster - and I read from it that it is still not possible to perform rolling upgrades? A shame if that is true, as ELK is still a possible option for us and afaik you can run rolling upgrades on ELK clusters.

If it is the case that we cannot at this point do rolling upgrades on our Splunk nodes, is there a preferred approach to queuing the data coming in from the heavy forwarders?

I have been reading the following documentation relating to the forwarders' 'wait queue' - http://docs.splunk.com/Documentation/Splunk/6.5.0/Forwarding/Protectagainstlossofin-flightdata - and my thinking is that I could increase the 'readTimeout' in 'outputs.conf' to some arbitrary figure to cover the cluster upgrade process (I will test in a lab first to get the expected time). Provided we have enough disk capacity for the 'wait queue', is my thinking OK?

Also, does anybody know if Splunk is planning to allow rolling upgrades in the near future? I'm sure I wouldn't be the only one who sees this as more than desirable 🙂

Thx
Bry

masonmorales
Influencer

Rolling upgrades are currently only supported for maintenance releases (e.g. v6.4.3 -> v6.4.4). I don't know what their roadmap is for extending rolling upgrade support, but I would expect they will.

Events will queue on the forwarders automatically in memory when the indexers are unreachable. If you think they will be unreachable for an extended period of time, you may want to enable persistent queues. See: https://docs.splunk.com/Documentation/Splunk/6.5.0/Data/Usepersistentqueues
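To get a rough feel for how large a persistent queue would need to be, here is a minimal sizing sketch. Everything numeric in it is an assumption for illustration (the event rate, average event size, outage length, headroom factor, and the TCP port 9514); only `persistentQueueSize` is the real `inputs.conf` setting described in the linked doc. It assumes a raw TCP input on the heavy forwarder, and lab measurements should replace the guesses.

```python
import math


def persistent_queue_bytes(events_per_sec, avg_event_bytes, outage_minutes, headroom=1.5):
    """Estimate how many bytes an input must buffer while the indexers are down.

    headroom is a hypothetical safety factor, not a Splunk recommendation.
    """
    raw = events_per_sec * avg_event_bytes * outage_minutes * 60
    return math.ceil(raw * headroom)


def inputs_stanza(port, queue_bytes):
    """Render an inputs.conf stanza for a raw TCP input with a persistent queue."""
    size_mb = math.ceil(queue_bytes / (1024 * 1024))
    return f"[tcp://{port}]\npersistentQueueSize = {size_mb}MB\n"


# Assumed figures: 2000 events/s, 300 bytes per event, 45 minute outage.
needed = persistent_queue_bytes(2000, 300, 45)
print(inputs_stanza(9514, needed))
```

The stanza would go into `inputs.conf` on each heavy forwarder that receives the data, and the disk holding the queue needs at least that much free space.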

maciep
Champion

I'm pretty sure I submitted an enhancement request for this a few versions ago now. The fact that you have to completely take a cluster down to upgrade it really defeats the purpose of having a cluster.

That said, we still do rolling upgrades. We have devices that send thousands of events per second and we can't queue them long enough to upgrade our entire cluster. And even if we could, risking the loss of data isn't worth it.

I'd have to review our documentation, but I think we typically:

  1. upgrade the master
  2. enable maintenance mode
  3. upgrade the peers one-by-one (or many-by-many)
  4. upgrade the search head cluster (SHC) as defined in the docs
  5. upgrade the rest of the environment
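
The indexer part of the steps above (1 to 3) can be sketched as an ordered plan. This is only an illustration of the sequence, not a supported procedure: the accepted answer notes that rolling upgrades are only supported for maintenance releases. The host names are hypothetical, and the plan is just printed, nothing is executed. The `splunk` commands shown (`enable maintenance-mode`, `disable maintenance-mode`, `stop`, `start`) are standard CLI commands; the install step is left as plain text because it depends on how Splunk is packaged in your environment.

```python
def rolling_upgrade_plan(master, peers):
    """Return (host, action) pairs in the order they would be carried out."""
    plan = [
        (master, "upgrade Splunk on the master and restart it"),
        (master, "splunk enable maintenance-mode"),
    ]
    for peer in peers:
        # One peer at a time, so the remaining peers keep receiving data.
        plan.append((peer, "splunk stop"))
        plan.append((peer, "install the new Splunk version"))
        plan.append((peer, "splunk start"))
    plan.append((master, "splunk disable maintenance-mode"))
    return plan


# Hypothetical host names matching the 1 master + 3 peers layout in the question.
for host, action in rolling_upgrade_plan("cm1", ["idx1", "idx2", "idx3"]):
    print(f"{host}: {action}")
```

The search head cluster and the rest of the environment (steps 4 and 5) would follow once maintenance mode is disabled and the cluster reports healthy.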

bryanwiggins
Path Finder

Hi maciep

Thanks for your pointers - I ran this up in the lab and it seemed to work, so I shall do some more testing whilst running some data loading (ran out of time to do a deep test).

I had hoped to accept multiple answers but I was unable to. I have awarded 10 points - hopefully that is OK, as I'm not sure what the correct protocol is on this site for awarding points (happy to be informed 🙂 )

Thx
Bry


maciep
Champion

no problem, Bry. I never worried about points, just hope it works for you.

Also, we did just upgrade from 6.3.4 to 6.5.1 following this approach without issues. I'm not sure what the problems could be that require you to upgrade the entire cluster at once, but I don't think I've ever run into any of them.


bryanwiggins
Path Finder

maciep

Thank you for your update - very interesting. Maybe Splunk will make that approach formal for future version upgrades, as suggesting that the whole cluster be taken offline is not ideal (of course, once you are multi-site it is not too much of an issue).

again, thank you!
Bry


bryanwiggins
Path Finder

hi maciep

Thanks for your reply. I completely agree: being instructed to switch off a complete single-site cluster to perform an application upgrade is quite unsettling!

I'll run the steps in a lab to confirm the version upgrade, as that sounds similar to a recent process when I needed to update the OS kernels on the Splunk hosts for the 'Dirty COW' vulnerability.

I'll report back on the test results.

Again, thanks for your response!

Thx
Bry


bryanwiggins
Path Finder

slight delay in running this up in my lab - scheduled for this week though. I will update my findings.

Thx
Bry


bryanwiggins
Path Finder

hi masonmorales

thanks too for your response to my query. I like the 'persistent queues', thanks for that doc link. Similar to what I was reading but more targeted for our specific needs (tcp data inputs).

I shall run this up in the lab this week and report back on the results.

thanks again!
Bry


bryanwiggins
Path Finder

thanks masonmorales

I accepted this as the answer as it directly references how to deal with queued data on the heavies - I had tested a rolling upgrade (maciep) which seemed to work.

Thx
Bry

