If a Dynatrace Managed cluster upgrade fails, will it roll back automatically?
Or is there any other recovery process?
Our customer is planning to provide Dynatrace Managed to their customers as a service.
So they want to minimize Dynatrace's downtime as much as possible.
Answer by Radoslaw S. ·
Copy of answer from :
During an upgrade, all nodes are shut down while the binaries are upgraded, so downtime is expected. It works like this:
1. We shut down all nodes.
2. We upgrade them one by one.
3. Once the upgrade is done, we start all nodes one by one.
As mentioned, it usually takes about 10 minutes until the first node starts. Under normal conditions, the whole operation can take up to 30 minutes. This strongly depends on disk speed, network operations, and load.
To answer your question directly: yes, downtime is expected even for a multi-node cluster.
FYI: With version 148 we will start supporting rolling upgrades, which means zero downtime, because only one node is upgraded at a time. The Mission Control team will decide whether an upgrade is performed in rolling fashion or as a full upgrade with downtime. You'll be informed of the mode by e-mail notification before the upgrade.
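To make the difference between the two modes concrete, here is a minimal sketch in Python. It is only a toy model of the steps described above, not Dynatrace code: the function names and node names are made up for illustration, and it simply counts how many nodes are running after each step.

```python
from typing import List


def full_upgrade(nodes: List[str]) -> List[int]:
    """Running-node count after each step of a full upgrade (toy model)."""
    running = set(nodes)
    timeline = []
    # 1. Shut down all nodes.
    running.clear()
    timeline.append(len(running))
    # 2. Upgrade the nodes one by one while the whole cluster is down.
    for _ in nodes:
        timeline.append(len(running))
    # 3. Start the nodes one by one.
    for node in nodes:
        running.add(node)
        timeline.append(len(running))
    return timeline


def rolling_upgrade(nodes: List[str]) -> List[int]:
    """Running-node count after each step of a rolling upgrade (toy model)."""
    running = set(nodes)
    timeline = []
    for node in nodes:
        running.discard(node)  # stop and upgrade only this node
        timeline.append(len(running))
        running.add(node)      # start it again before touching the next one
        timeline.append(len(running))
    return timeline


def has_downtime(timeline: List[int]) -> bool:
    """The cluster is considered down whenever no node is running."""
    return min(timeline) == 0


nodes = ["node-1", "node-2", "node-3"]  # hypothetical node names
print(has_downtime(full_upgrade(nodes)))     # full upgrade
print(has_downtime(rolling_upgrade(nodes)))  # rolling upgrade
```

In this model a three-node cluster always keeps two nodes running during a rolling upgrade, while a full upgrade passes through a state with zero running nodes. A single-node installation would see downtime in either mode.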
Regarding recovery, it works as follows:
1. Whenever possible, we try to roll back to the previous version.
2. If that's not possible, or an unexpected crash happens (worst-case scenario), a restore from backup is possible.
3. Even if the upgrade fails, data is kept.
4. The Mission Control team proactively inspects all upgrade processes and acts on each failure to bring the cluster back up as soon as possible.
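The order of the recovery steps above can be sketched as a simple fallback chain. This is an illustration only, assuming three hypothetical callables (`upgrade`, `roll_back`, `restore_from_backup`) that raise an exception on failure. It does not represent any real Dynatrace API, and in practice the Mission Control team drives these decisions.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    UPGRADED = "upgraded"
    ROLLED_BACK = "rolled back to previous version"
    RESTORED_FROM_BACKUP = "restored from backup"


def upgrade_with_recovery(
    upgrade: Callable[[], None],
    roll_back: Callable[[], None],
    restore_from_backup: Callable[[], None],
) -> Outcome:
    """Try the upgrade, then fall back in the order described in the answer."""
    try:
        upgrade()
        return Outcome.UPGRADED
    except Exception:
        pass  # upgrade failed, move on to recovery
    try:
        # First choice: go back to the previous version.
        roll_back()
        return Outcome.ROLLED_BACK
    except Exception:
        # Worst case: roll-back is not possible, restore from backup.
        restore_from_backup()
        return Outcome.RESTORED_FROM_BACKUP


# Hypothetical stand-ins for the real operations.
def succeed() -> None:
    pass


def fail() -> None:
    raise RuntimeError("step failed")


print(upgrade_with_recovery(fail, succeed, succeed).value)
print(upgrade_with_recovery(fail, fail, succeed).value)
```

The sketch deliberately lets a failing `restore_from_backup` raise to the caller, since at that point an operator has to step in manually.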
Can I answer any other questions?
Answer by Gautier B. ·
It really depends on the failure. In most cases (90%), because you are using a cluster of at least 3 nodes, the end user doesn't see any trouble. It's rare that the upgrade fails on every node on the same element (Cassandra, Elasticsearch etc.) at the same time.
Answer by 野愛 小. ·
Thank you for your response.
But I cannot find any recovery plan in your post.
> It's rare that the upgrade fails on every node on the same element (Cassandra, Elasticsearch etc.) at the same time.
I understand that it is a very rare case, but our customer still wants to account for it.
Could you tell me whether you have a recovery plan for when the cluster crashes because of the update?
I think all of the cluster nodes are updated at the same time.
I got the answer about that in the following post: https://answers.dynatrace.com/spaces/482/dynatrac...
I think that if the update file contains something wrong, it could stop all of the nodes.