To gracefully shutdown/restart the Trilio cluster the following steps are recommended.
It is recommended to verify that no snapshots or restores are running on the Trilio Cluster.
Stopping or restarting the Trilio cluster will cancel all running actively running backup or restore jobs. These jobs will be marked as errored after the system has come up again.
This can be verified using the following two commands:
The Trilio cluster is using the pacemaker service for setting the VIP(s) of the cluster and controlling the active node for the wlm-cron service. The identified node will be the last to shut down in case that the whole cluster gets shut down.
This can be checked using the following command:
In the following example is the master node the tvm1
A single node in the cluster can be shut down or restarted without issues. All services will come up and the RabbitMQ and Galeera service will rejoin the remaining cluster.
When the master node gets shutdown or restarted the VIP(s) and the wlm-cron service will switch to one of the remaining cluster nodes.
To speed up the shutdown/restart process it is recommended to stop the Trilio services, the RabbitMQ service, and the MariaDB service on the node.
The wlm-cron service and the VIP(s) are not getting stopped when only the master node gets rebooted or shut down. The pacemaker will automatically move the wlm-cron service and the VIP(s) to one of the remaining nodes.
After the services have been stopped the node can be restarted or shut down using standard Linux commands.
Restarting the whole cluster node by node follows the same procedure as restarting a single node, with the difference that each restarted node needs to be fully started again before the next node can be restarted.
When the complete cluster needs to get stopped and restarted at the same time the following procedure needs to be completed.
The procedure on a high level is:
Shutdown the two slave nodes
Shutdown the master node
Start the master node
Enable the Galeera cluster
Start the two slave nodes
Before shutting down the two slave nodes it is recommended to stop running Trilio services, the RabbitMQ server, and the MariaDB on the nodes.
Afterward, the nodes can be shut down.
Before shutting down the master node it is recommended to stop running Trilio services, the RabbitmQ server, the MariaDB, the wlm-cron and the VIP(s) resource in Pacemaker.
Afterward, the node can be shut down.
The first server that is getting booted will be the master node. It is highly recommended that the old master node will be booted first again.
Not booting the old mater node first again can lead to data loss when the Galeera Cluster is restarted.
Login into the freshly started master node and run the following command. This will restart the Galeera cluster with this node as master.
After the master node has been booted and the Galeera cluster started the remaining nodes can be started and will automatically rejoin the Trilio cluster.