Shutdown/Restart the Trilio cluster
To gracefully shut down or restart the Trilio cluster, the following steps are recommended.
Verify no snapshots or restores are running
It is recommended to verify that no snapshots or restores are running on the Trilio Cluster.
Stopping or restarting the Trilio cluster will cancel all actively running backup or restore jobs. These jobs will be marked as errored after the system has come up again.
This can be verified using the following two commands:
workloadmgr snapshot-list --all=True
workloadmgr restore-list
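As a rough pre-check, the output of both commands can be scanned for in-progress states. The following is only a sketch under assumptions: the helper name `has_active_jobs` is hypothetical, and the status words it matches are guesses that should be adjusted to the status values your Trilio release actually prints.

```shell
#!/bin/sh
# Hypothetical helper: reads a job listing on stdin and succeeds (exit 0)
# if any line contains one of the given in-progress status words.
# ASSUMPTION: the default status words below are not taken from the Trilio
# documentation; pass your own pattern as the first argument if they differ.
has_active_jobs() {
    pattern="${1:-executing|uploading|restoring|running}"
    grep -Eiq "$pattern"
}

# Example usage (run on a Trilio node):
#   workloadmgr snapshot-list --all=True | has_active_jobs && echo "snapshots still running"
#   workloadmgr restore-list | has_active_jobs && echo "restores still running"
```

Because the match is purely textual, a manual look at the listing remains the more reliable check.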
Identify the master node for the VIP(s) and wlm-cron service
The Trilio cluster uses the pacemaker service to set the VIP(s) of the cluster and to control which node is active for the wlm-cron service. The node identified here as the master will be the last one to shut down if the whole cluster is shut down.
This can be checked using the following command:
pcs status
In the following example, the master node is tvm1:
pcs status
Cluster name: triliovault
WARNINGS:
Corosync and pacemaker node names do not match (IPs used in setup?)
Stack: corosync
Current DC: tvm3 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Thu Aug 26 12:10:32 2021
Last change: Thu Aug 26 08:02:51 2021 by root via crm_resource on tvm1
3 nodes configured
8 resource instances configured
Online: [ tvm1 tvm2 tvm3 ]
Full list of resources:
virtual_ip (ocf::heartbeat:IPaddr2): Started tvm1
virtual_ip_public (ocf::heartbeat:IPaddr2): Started tvm1
virtual_ip_admin (ocf::heartbeat:IPaddr2): Started tvm1
virtual_ip_internal (ocf::heartbeat:IPaddr2): Started tvm1
wlm-cron (systemd:wlm-cron): Started tvm1
Clone Set: lb_nginx-clone [lb_nginx]
Started: [ tvm1 ]
Stopped: [ tvm2 tvm3 ]
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
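The master node can also be read from the output programmatically. The following is a minimal sketch, assuming the resource line looks like the one in the example above; the helper name `wlm_cron_node` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `pcs status` output on stdin and prints the node
# on which the wlm-cron resource is reported as "Started".
# ASSUMPTION: the resource line ends with "Started <node>", as in the example.
wlm_cron_node() {
    awk 'index($0, "wlm-cron") && $(NF-1) == "Started" { print $NF }'
}

# Example usage (run on a Trilio node):
#   pcs status | wlm_cron_node
```

If the function prints nothing, the wlm-cron resource is not reported as started on any node and the output should be inspected manually.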
Shutdown/Restart of a single node in the cluster
A single node in the cluster can be shut down or restarted without issues. All services will come up again, and the RabbitMQ and Galera services will rejoin the remaining cluster.
When the master node is shut down or restarted, the VIP(s) and the wlm-cron service will switch to one of the remaining cluster nodes.
Stop the services on the node
To speed up the shutdown/restart process, it is recommended to stop the Trilio services, the MariaDB service, and the RabbitMQ service on the node first.
systemctl stop wlm-api
systemctl stop wlm-scheduler
systemctl stop wlm-workloads
systemctl stop mysqld
rabbitmqctl stop
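The stop commands above can be wrapped in a small function so that they always run in the same order. This is only a sketch: the function name `stop_node_services` is hypothetical, and aborting on the first failure is a design choice made here, not something the procedure requires.

```shell
#!/bin/sh
# Hypothetical wrapper around the stop commands listed above.
# Stops the services in the documented order and returns non-zero as soon as
# one of the stop commands fails, so the problem can be inspected first.
stop_node_services() {
    for svc in wlm-api wlm-scheduler wlm-workloads mysqld; do
        echo "Stopping $svc"
        systemctl stop "$svc" || return 1
    done
    echo "Stopping RabbitMQ"
    rabbitmqctl stop || return 1
}

# Example usage (run as root on the node to be restarted):
#   stop_node_services && reboot
```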
Shutdown/Restart the node
After the services have been stopped, the node can be restarted or shut down using standard Linux commands.
reboot
shutdown
Restarting the complete cluster node by node
Restarting the whole cluster node by node follows the same procedure as restarting a single node. The only difference is that each restarted node must be fully up again before the next node is restarted.
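One way to wait for a restarted node is to poll `pcs status` until the node appears in the `Online:` line again. The following is a sketch under assumptions: the helper names `node_is_online` and `wait_for_node` are hypothetical, as are the 10 second poll interval and the 600 second default timeout. Pacemaker membership is only a proxy for "fully started"; the RabbitMQ and Galera services still need to have rejoined.

```shell
#!/bin/sh
# Hypothetical helper: reads `pcs status` output on stdin and succeeds if the
# given node name appears in the "Online: [ ... ]" line.
node_is_online() {
    wanted="$1"
    awk -v wanted="$wanted" '
        /Online:/ { for (i = 1; i <= NF; i++) if ($i == wanted) found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Hypothetical helper: polls `pcs status` every 10 seconds until the node is
# online again. Returns 1 if the timeout (in seconds, default 600) expires.
wait_for_node() {
    target="$1"
    timeout="${2:-600}"
    waited=0
    until pcs status 2>/dev/null | node_is_online "$target"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 10
        waited=$((waited + 10))
    done
}

# Example usage (run on a node that stays up):
#   wait_for_node tvm2 && echo "tvm2 is back, continue with the next node"
```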
Shutdown/Restart the complete cluster as a whole
When the complete cluster needs to be stopped and restarted at the same time, the following procedure needs to be completed.
The procedure on a high level is:
Shutdown the two slave nodes
Shutdown the master node
Start the master node
Enable the Galera cluster
Start the two slave nodes
Shutdown the two slave nodes
Before shutting down the two slave nodes, it is recommended to stop the running Trilio services, the MariaDB service, and the RabbitMQ server on the nodes.
systemctl stop wlm-api
systemctl stop wlm-scheduler
systemctl stop wlm-workloads
systemctl stop mysqld
rabbitmqctl stop
Afterward, the nodes can be shut down.
shutdown
Shutdown the master node
Before shutting down the master node, it is recommended to stop the running Trilio services, the MariaDB service, the RabbitMQ server, the wlm-cron resource, and the VIP(s) resource in Pacemaker.
systemctl stop wlm-api
systemctl stop wlm-scheduler
systemctl stop wlm-workloads
systemctl stop mysqld
rabbitmqctl stop
pcs resource stop wlm-cron
pcs resource stop lb_nginx-clone
Afterward, the node can be shut down.
shutdown
Start the master node
The first server that is booted will become the master node. It is highly recommended to boot the old master node first.
Enable the Galera cluster
Log in to the freshly started master node and run the following command. This will restart the Galera cluster with this node as master.
galera_new_cluster
Start the slave nodes
After the master node has been booted and the Galera cluster has been started, the remaining nodes can be started. They will automatically rejoin the Trilio cluster.
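Whether all nodes have rejoined the Galera cluster can be checked through the `wsrep_cluster_size` status variable. The following is a minimal sketch: the helper name `galera_size_is` is hypothetical, and it assumes that the `mysql` client on the node can connect without further options (add credentials as needed).

```shell
#!/bin/sh
# Hypothetical helper: reads the output of
#   mysql -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_size'"
# on stdin and succeeds if the reported size equals the expected node count.
galera_size_is() {
    expected="$1"
    awk -v expected="$expected" '
        $1 == "wsrep_cluster_size" { size = $2 }
        END { exit (size == expected) ? 0 : 1 }
    '
}

# Example usage (run on any cluster node, here for a three-node cluster):
#   mysql -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_size'" | galera_size_is 3 \
#       && echo "all Galera nodes have rejoined"
```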