Shutdown/Restart the Trilio cluster

To gracefully shut down or restart the Trilio cluster, the following steps are recommended.

Verify no snapshots or restores are running

It is recommended to verify that no snapshots or restores are running on the Trilio Cluster.

Stopping or restarting the Trilio cluster cancels all actively running backup and restore jobs. These jobs are marked as errored once the system has come up again.

Running snapshots and restores can be listed using the following two commands:

workloadmgr snapshot-list --all=True
workloadmgr restore-list
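
When the lists are long, a small filter can help spot jobs that are still in progress. The following is a minimal sketch, not part of the Trilio tooling: the helper name count_active_jobs is hypothetical, and the status words passed to it (for example "executing" or "uploading") are assumptions that should be replaced with the in-progress states your workloadmgr version actually prints.

```shell
# Count the lines on stdin that mention one of the given status words.
# $1 is an extended regular expression, matched case-insensitively.
# The status words are supplied by the caller; none are hard-coded here
# because the exact workloadmgr status names are an assumption.
count_active_jobs() {
    pattern="$1"
    # grep -c prints the number of matching lines; it exits non-zero
    # when that number is 0, so "|| true" keeps the function successful.
    grep -Eic -- "$pattern" || true
}

# Usage (status words are assumptions):
#   workloadmgr snapshot-list --all=True | count_active_jobs 'executing|uploading'
#   workloadmgr restore-list | count_active_jobs 'executing|restoring'
```

A result of 0 from both commands suggests that nothing is in progress, provided the pattern really covers every in-progress state.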

Identify the master node for the VIP(s) and wlm-cron service

The Trilio cluster uses the Pacemaker service to set the VIP(s) of the cluster and to control which node is active for the wlm-cron service. The node that currently holds the VIP(s) and runs the wlm-cron service is the master node. When the whole cluster is shut down, this node is the last one to shut down.

The master node can be identified using the following command:

pcs status

In the following example, the master node is tvm1:

pcs status
Cluster name: triliovault

WARNINGS:
Corosync and pacemaker node names do not match (IPs used in setup?)

Stack: corosync
Current DC: tvm3 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Thu Aug 26 12:10:32 2021
Last change: Thu Aug 26 08:02:51 2021 by root via crm_resource on tvm1

3 nodes configured
8 resource instances configured

Online: [ tvm1 tvm2 tvm3 ]

Full list of resources:

 virtual_ip     (ocf::heartbeat:IPaddr2):       Started tvm1
 virtual_ip_public      (ocf::heartbeat:IPaddr2):       Started tvm1
 virtual_ip_admin       (ocf::heartbeat:IPaddr2):       Started tvm1
 virtual_ip_internal    (ocf::heartbeat:IPaddr2):       Started tvm1
 wlm-cron       (systemd:wlm-cron):     Started tvm1
 Clone Set: lb_nginx-clone [lb_nginx]
     Started: [ tvm1 ]
     Stopped: [ tvm2 tvm3 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
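
The master node can also be extracted from this output with a short filter. The sketch below assumes that the pcs status output has the same layout as the example above (resource name in the first column, "Started <node>" at the end of the line); other pcs versions may format the resource list differently. The function name wlm_cron_node is hypothetical.

```shell
# Print the node on which the wlm-cron resource is started.
# Reads `pcs status` output on stdin and expects resource lines of the form
#   wlm-cron  (systemd:wlm-cron):  Started tvm1
wlm_cron_node() {
    awk '$1 == "wlm-cron" && $(NF-1) == "Started" { print $NF }'
}

# Usage:
#   pcs status | wlm_cron_node
```

If the function prints nothing, the wlm-cron resource is not started on any node or the output layout differs from the one assumed here.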

Shutdown/Restart of a single node in the cluster

A single node in the cluster can be shut down or restarted without issues. All services will come up again, and the RabbitMQ and Galera services will rejoin the remaining cluster.

When the master node is shut down or restarted, the VIP(s) and the wlm-cron service switch to one of the remaining cluster nodes.

Stop the services on the node

To speed up the shutdown/restart process, it is recommended to stop the Trilio services, the RabbitMQ service, and the MariaDB service on the node.

The wlm-cron service and the VIP(s) do not need to be stopped when only the master node is rebooted or shut down. Pacemaker automatically moves the wlm-cron service and the VIP(s) to one of the remaining nodes.

systemctl stop wlm-api
systemctl stop wlm-scheduler
systemctl stop wlm-workloads
systemctl stop mysqld
rabbitmqctl stop
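
The same commands can be wrapped in a small function that stops at the first failure. This is a sketch only: the function name stop_node_services and the DRY_RUN variable are hypothetical additions, while the commands themselves are exactly the ones listed above.

```shell
# Stop the Trilio services, MariaDB and RabbitMQ in the order listed above.
# With DRY_RUN=1 the commands are only printed, not executed.
stop_node_services() {
    for cmd in \
        "systemctl stop wlm-api" \
        "systemctl stop wlm-scheduler" \
        "systemctl stop wlm-workloads" \
        "systemctl stop mysqld" \
        "rabbitmqctl stop"
    do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$cmd"
        else
            # $cmd is intentionally unquoted so it splits into command and arguments.
            $cmd || { echo "failed: $cmd" >&2; return 1; }
        fi
    done
}

# Usage:
#   DRY_RUN=1 stop_node_services   # show what would be run
#   stop_node_services             # actually stop the services
```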

Shutdown/Restart the node

After the services have been stopped, the node can be restarted or shut down using the standard Linux commands.

reboot
shutdown

Restarting the complete cluster node by node

Restarting the whole cluster node by node follows the same procedure as restarting a single node, with the difference that each restarted node needs to be fully started again before the next node can be restarted.
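
One way to wait for a restarted node before moving on is to poll pcs status until the node is listed as online again. The sketch below assumes the "Online: [ ... ]" line format shown in the pcs status example earlier in this document; the function name node_is_online is hypothetical. Being online in Pacemaker does not by itself confirm that RabbitMQ and Galera have rejoined, so treat it as a first check only.

```shell
# Succeed if the given node appears in the "Online: [ ... ]" line of
# `pcs status` output read from stdin, fail otherwise.
node_is_online() {
    node="$1"
    awk -v n="$node" '
        $1 == "Online:" { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Usage (node name and polling interval are examples):
#   until pcs status | node_is_online tvm2; do sleep 10; done
```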

Shutdown/Restart the complete cluster as a whole

When the complete cluster needs to be stopped and restarted at the same time, the following procedure needs to be followed.

The procedure on a high level is:

  • Shutdown the two slave nodes

  • Shutdown the master node

  • Start the master node

  • Enable the Galera cluster

  • Start the two slave nodes
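
The order of the steps above can be written down as a plan before touching any node. The helper below only prints the sequence and does not run anything; its name, print_restart_plan, and the wording of the printed lines are hypothetical.

```shell
# Print the shutdown order followed by the start order for the cluster.
# $1 is the master node; the remaining arguments are all cluster nodes.
print_restart_plan() {
    master="$1"
    shift
    # Slave nodes are shut down first, the master node last.
    for node in "$@"; do
        [ "$node" != "$master" ] && echo "shutdown $node"
    done
    echo "shutdown $master"
    # The master node is started first and bootstraps the Galera cluster.
    echo "start $master"
    echo "galera_new_cluster on $master"
    for node in "$@"; do
        [ "$node" != "$master" ] && echo "start $node"
    done
    return 0
}

# Usage (node names taken from the pcs status example):
#   print_restart_plan tvm1 tvm1 tvm2 tvm3
```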

Shutdown the two slave nodes

Before shutting down the two slave nodes, it is recommended to stop the running Trilio services, the RabbitMQ server, and the MariaDB service on the nodes.

systemctl stop wlm-api
systemctl stop wlm-scheduler
systemctl stop wlm-workloads
systemctl stop mysqld
rabbitmqctl stop

Afterward, the nodes can be shut down.

shutdown

Shutdown the master node

Before shutting down the master node, it is recommended to stop the running Trilio services, the RabbitMQ server, the MariaDB service, and the wlm-cron and VIP(s) resources in Pacemaker.

systemctl stop wlm-api
systemctl stop wlm-scheduler
systemctl stop wlm-workloads
systemctl stop mysqld
rabbitmqctl stop
pcs resource disable wlm-cron
pcs resource disable lb_nginx-clone

Afterward, the node can be shut down.

shutdown

Start the master node

The first server that is booted becomes the master node. It is highly recommended to boot the old master node first.

Not booting the old master node first can lead to data loss when the Galera cluster is restarted.

Enable the Galera cluster

Log in to the freshly started master node and run the following command. This restarts the Galera cluster with this node as master.

galera_new_cluster
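
Before bootstrapping, it can be worth checking that Galera itself considers this node safe to bootstrap from. Galera records this in the safe_to_bootstrap field of its grastate.dat file. The sketch below reads that field; the function name is hypothetical, and the default path /var/lib/mysql/grastate.dat is an assumption based on the usual MariaDB data directory, so adjust it if the Trilio appliance uses a different one.

```shell
# Print the safe_to_bootstrap value (0 or 1) from a Galera grastate.dat file.
# $1 is the path to the file; the default path is an assumption.
safe_to_bootstrap() {
    file="${1:-/var/lib/mysql/grastate.dat}"
    awk -F': *' '$1 == "safe_to_bootstrap" { print $2 }' "$file"
}

# Usage:
#   [ "$(safe_to_bootstrap)" = "1" ] && galera_new_cluster
```

A value of 0 usually indicates that this node was not the last one to leave the cluster, which matches the warning above about booting the old master node first.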

Start the slave nodes

After the master node has been booted and the Galera cluster has been started, the remaining nodes can be started and will automatically rejoin the Trilio cluster.
