In complex environments it is sometimes necessary to restart a single service or the complete solution. Restarting the complete node on which a service is running is rarely possible, and rarely the ideal solution.
This page describes the services run by Trilio and how to restart them.
The Trilio Appliance is the controller of Trilio. Most services on the Appliance are running in a High Availability mode on a 3-node cluster.
The wlm-api service takes the API calls against the Trilio Appliance. It is running in active-active mode on all nodes of the Trilio cluster.
To restart the wlm-api service run on each Trilio node:
The wlm-scheduler service takes job requests and identifies which Trilio node should handle each request. It is running in active-active mode on all nodes of the Trilio cluster.
To restart the wlm-scheduler service run on each Trilio node:
The wlm-workloads service is the task worker of Trilio and executes all jobs given to the Trilio node. It is running in active-active mode on all nodes of the Trilio cluster.
To restart the wlm-workloads service run on each Trilio node:
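The restart of the three active-active services above can be sketched as follows. This is a minimal sketch, assuming the services are managed as systemd units named exactly like the services (`wlm-api`, `wlm-scheduler`, `wlm-workloads`). The function name `restart_wlm_services` and the `RUNNER` dry-run variable are hypothetical helpers and not part of Trilio.

```shell
# Sketch only: assumes the services are systemd units named like the services.
# RUNNER is a hypothetical dry-run hook: set RUNNER=echo to print the commands
# instead of executing them.
restart_wlm_services() {
    for svc in wlm-api wlm-scheduler wlm-workloads; do
        ${RUNNER:-} systemctl restart "$svc" || return 1
    done
}

# Usage on each Trilio node (as root):
#   restart_wlm_services
```

Running it with `RUNNER=echo` first shows which commands would be issued without touching any service.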
The wlm-cron service is responsible for starting scheduled backups according to the configuration of Tenant Workloads. It is running in active-passive mode and is controlled by the pacemaker cluster.
To restart the wlm-cron service run on the Trilio node with VIP assigned:
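Because the service is controlled by pacemaker, the restart goes through the cluster manager rather than systemd. The sketch below assumes the `pcs` command line tool is available and that the pacemaker resource is named `wlm-cron`; the resource name is an assumption and should be confirmed with `pcs status`. The helper name and the `RUNNER` dry-run variable are hypothetical.

```shell
# Sketch only: the resource name must match what `pcs status` reports.
# RUNNER is a hypothetical dry-run hook: set RUNNER=echo to print the command.
restart_pacemaker_resource() {
    resource=${1:?usage: restart_pacemaker_resource RESOURCE}
    ${RUNNER:-} pcs resource restart "$resource"
}

# Usage on the Trilio node that currently holds the VIP:
#   pcs status
#   restart_pacemaker_resource wlm-cron
```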
The Trilio appliance is running 1 to 4 virtual IPs on the Trilio cluster. These are controlled by the pacemaker cluster and provided through NGINX.
To restart these resources, restart the pacemaker NGINX resource:
The Trilio cluster is using RabbitMQ as messaging service. It is running in active-active mode on all nodes of the Trilio cluster.
RabbitMQ is a complex system in itself. This guide will only provide the basic commands to do a restart of a node and check the health of the cluster afterward. For complete documentation of how to restart RabbitMQ, please follow the official RabbitMQ documentation.
To restart a RabbitMQ node run on each Trilio node:
It is recommended to wait for the node to rejoin and sync with the cluster before restarting another RabbitMQ node.
When the complete cluster is stopped and restarted, it is important to keep the order of the nodes in mind: the last node to be stopped must be the first node to be started.
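The ordering rule can be made explicit with a small helper that reverses the recorded stop order. This is a sketch only: the function name and the node names `tvm1` to `tvm3` are hypothetical, and the commented commands assume the standard `rabbitmq-server` systemd unit and the `rabbitmqctl` tool.

```shell
# Print the start order for a full cluster restart: the reverse of the order in
# which the nodes were stopped (last stopped, first started).
rabbitmq_start_order() {
    order=""
    for node in "$@"; do
        order="$node${order:+ $order}"
    done
    echo "$order"
}

# Rolling restart of a single node (assumed unit name: rabbitmq-server),
# followed by a health check before moving on to the next node:
#   systemctl restart rabbitmq-server
#   rabbitmqctl cluster_status
#
# Full restart, nodes were stopped in the order tvm1, tvm2, tvm3:
#   rabbitmq_start_order tvm1 tvm2 tvm3
```

For the stop order `tvm1 tvm2 tvm3` the helper prints `tvm3 tvm2 tvm1`, so `tvm3` is started first.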
The Galera Cluster is managing the Trilio MariaDB database. It is running in active-active mode on all nodes of the Trilio cluster.
Galera Cluster is a complex system in itself. This guide will only provide the basic commands to do a restart of a node and check the health of the cluster afterward. For complete documentation of how to restart Galera clusters, please follow the official Galera documentation.
When restarting Galera two different use-cases need to be considered:
Restarting a single node
Restarting the whole cluster
A single node can be restarted without any issues. It will automatically rejoin the cluster and sync against the remaining nodes.
The following commands will gracefully stop and restart the mysqld service.
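A minimal sketch of that stop and start sequence, assuming the database runs as a systemd unit named `mysqld`, the name this guide uses for the service. The function name and the `RUNNER` dry-run variable are hypothetical.

```shell
# Sketch only: assumes the MariaDB/Galera service is the systemd unit "mysqld".
# RUNNER is a hypothetical dry-run hook: set RUNNER=echo to print the commands.
restart_galera_node() {
    # Stop first and only start again if the stop succeeded.
    ${RUNNER:-} systemctl stop mysqld || return 1
    ${RUNNER:-} systemctl start mysqld
}

# Usage on the single node to be restarted:
#   restart_galera_node
```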
After a restart, the cluster starts the syncing process. Do not restart node after node to achieve a complete cluster restart.
Check the cluster health after the restart.
Restarting a complete cluster requires some additional steps, as the Galera cluster is basically destroyed once all nodes have been shut down. It needs to be rebuilt afterwards.
First gracefully shut down the Galera cluster on all nodes:
The second step is to identify the Galera node with the latest dataset. This can be achieved by reading the grastate.dat file on the Trilio nodes.
When this documentation is followed, the last mysqld service that was shut down will be the one with the latest dataset.
The value to check is the seqno.
The node with the highest seqno is the node that contains the latest data. This node will also contain safe_to_bootstrap: 1 to indicate that the Galera cluster can be rebuilt from this node.
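The comparison of seqno values can be sketched with two small helpers. This is an illustration under assumptions: the file is expected at `/var/lib/mysql/grastate.dat` (the default MariaDB data directory, which may differ on a given installation), and the function names and node names are hypothetical.

```shell
# Print the value of one key (for example seqno or safe_to_bootstrap) from a
# grastate.dat file. $1 = path to the file, $2 = key name.
grastate_value() {
    awk -v key="$2" -F': *' '$1 == key { print $2 }' "$1"
}

# Read "node seqno" pairs on stdin (one pair per line) and print the node with
# the highest seqno.
highest_seqno_node() {
    awk 'NR == 1 || $2 + 0 > max { max = $2 + 0; node = $1 } END { print node }'
}

# Usage on each Trilio node (assumed default path):
#   grastate_value /var/lib/mysql/grastate.dat seqno
#   grastate_value /var/lib/mysql/grastate.dat safe_to_bootstrap
#
# Then compare the collected values, for example:
#   printf 'tvm1 210\ntvm2 213\ntvm3 -1\n' | highest_seqno_node
```

A seqno of `-1` usually indicates that the node was not shut down cleanly, so such a node never wins the comparison against a node with a regular sequence number.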
On the identified node, the new cluster is created with the following command:
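The command in question is `galera_new_cluster`. Since running it on the wrong node is destructive, one option is to wrap it in a guard that first checks the state file. The wrapper below is a hedged sketch, not part of Trilio or Galera: the function name, the `RUNNER` dry-run variable, and the default path `/var/lib/mysql/grastate.dat` are assumptions.

```shell
# Sketch only: refuses to bootstrap unless grastate.dat marks the node as safe.
# $1 = optional path to grastate.dat (assumed default shown below).
# RUNNER is a hypothetical dry-run hook: set RUNNER=echo to print the command.
bootstrap_galera_if_safe() {
    state_file=${1:-/var/lib/mysql/grastate.dat}
    if grep -Eq '^safe_to_bootstrap:[[:space:]]*1[[:space:]]*$' "$state_file"; then
        ${RUNNER:-} galera_new_cluster
    else
        echo "refusing to bootstrap: $state_file does not contain safe_to_bootstrap: 1" >&2
        return 1
    fi
}

# Usage on the node with the highest seqno:
#   bootstrap_galera_if_safe
```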
Running galera_new_cluster on the wrong node will lead to data loss as this command will set the node the command is issued on as the first node of the cluster. All nodes which join afterward will sync against the data of this first node.
After the command has been issued, the mysqld service is running on this node. Now the other nodes can be restarted one by one. The started nodes will automatically rejoin the cluster and sync against the master node. Once a node has reached the synced status, it is a primary node in the cluster.
Check the Cluster health after all services are up again.
Verify the cluster health by running the following commands inside each Trilio MariaDB. The values returned from these statements have to be the same for each node.
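The exact statements are not reproduced here. As a hedged sketch, the standard Galera status variables `wsrep_cluster_size`, `wsrep_cluster_state_uuid`, and `wsrep_cluster_status` are typical candidates for such a comparison, since they should report identical values on every healthy node. The helper names are hypothetical, and connection options for the `mysql` client are left out.

```shell
# Print the SQL statement that reads one Galera status variable.
wsrep_query() {
    printf "SHOW GLOBAL STATUS LIKE '%s';\n" "$1"
}

# Succeed only if all given values are identical (one value per node).
all_nodes_agree() {
    first=$1
    for value in "$@"; do
        [ "$value" = "$first" ] || return 1
    done
    return 0
}

# Usage on each Trilio node (credentials omitted):
#   mysql -e "$(wsrep_query wsrep_cluster_size)"
#   mysql -e "$(wsrep_query wsrep_cluster_state_uuid)"
#   mysql -e "$(wsrep_query wsrep_cluster_status)"
#
# Then compare the values collected from the nodes, for example:
#   all_nodes_agree 3 3 3 && echo "cluster size is consistent"
```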
Canonical OpenStack does not use the Trilio Appliance. In Canonical environments, the Trilio controller unit is part of the Juju deployment as the workloadmgr container.
To restart the services inside this container, issue the following commands.
On all nodes:
On a single node:
The Trilio dmapi service is running on the OpenStack controller nodes. Depending on the OpenStack distribution Trilio is installed on, different commands are issued to restart the dmapi service.
RHOSP13 is running the Trilio services as docker containers. The dmapi service can be restarted by issuing the following command on the host running the dmapi service.
RHOSP16 is running the Trilio services as Podman containers. The dmapi service can be restarted by issuing the following command on the host running the dmapi service.
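For the container-based RHOSP deployments, the restart comes down to restarting one container with the container runtime of the host. The sketch below assumes a `docker` or `podman` command line on the host. The container name is deployment specific and deliberately left as an argument; the helper name and the `RUNNER` dry-run variable are hypothetical.

```shell
# Sketch only: look up the real container name first, for example with
# `docker ps` or `podman ps`, and pass it as the second argument.
# RUNNER is a hypothetical dry-run hook: set RUNNER=echo to print the command.
restart_trilio_container() {
    runtime=${1:?usage: restart_trilio_container RUNTIME CONTAINER}
    name=${2:?usage: restart_trilio_container RUNTIME CONTAINER}
    case "$runtime" in
        docker|podman) ${RUNNER:-} "$runtime" restart "$name" ;;
        *) echo "unsupported runtime: $runtime" >&2; return 1 ;;
    esac
}

# Usage on the host running the dmapi container:
#   restart_trilio_container docker <dmapi-container-name>
#   restart_trilio_container podman <dmapi-container-name>
```

The same helper applies to the datamover container on the compute nodes once its container name is known.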
Canonical is running the Trilio services in Juju-controlled LXD containers. The dmapi service can be restarted by issuing the following command from the MAAS node.
Kolla-Ansible OpenStack is running the Trilio services as docker containers. The dmapi service can be restarted by issuing the following command on the host running the dmapi service.
Ansible OpenStack is running the Trilio services as LXD containers. The dmapi service can be restarted by issuing the following command on the host running the dmapi service.
The Trilio datamover service is running on the OpenStack compute nodes. Depending on the OpenStack distribution Trilio is installed on, different commands are issued to restart the datamover service.
RHOSP13 is running the Trilio services as docker containers. The datamover service can be restarted by issuing the following command on the compute node.
RHOSP16 is running the Trilio services as Podman containers. The datamover service can be restarted by issuing the following command on the compute node.
Canonical is running the Trilio services in Juju-controlled LXD containers. The datamover service can be restarted by issuing the following command from the MAAS node.
Kolla-Ansible OpenStack is running the Trilio services as docker containers. The datamover service can be restarted by issuing the following command on the compute node.
Ansible OpenStack is running the Trilio datamover service directly on the compute node. The datamover service can be restarted by issuing the following command on the compute node.