Restart Trilio Services

In complex environments it is sometimes necessary to restart a single service or the complete solution. Restarting the complete node a service is running on is rarely possible, and rarely the ideal solution.

This page describes the services run by Trilio and how to restart them.

Trilio Appliance Services

The Trilio Appliance is the controller of Trilio. Most services on the Appliance are running in High Availability mode on a 3-node cluster.

wlm-api

The wlm-api service receives the API calls made against the Trilio Appliance. It is running in active-active mode on all nodes of the Trilio cluster.

To restart the wlm-api service run on each Trilio node:

systemctl restart wlm-api

wlm-scheduler

The wlm-scheduler service takes job requests and identifies which Trilio node should handle each request. It is running in active-active mode on all nodes of the Trilio cluster.

To restart the wlm-scheduler service run on each Trilio node:

systemctl restart wlm-scheduler

wlm-workloads

The wlm-workloads service is the task worker of Trilio, executing all jobs assigned to its Trilio node. It is running in active-active mode on all nodes of the Trilio cluster.

To restart the wlm-workloads service run on each Trilio node:

systemctl restart wlm-workloads
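
After a restart, the state of the three active-active wlm services can be verified on every node. A minimal sketch using standard systemd tooling:

# Print the state of each active-active wlm service; all should report "active"
for svc in wlm-api wlm-scheduler wlm-workloads; do
    systemctl is-active "$svc"
done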

wlm-cron

The wlm-cron service is responsible for starting scheduled backups according to the configuration of the Tenant Workloads. It is running in active-passive mode and is controlled by the pacemaker cluster.

To restart the wlm-cron service run on the Trilio node with the VIP assigned:

pcs resource restart wlm-cron
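
Which node currently holds the VIP and runs the wlm-cron resource can be checked beforehand with the standard pacemaker tooling that ships with the appliance:

# The resource overview lists the node on which wlm-cron and the VIPs are active
pcs status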

VIP resources

The Trilio appliance is running 1 to 4 virtual IPs on the Trilio cluster. These are controlled by the pacemaker cluster and served through NGINX.

To restart these resources, restart the pacemaker NGINX resource:

pcs resource restart lb_nginx-clone
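
After the restart, the presence of a virtual IP on a node can be confirmed with iproute2; <VIP-address> is a placeholder for one of the configured VIPs:

# The configured VIP should appear on an interface of exactly one node
ip addr show | grep <VIP-address>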

RabbitMQ

The Trilio cluster is using RabbitMQ as its messaging service. It is running in active-active mode on all nodes of the Trilio cluster.

RabbitMQ is a complex system in itself. This guide will only provide the basic commands to do a restart of a node and check the health of the cluster afterward. For complete documentation of how to restart RabbitMQ, please follow the official RabbitMQ documentation.

To restart a RabbitMQ node run on each Trilio node:

It is recommended to wait for the node to rejoin and sync with the cluster before restarting another RabbitMQ node.

[root@TVM1 ~]# rabbitmqctl stop
Stopping and halting node rabbit@TVM1 ...
[root@TVM1 ~]# rabbitmq-server -detached
Warning: PID file not written; -detached was passed.
[root@TVM1 ~]# rabbitmqctl cluster_status
Cluster status of node rabbit@TVM1 ...
[{nodes,[{disc,[rabbit@TVM1,rabbit@TVM2,rabbit@TVM3]}]},
 {running_nodes,[rabbit@TVM2,rabbit@TVM3,rabbit@TVM1]},
 {cluster_name,<<"rabbit@TVM1">>},
 {partitions,[{rabbit@TVM2,[rabbit@TVM1,rabbit@TVM3]},
              {rabbit@TVM3,[rabbit@TVM1,rabbit@TVM2]}]},
 {alarms,[{rabbit@TVM2,[]},{rabbit@TVM3,[]},{rabbit@TVM1,[]}]}]

When the complete cluster is stopped and restarted, it is important to keep the order of the nodes in mind: the last node to be stopped needs to be the first node to be started.
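
A sketch of a complete cluster restart that respects this order, assuming the three nodes are named TVM1, TVM2 and TVM3 as in the example output above, and TVM3 was the last node stopped:

# Stop the nodes one by one
rabbitmqctl stop              # on TVM1
rabbitmqctl stop              # on TVM2
rabbitmqctl stop              # on TVM3 (stopped last)

# Start the nodes again; the last node stopped must be started first
rabbitmq-server -detached     # on TVM3 (started first)
rabbitmq-server -detached     # on TVM2
rabbitmq-server -detached     # on TVM1

# Verify that all nodes have rejoined the cluster
rabbitmqctl cluster_status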

Galera Cluster (MariaDB)

The Galera Cluster is managing the Trilio MariaDB database. It is running in active-active mode on all nodes of the Trilio cluster.

Galera Cluster is a complex system in itself. This guide will only provide the basic commands to do a restart of a node and check the health of the cluster afterward. For complete documentation of how to restart Galera clusters, please follow the official Galera documentation.

When restarting Galera two different use-cases need to be considered:

  • Restarting a single node

  • Restarting the whole cluster

Restarting a single node

A single node can be restarted without any issues. It will automatically rejoin the cluster and sync against the remaining nodes.

The following commands will gracefully stop and restart the mysqld service.

After a restart the node will start the syncing process with the cluster. Do not restart node after node to achieve a complete cluster restart; use the procedure described below instead.

systemctl stop mysqld
systemctl start mysqld

Check the cluster health after the restart.
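
A quick check that the restarted node has finished syncing, using the same statement as in the health section below and assuming passwordless root access over the local socket:

# The node is healthy once the local state reports 'Synced'
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"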

Restarting the complete cluster

Restarting a complete cluster requires some additional steps, as the Galera cluster is effectively destroyed once all nodes have been shut down. It needs to be rebuilt afterwards.

First, gracefully shut down the Galera cluster on all nodes:

systemctl stop mysqld

The second step is to identify the Galera node with the latest dataset. This can be achieved by reading the grastate.dat file on the Trilio nodes.

When this documentation is followed, the last mysqld service that was shut down will be the one with the latest dataset.

cat /var/lib/mysql/grastate.dat

# GALERA saved state
version: 2.1
uuid:    353e129f-11f2-11eb-b3f7-76f39b7b455d
seqno:   213576545367
safe_to_bootstrap: 1

The value to check is the seqno.

The node with the highest seqno is the node that contains the latest data. This node will also show safe_to_bootstrap: 1 to indicate that the Galera cluster can be rebuilt from this node.
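
To compare the values across the cluster, the relevant lines can be read on each node. A minimal sketch, assuming SSH access between the appliance nodes and the node names used in the examples above:

# Print seqno and safe_to_bootstrap of every node; pick the node with the highest seqno
for node in TVM1 TVM2 TVM3; do
    echo "--- $node ---"
    ssh root@"$node" "grep -E 'seqno|safe_to_bootstrap' /var/lib/mysql/grastate.dat"
done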

On the identified node the new cluster is generated with the following command:

galera_new_cluster

Running galera_new_cluster on the wrong node will lead to data loss, as this command sets the node it is issued on as the first node of the cluster. All nodes which join afterward will sync against the data of this first node.

After the command has been issued, the mysqld service is running on this node. Now the other nodes can be restarted one by one. The started nodes will automatically rejoin the cluster and sync against the first node. Once a synced status has been reached, each node is a primary node in the cluster.

systemctl start mysqld

Check the cluster health after all services are up again.

Verify Health of the Galera Cluster

Verify the cluster health by running the following commands inside each Trilio MariaDB instance. The values returned by these statements have to be the same on each node.

MariaDB [(none)]> show status like 'wsrep_incoming_addresses';
+--------------------------+-------------------------------------------------+
| Variable_name            | Value                                           |
+--------------------------+-------------------------------------------------+
| wsrep_incoming_addresses | 10.10.2.13:3306,10.10.2.14:3306,10.10.2.12:3306 |
+--------------------------+-------------------------------------------------+
1 row in set (0.01 sec)

MariaDB [(none)]> show status like 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+
1 row in set (0.00 sec)

MariaDB [(none)]> show status like 'wsrep_cluster_state_uuid';
+--------------------------+--------------------------------------+
| Variable_name            | Value                                |
+--------------------------+--------------------------------------+
| wsrep_cluster_state_uuid | 353e129f-11f2-11eb-b3f7-76f39b7b455d |
+--------------------------+--------------------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> show status like 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.01 sec)
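
The four checks can also be combined into a single statement; a sketch assuming passwordless root access over the local socket:

# The returned values have to be the same on each node
mysql -e "SHOW STATUS WHERE Variable_name IN
          ('wsrep_incoming_addresses','wsrep_cluster_size',
           'wsrep_cluster_state_uuid','wsrep_local_state_comment');"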

Canonical workloadmgr container services

Canonical OpenStack is not using the Trilio Appliance. In Canonical environments the Trilio controller is part of the Juju deployment as the workloadmgr container.

To restart the services inside this container, issue the following commands.

Single Node deployment

juju ssh <workloadmgr unit name>/<unit-number>
sudo systemctl restart wlm-api wlm-scheduler wlm-workloads wlm-cron

HA deployment

On all nodes:

juju ssh <workloadmgr unit name>/<unit-number>
sudo systemctl restart wlm-api wlm-scheduler wlm-workloads

On a single node:

juju ssh <workloadmgr unit name>/<unit-number>
crm_resource --restart -r res_trilio_wlm_wlm_cron
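
The unit names and numbers used in the commands above can be looked up with Juju itself; a sketch, assuming the application is deployed under the name workloadmgr:

# The Unit column lists the <name>/<number> pairs used with juju ssh
juju status workloadmgr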

Trilio dmapi service

The Trilio dmapi service is running on the OpenStack controller nodes. Depending on the OpenStack distribution Trilio is installed on, different commands are used to restart the dmapi service.

RHOSP13

RHOSP13 is running the Trilio services as docker containers. The dmapi service can be restarted by issuing the following command on the host running the dmapi service.

docker restart trilio_dmapi
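
To confirm the container came back up, its status can be checked afterwards; on RHOSP16 the same check works with podman in place of docker:

# The STATUS column should show the container as freshly started, e.g. "Up 5 seconds"
docker ps --filter name=trilio_dmapi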

RHOSP16

RHOSP16 is running the Trilio services as podman containers. The dmapi service can be restarted by issuing the following command on the host running the dmapi service.

podman restart trilio_dmapi

Canonical

Canonical is running the Trilio services in Juju-controlled LXD containers. The dmapi service can be restarted by issuing the following commands from the MAAS node.

juju ssh <trilio-dm-api unit name>/<unit-number>
sudo systemctl restart tvault-datamover-api

Kolla-Ansible OpenStack

Kolla-Ansible OpenStack is running the Trilio services as docker containers. The dmapi service can be restarted by issuing the following command on the host running the dmapi service.

docker restart triliovault_datamover_api

Ansible OpenStack

Ansible OpenStack is running the Trilio services as LXC containers. The dmapi service can be restarted by stopping and starting its container on the host running the dmapi service.

lxc-stop -n <dmapi container name>
lxc-start -n <dmapi container name>
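
If the exact container name is not known, it can be listed first; a sketch assuming the standard LXC tooling on the host:

# List all containers and filter for the dmapi container to get its exact name
lxc-ls -f | grep dmapi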

Trilio datamover service (tvault-contego)

The Trilio datamover service is running on the OpenStack compute nodes. Depending on the OpenStack distribution Trilio is installed on, different commands are used to restart the datamover service.

RHOSP13

RHOSP13 is running the Trilio services as docker containers. The datamover service can be restarted by issuing the following command on the compute node.

docker restart trilio_datamover

RHOSP16

RHOSP16 is running the Trilio services as podman containers. The datamover service can be restarted by issuing the following command on the compute node.

podman restart trilio_datamover

Canonical

Canonical is running the Trilio services in Juju-controlled LXD containers. The datamover service can be restarted by issuing the following commands from the MAAS node.

juju ssh <trilio-data-mover unit name>/<unit-number>
sudo systemctl restart tvault-contego

Kolla-Ansible OpenStack

Kolla-Ansible OpenStack is running the Trilio services as docker containers. The datamover service can be restarted by issuing the following command on the compute node.

docker restart triliovault_datamover

Ansible OpenStack

Ansible OpenStack is running the Trilio datamover service directly on the compute node. The datamover service can be restarted by issuing the following command on the compute node:

service tvault-contego restart
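
The restart can be verified with the service status afterwards:

# Confirm the datamover is active again on the compute node
service tvault-contego status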
