General Troubleshooting Tips

Troubleshooting inside a complex environment like OpenStack can be very time-consuming.

The following tips will help to speed up the troubleshooting process to identify root causes.

What is happening where

OpenStack and Trilio are divided into multiple services. Each service has a very specific purpose that is called during a backup or recovery procedure. Knowing which service is doing what helps to understand where the error is happening, allowing more focused troubleshooting.

Trilio Workloadmgr

The Trilio Workloadmgr is the Controller of Trilio. It receives all Workload related requests from the users.

Every task of a backup or restore process is triggered and managed from here. This includes the creation of the directory structure and initial metadata files on the Backup Target.

During a backup process

During a backup process is the Trilio Workloadmgr also responsible for gathering the metadata about the backed-up VMs and networks from the OpenStack environment. It sends API calls to the OpenStack endpoints on the configured endpoint type to fetch this information. Once the metadata has been received, the Trilio Workloadmgr writes it as JSON files on the Backup Target.

The Trilio Workloadmgr is also sending the Cinder Snapshot command.

During a restore process

During the restore process, the Trilio Workloadmgr reads the VM metadata from its Database and uses the metadata to create the Shell for the restore. It sends API calls to the OpenStack environment to create the necessary resources.

Datamover API

The dmapi service is the connector between the Trilio cluster and the Datamover running on the compute nodes.

The purpose of the dmapi service is to identify which compute node is responsible for the current backup or restore task. To do so, the dmapi service connects to the nova API requesting the compute hose of a provided VM.

Once the compute host has been identified, the dmapi forwards the command from the Trilio Workloadmgr to the datamover running on the identified compute host.

Datamover

The datamover is the Trilio service running on the compute nodes.

Each datamover is responsible for the VMs running on top of its compute node. A datamover can not work with VMs running on a different compute node.

The datamover controls the freeze and thaw of VMs as well as the actual movement of the data.

Everything on the Backup Target is happening as user nova

Trilio is reading and writing on the Backup Target as nova:nova.

The POSIX user-id and group-id of nova:nova need to be aligned between the Trilio Cluster and all compute nodes. Otherwise, backup or restores may fail with permission or file not found issues.

Alternativ ways to achieve the goal are possible, as long as all required nodes can fully write and read as nova:nova on the Backup Target.

It is recommended to verify the required permissions on the Backup Target in case of any errors during the data transfer phase or in case of any file permission errors.

Input/Output Error with Cohesity NFS

On Cohesity NFS if an Input/Output error is observed, then increase the timeo and retrans parameter values in your NFS options.

Timeout receiving packet error in multipath ISCSi environment

Logging inside all datamover containers and add uxsock_timeout with value as 60000 which is equal to 60 sec inside /etc/multipath.conf.Restart datamover container

Trilio Trustee Role

Trilio is using RBAC to allow the usage of Trilio features to users.

This trustee role is absolutely required and can not be overwritten using the admin role.

It is recommended to verify the assignment of the Trilio Trustee Role in case of any permission errors from Trilio during the creation of Workloads, backups, or restores.

OpenStack Quotas

Trilio is creating Cinder Snapshots and temporary Cinder Volumes. The OpenStack Quotas need to allow that.

Every disk that is getting backed up requires one temporary Cinder Volumes.

Every Cinder Volume that is getting backup requires two Cinder Snapshots. The second Cinder Snapshot is temporary to calculate the incremental.

Last updated