General Troubleshooting Tips
Troubleshooting inside a complex environment like Openstack can be very time-consuming. The following tipps will help to speed up the troubleshooting process to identify root causes.
Openstack and Trilio are divided into multiple services. Each service has a very specific purpose that is called during a backup or recovery procedure. Knowing which service is doing what helps to understand where the error is happening, allowing more focused troubleshooting.
The Trilio Cluster is the Controller of Trilio. It receives all Workload related requests from the users.
Every task of a backup or restore process is triggered and managed from here. This includes the creation of the directory structure and initial metadata files on the Backup Target.
During a backup process is the Trilio cluster also responsible to gather the metadata about the backed up VMs and networks from the Openstack environment. It is sending API calls towards the Openstack endpoints on the configured endpoint type to fetch this information. Once the metadata has been received does the Trilio Cluster write it as json files on the Backup Target.
The Trilio cluster is also sending the Cinder Snapshot command.
During restore process is the Trilio cluster reading the VM metadata from its Database and is using the metadata to create the Shell for the restore. It is sending API calls to the Openstack environment to create the necessary resources.
The dmapi service is the connector between the Trilio cluster and the datamover running on the compute nodes.
The purpose of the dmapi service is to identify which compute node is responsible for the current backup or restore task. To do so is the dmapi service connecting to the nova api database requesting the compute hose of a provided VM.
Once the compute host has been identified is the dmapi forwarding the command from the Trilio Cluster to the datamover running on the identified compute host.
The datamover is the Trilio service running on the compute nodes.
Each datamover is responsible for the VMs running on top of its compute node. A datamover can not work with VMs running on a different compute node.
The datamover is controlling the freeze and thaw of VMs as well as the actual movement of the data.
Trilio is reading and writing on the Backup Target as nova:nova.
The POSIX user-id and group-id of nova:nova need to be aligned between the Trilio Cluster and all compute nodes. Otherwise backup or restores may fail with permission or file not found issues.
Alternativ ways to achieve the goal are possible, as long as all required nodes can fully write and read as nova:nova on the Backup Target.
It is recommended to verify the required permissions on the Backup Target in case of any errors during the data transfer phase or in case of any file permission errors.
On Cohesity NFS if an Input/Output error is observed, then increase the timeo and retrans parameter values in your NFS options.
Logging inside all datamover containers and add uxsock_timeout with value as 60000 which is equals to 60 sec inside /etc/multipath.conf.Restart datamover container
Trilio is using RBAC to allow the usage of Trilio features to users.
This trustee role is absolutely required and can not be overwritten using the admin role.
It is recommended to verify the assignment of the Trilio Trustee Role in case of any permission errors from Trilio during creation of Workloads, backups or restores.
Trilio is creating Cinder Snapshots and temporary Cinder Volumes. The Openstack Quotas need to allow that.
Every disk that is getting backed up requires one temporary Cinder Volumes.
Every Cinder Volume that is getting backup requires two Cinder Snapshots. The second Cinder Snapshot is temporary to calculate the incremental.
Once Trilio is configured use virtual IP to access its dashboard. If service status panels on the dashboard page are not visible then access the virtual IP on port 3001 (https://<T4O-VIP>:3001/) and accept the SSL exception, and then refresh the dashboard page.