Troubleshooting in a complex environment like OpenStack can be very time-consuming. The following tips help to speed up the troubleshooting process and identify root causes.
OpenStack and TrilioVault are divided into multiple services. Each service has a very specific purpose and is called during a backup or recovery procedure. Knowing which service does what helps to understand where an error is happening and allows more focused troubleshooting.
The TrilioVault Cluster is the Controller of TrilioVault. It receives all Workload related requests from the users.
Every task of a backup or restore process is triggered and managed from here. This includes the creation of the directory structure and initial metadata files on the Backup Target.
During a backup, the TrilioVault Cluster is also responsible for gathering the metadata about the backed-up VMs and networks from the OpenStack environment. It sends API calls to the OpenStack endpoints of the configured endpoint type to fetch this information. Once the metadata has been received, the TrilioVault Cluster writes it as JSON files to the Backup Target.
The TrilioVault Cluster also sends the Cinder Snapshot command.
During a restore, the TrilioVault Cluster reads the VM metadata from its database and uses it to create the shell for the restore. It sends API calls to the OpenStack environment to create the necessary resources.
The dmapi service is the connector between the TrilioVault cluster and the datamover running on the compute nodes.
The purpose of the dmapi service is to identify which compute node is responsible for the current backup or restore task. To do so, the dmapi service connects to the Nova API database and requests the compute host of the provided VM.
Once the compute host has been identified, the dmapi service forwards the command from the TrilioVault Cluster to the datamover running on that compute host.
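The lookup-and-forward behavior described above can be pictured with a small conceptual model. This is only a sketch for orientation during troubleshooting, not the dmapi implementation: the function name, the exception, and the in-memory mapping (a stand-in for the Nova API database lookup) are all hypothetical.

```python
from typing import Mapping


class UnknownVMError(LookupError):
    """Raised when no compute host is recorded for a VM."""


def route_to_datamover(vm_id: str, vm_to_host: Mapping[str, str]) -> str:
    """Return the compute host whose datamover has to handle this VM.

    vm_to_host is a hypothetical stand-in for the lookup that the dmapi
    service performs against the Nova API database.
    """
    try:
        return vm_to_host[vm_id]
    except KeyError:
        raise UnknownVMError(f"no compute host recorded for VM {vm_id}") from None


# Illustrative data only; real VM IDs are UUIDs.
placements = {"vm-a": "compute-01", "vm-b": "compute-02"}
print(route_to_datamover("vm-b", placements))  # prints "compute-02"
```

In practice this means that when a task fails after it has been forwarded, the datamover log worth checking is the one on the compute host that currently runs the VM.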
The datamover is the TrilioVault service running on the compute nodes.
Each datamover is responsible for the VMs running on its compute node. A datamover cannot work with VMs running on a different compute node.
The datamover controls the freeze and thaw of VMs as well as the actual movement of the data.
TrilioVault reads and writes on the Backup Target as nova:nova.
The POSIX user-id and group-id of nova:nova need to be aligned between the TrilioVault Cluster and all compute nodes. Otherwise, backups or restores may fail with permission or file-not-found errors.
Alternative ways to achieve this are possible, as long as all required nodes can fully write and read as nova:nova on the Backup Target.
It is recommended to verify the required permissions on the Backup Target in case of any errors during the data transfer phase or in case of any file permission errors.
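The checks described above can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not a TrilioVault tool: the helper names are hypothetical, the ids of each node have to be collected by running `local_ids()` on that node, and the directory passed to `can_write_and_read()` is whatever path the Backup Target is mounted on in the environment at hand.

```python
import grp
import pwd
import tempfile
from typing import Dict, Mapping, Tuple

IdPair = Tuple[int, int]


def local_ids(user: str = "nova", group: str = "nova") -> IdPair:
    """Return (uid, gid) of the given user and group on this node."""
    return pwd.getpwnam(user).pw_uid, grp.getgrnam(group).gr_gid


def find_id_mismatches(ids_by_node: Mapping[str, IdPair]) -> Dict[str, IdPair]:
    """Return the nodes whose (uid, gid) differs from the first node listed."""
    if not ids_by_node:
        return {}
    reference = next(iter(ids_by_node.values()))
    return {node: ids for node, ids in ids_by_node.items() if ids != reference}


def can_write_and_read(directory: str) -> bool:
    """Create, read back, and remove a small probe file in the directory."""
    payload = b"backup-target-probe"
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(payload)
            probe.flush()
            probe.seek(0)
            return probe.read() == payload
    except OSError:
        # Covers missing directories as well as permission errors.
        return False
```

The write and read probe is only meaningful when the script runs as the nova user on the node being checked. A non-empty result from `find_id_mismatches()` points to nodes whose nova:nova ids are not aligned with the first node in the mapping.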
TrilioVault uses RBAC to grant users access to TrilioVault features.
The TrilioVault Trustee Role is absolutely required and cannot be overridden by the admin role.
It is recommended to verify the assignment of the TrilioVault Trustee Role in case of any permission errors from TrilioVault during creation of Workloads, backups or restores.
TrilioVault creates Cinder Snapshots and temporary Cinder Volumes. The OpenStack quotas need to allow that.
Every disk that is backed up requires one temporary Cinder Volume.
Every Cinder Volume that is backed up requires two Cinder Snapshots. The second Cinder Snapshot is temporary and is used to calculate the incremental.
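The quota rules above reduce to simple arithmetic, which the following sketch spells out. The function and class names are hypothetical, and the sketch only counts objects as stated above; it does not consider size-based quotas or how many backups run in parallel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CinderNeeds:
    temporary_volumes: int
    snapshots: int


def estimate_cinder_needs(disks: int, cinder_volumes: int) -> CinderNeeds:
    """Estimate the Cinder objects needed while backing up one workload.

    One temporary volume per backed-up disk, two snapshots per backed-up
    Cinder volume (the second one being temporary).
    """
    if disks < 0 or cinder_volumes < 0:
        raise ValueError("counts must not be negative")
    return CinderNeeds(temporary_volumes=disks, snapshots=2 * cinder_volumes)


def fits_quota(needs: CinderNeeds, free_volumes: int, free_snapshots: int) -> bool:
    """Check the estimate against the remaining volume and snapshot quota."""
    return (needs.temporary_volumes <= free_volumes
            and needs.snapshots <= free_snapshots)


needs = estimate_cinder_needs(disks=3, cinder_volumes=2)
print(needs.temporary_volumes, needs.snapshots)  # prints "3 4"
```

If `fits_quota()` returns False for the project that owns the workload, a quota error from Cinder during the snapshot phase is a likely outcome.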