Trilio Workloads are designed to allow Disaster Recovery without the need to back up the Trilio database.
As long as the Trilio Workloads exist on the Backup Target Storage and a Trilio installation has access to them, the Workloads can be restored.
Notify users of Workloads being available
This procedure is designed to be applicable to all Openstack installations using Trilio. It is intended as a starting point for developing the exact Disaster Recovery process of a specific environment.
If the workloads are to be restored instead of notifying the users, a user with the privileges required to restore must exist in each Project.
Trilio incremental Snapshots use the prior backup as a backing file, which makes every Trilio incremental backup a synthetic full backup.
Trilio uses qcow2 backing files for this feature.
As can be seen in the example, the backing file is referenced by an absolute path, so this path must exist for the backing files to be accessible.
Trilio uses base64 encoding to derive the NFS mount paths, which allows multiple NFS Volumes to be configured at the same time. The resulting value, referred to as the hash value in this runbook, is calculated from the provided NFS path.
If the path of the backing file is not available on the Trilio appliance and the Compute nodes, restores of incremental backups will fail.
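Whether a restore would run into this problem can be checked before starting it. The following is a minimal Python sketch, assuming qemu-img is installed on the node: it walks the backing chain one image at a time with qemu-img info --output=json and returns the first backing file path that does not exist locally. Each image is queried individually rather than with --backing-chain so that a missing file is reported by name instead of as a qemu-img error. The function names are illustrative and not part of Trilio.

```python
import json
import os
import subprocess
from typing import Callable, Optional


def qemu_img_info(path: str) -> dict:
    """Return qemu-img's JSON description of a single image (no chain walk)."""
    result = subprocess.run(
        ["qemu-img", "info", "--output=json", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def first_missing_backing_file(
    image: str, info: Callable[[str], dict] = qemu_img_info
) -> Optional[str]:
    """Follow the backing chain of image and return the first backing path
    that is missing on this node, or None if the whole chain is reachable."""
    current = image
    seen = {image}
    while True:
        data = info(current)
        # qemu-img reports the stored path as "backing-filename" and the
        # resolved path as "full-backing-filename".
        backing = data.get("full-backing-filename") or data.get("backing-filename")
        if not backing:
            return None  # reached the base image of the chain
        if not os.path.exists(backing):
            return backing
        if backing in seen:
            raise ValueError(f"backing chain loops at {backing}")
        seen.add(backing)
        current = backing
```

Called with the disk file of the newest snapshot to be restored, a non-None result names the mount path that still has to be created.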
The tested and recommended method to make the backing files available is to create the required directory path and use mount --bind to make the path available for the backups.
Running the mount --bind command makes the necessary path available until the next reboot. If access to the path is required beyond a reboot, the fstab must be edited as well.
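To confirm that a bind mount is active, for example after a reboot, checking that the directory exists is not enough. Python's os.path.ismount is documented as unable to reliably detect bind mounts on the same filesystem, so this minimal sketch reads /proc/self/mountinfo instead (Linux only), where the fifth field of every line is the mount point.

```python
import os
import re
from typing import Iterable, Set

# mountinfo writes space, tab, newline and backslash as \ooo octal escapes.
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{3})")


def mount_points(mountinfo_lines: Iterable[str]) -> Set[str]:
    """Return all mount points listed in /proc/self/mountinfo content."""
    points = set()
    for line in mountinfo_lines:
        fields = line.split()
        if len(fields) < 5:
            continue
        # Field 5 (index 4) is the mount point relative to the process root.
        points.add(_OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), fields[4]))
    return points


def is_mounted(path: str, mountinfo: str = "/proc/self/mountinfo") -> bool:
    """True if path is currently a mount point, bind mounts included."""
    with open(mountinfo) as fh:
        return os.path.normpath(path) in mount_points(fh)
```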
This runbook will demonstrate how to set up Disaster Recovery with Trilio for a given scenario.
The chosen scenario is modeled on an actively used Trilio customer environment.
There are two Openstack clouds available: "Openstack Cloud A" and "Openstack Cloud B". "Openstack Cloud B" is the Disaster Recovery restore point of "Openstack Cloud A" and vice versa. Both clouds have an independent Trilio installation integrated. These Trilio installations write their backups to NFS targets: "Trilio A" writes to "NFS A1" and "Trilio B" writes to "NFS B1". Each of these NFS Volumes is synchronized with an NFS Volume on the other side: "NFS A1" syncs with "NFS B2" and "NFS B1" syncs with "NFS A2". The synchronization is set up independently from Trilio and always favors the newer dataset.
This scenario covers the Disaster Recovery of a single Workload and of a complete Cloud. All processes are performed by the Openstack administrator.
This runbook assumes that the following is true:
"Openstack Cloud A" and "Openstack Cloud B" both have an active Trilio installation with a valid license
"Openstack Cloud A" and "Openstack Cloud B" have free resources to host additional VMs
"Openstack Cloud A" and "Openstack Cloud B" have Tenants/Projects available that are the designated restore points for the Tenants/Projects of the other side
Access to a user with admin role permissions at domain level
One of the Openstack clouds is down/lost
For ease of writing, this runbook assumes from here on that "Openstack Cloud A" is down and the Workloads are being restored into "Openstack Cloud B".
When shared Tenant networks are used beyond the floating IP, the following additional requirement applies: all Tenant Networks, Routers, Ports, Floating IPs, and DNS Zones are created
In this scenario, Disaster Recovery of a single Workload can be performed while both Clouds are still active. To do so, the following high-level process needs to be followed:
Copy the Workload directories to the configured NFS Volume
Make the right Mount-Paths available
Reassign the Workload
Restore the Workload
Clean up
This process only shows how to get a Workload from "Openstack Cloud A" to "Openstack Cloud B". The reverse process is similar.
As only a single Workload is to be recovered, it is more efficient to copy the data of that single Workload over to the "NFS B1" Volume, which is used by "Trilio B".
It is recommended to use the Trilio VM as a connector between both NFS Volumes, as the nova user is available on the Trilio VM.
Trilio Workloads are identified by the ID under which they are stored on the Backup Target. See the example below:
If the Workload ID is not known, the metadata available inside the Workload directories can be used to identify the correct Workload.
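When many workloads are stored on a share, searching their metadata by hand is tedious. The following minimal Python sketch assumes, without confirmation from this runbook, that each workload lives in a directory named workload_<ID> and keeps its metadata as JSON in a file called workload_db. The key name display_name in the usage note is likewise an assumption to check against a real workload directory.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Entry = Tuple[str, Dict[str, Any]]


def workload_metadata(share_root: str, meta_file: str = "workload_db") -> Iterator[Entry]:
    """Yield (directory name, metadata) for each workload_* directory.

    Both the directory pattern and the metadata file name are assumptions.
    """
    for wl_dir in sorted(Path(share_root).glob("workload_*")):
        meta = wl_dir / meta_file
        if not meta.is_file():
            continue
        try:
            yield wl_dir.name, json.loads(meta.read_text())
        except json.JSONDecodeError:
            continue  # not JSON: skip it rather than abort the whole scan


def find_workloads(entries: Iterable[Entry], key: str, value: Any) -> List[str]:
    """Return the directory names whose metadata has metadata[key] == value."""
    return [name for name, meta in entries
            if isinstance(meta, dict) and meta.get(key) == value]
```

A search could then look like find_workloads(workload_metadata(share_mount_path), "display_name", "my-workload"), where share_mount_path and the workload name are placeholders.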
The identified workload needs to be copied with all subdirectories and files. Afterwards, the ownership must be changed to nova:nova with the correct permissions.
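The copy and the ownership change can be sketched in Python as follows, run on the Trilio VM with both NFS Volumes mounted (the share locations are placeholders). One caveat: shutil does not preserve sparse files, so for large qcow2 images a sparse-aware tool such as rsync with its --sparse option may be preferable. The sketch only illustrates the order of operations.

```python
import shutil
from pathlib import Path
from typing import Optional


def copy_workload(src_share: str, dst_share: str, workload_dir: str,
                  owner: Optional[str] = "nova",
                  group: Optional[str] = "nova") -> Path:
    """Copy one workload directory between two mounted shares and set ownership.

    copytree refuses to overwrite an existing destination, which protects a
    workload that already exists on the target share.
    """
    src = Path(src_share) / workload_dir
    dst = Path(dst_share) / workload_dir
    # The default copy function (copy2) keeps permission bits and timestamps.
    shutil.copytree(src, dst)
    if owner or group:
        # chown the directory itself and everything below it (requires root).
        for path in [dst, *dst.rglob("*")]:
            shutil.chown(path, user=owner, group=group)
    return dst
```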
Trilio backups use qcow2 backing files, which make every incremental backup a synthetic full backup. These backing files can be made visible using the qemu-img tool.
The MTAuMTAuMi4yMDovdXBzdHJlYW0= part of the backing file path is the base64 hash value, which a Trilio installation calculates for each configured NFS-Share.
This hash value is calculated from the provided NFS-Share path in the form <NFS_IP>:/<path>; MTAuMTAuMi4yMDovdXBzdHJlYW0=, for instance, decodes to 10.10.2.20:/upstream. If even one character differs between two NFS-Share paths, a different hash value, and therefore a different mount path, is generated.
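Reading the hash value back out of a backing file path shows which NFS share a backup originally came from. A minimal Python sketch, assuming the mount root /var/triliovault-mounts used throughout this runbook and an encoded value that contains no / character; the path components after the hash value in the test data are hypothetical.

```python
import base64
from pathlib import PurePosixPath

MOUNT_ROOT = "/var/triliovault-mounts"


def share_from_backing_path(backing_path: str, mount_root: str = MOUNT_ROOT) -> str:
    """Decode the NFS share from the first path component below mount_root."""
    relative = PurePosixPath(backing_path).relative_to(mount_root)
    return base64.b64decode(relative.parts[0]).decode()


def share_hash(nfs_share: str) -> str:
    """Encode an NFS share string the same way as the mount-path component."""
    return base64.b64encode(nfs_share.encode()).decode()
```

As an illustration of the sensitivity described above, share_hash("10.10.2.20:/upstream/") gives MTAuMTAuMi4yMDovdXBzdHJlYW0v: a single trailing slash is enough to produce a different mount path.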
Workloads that have been moved between NFS-Shares require that their incremental backups can follow the same path as on their original Source Cloud. To achieve this, the original mount path must be created on all compute nodes of the Target Cloud.
Afterwards, a bind mount is used to make the workload data accessible through both the old and the new mount path. The following example shows how to identify the necessary mount points and create the bind mount.
The hash values can be calculated with the base64 tool available in any Linux distribution.
Based on the identified base64 hash values, the following paths are required on each Compute node.
/var/triliovault-mounts/MTAuMTAuMi4yMDovdXBzdHJlYW1fc291cmNl
and
/var/triliovault-mounts/MTAuMjAuMy4yMjovdXBzdHJlYW1fdGFyZ2V0
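Decoding the two paths above gives the NFS shares 10.10.2.20:/upstream_source and 10.20.3.22:/upstream_target. The following minimal Python sketch goes the other way: it derives both mount paths from the share strings and prints the mkdir and mount --bind commands for a Compute node. It assumes that the target share is already mounted by Trilio B and that the source path is the one the backing files reference.

```python
import base64
import shlex
from typing import List

MOUNT_ROOT = "/var/triliovault-mounts"


def mount_path(nfs_share: str, mount_root: str = MOUNT_ROOT) -> str:
    """Mount path derived from an NFS share string (base64 encoded)."""
    return f"{mount_root}/{base64.b64encode(nfs_share.encode()).decode()}"


def bind_commands(source_share: str, target_share: str) -> List[str]:
    """Commands that expose the target share's data under the source path."""
    old_path = mount_path(source_share)  # path stored in the backing files
    new_path = mount_path(target_share)  # where the data actually lives now
    return [
        f"mkdir -p {shlex.quote(old_path)}",
        f"mount --bind {shlex.quote(new_path)} {shlex.quote(old_path)}",
    ]


if __name__ == "__main__":
    for cmd in bind_commands("10.10.2.20:/upstream_source",
                             "10.20.3.22:/upstream_target"):
        print(cmd)
```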
In the scenario of this runbook, the workload comes from the "NFS A1" NFS-Share, which means the mount path of that NFS-Share needs to be created and bind-mounted on the Target Cloud.
To keep the mount after a reboot, it is recommended to edit the fstab of all compute nodes accordingly.
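An fstab entry equivalent to mount --bind uses the bind option with none as the file system type. The sketch below builds such a line and appends it only once. After a reboot it is worth checking that the bind mount shows the NFS data: this only works if the source directory's NFS share is already mounted when the fstab entry is processed, which depends on how the share is mounted on the node.

```python
from typing import Tuple


def _fstab_escape(path: str) -> str:
    """fstab fields are whitespace separated; spaces are written as \\040."""
    return path.replace(" ", "\\040")


def fstab_bind_entry(existing_dir: str, extra_dir: str) -> str:
    """fstab equivalent of: mount --bind existing_dir extra_dir"""
    return f"{_fstab_escape(existing_dir)} {_fstab_escape(extra_dir)} none bind 0 0"


def add_entry(fstab_text: str, entry: str) -> Tuple[str, bool]:
    """Append entry unless an identical line exists; report whether it changed."""
    if any(line.split() == entry.split() for line in fstab_text.splitlines()):
        return fstab_text, False
    if fstab_text and not fstab_text.endswith("\n"):
        fstab_text += "\n"
    return fstab_text + entry + "\n", True
```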
Trilio workloads have clear ownership. When a workload is moved to a different cloud it is necessary to change the ownership. The ownership can only be changed by Openstack administrators.
To fulfill the required tasks, a user with the admin role is used. This user is used until the workload has been restored. Therefore, this user must be given access to the desired Target Project on the Target Cloud.
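Granting and later revoking this access can be done with the openstack role add and openstack role remove commands of the OpenStack client. The Python sketch below only assembles the argument lists so they can be reviewed before being passed to subprocess.run; user, project and domain names are placeholders.

```python
from typing import List


def role_command(action: str, user: str, project: str, role: str = "admin",
                 user_domain: str = "", project_domain: str = "") -> List[str]:
    """Build 'openstack role add|remove' arguments for one user and project."""
    if action not in ("add", "remove"):
        raise ValueError("action must be 'add' or 'remove'")
    cmd = ["openstack", "role", action, "--project", project, "--user", user]
    if project_domain:
        cmd += ["--project-domain", project_domain]
    if user_domain:
        cmd += ["--user-domain", user_domain]
    cmd.append(role)  # the role name is the positional argument
    return cmd
```

The same helper with "remove" covers the clean-up step at the end of the process.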
Each Trilio installation maintains a database of the workloads known to it. Workloads that are not maintained by a specific Trilio installation are, from the perspective of that installation, orphaned workloads. An orphaned workload is a workload accessible on the NFS-Share that is not assigned to any existing project in the Cloud the Trilio installation is protecting.
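The definition above boils down to a set comparison. A minimal Python sketch, assuming three inputs have already been collected: the workload IDs found on the share with the project ID recorded for each, the workload IDs known to this Trilio installation, and the IDs of the projects that exist in the cloud. How these inputs are gathered is left open here.

```python
from typing import Dict, List, Set


def orphaned_workloads(on_share: Dict[str, str], known_to_trilio: Set[str],
                       existing_projects: Set[str]) -> List[str]:
    """Return the IDs of workloads this installation would treat as orphaned.

    on_share maps workload ID -> project ID recorded for that workload.
    """
    return sorted(
        workload_id
        for workload_id, project_id in on_share.items()
        # Unknown to this installation's database, or owned by a project
        # that does not exist in the protected cloud.
        if workload_id not in known_to_trilio or project_id not in existing_projects
    )
```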
The identified orphaned workloads need to be assigned to their new projects. The following provides the list of all projects visible to the admin user in the target_domain.
To allow project owners to work with the workloads as well, the workloads are assigned to a user with the backup trustee role who exists in the target project.
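A suitable user can be found in the role assignments of the target project. This Python sketch filters rows as printed by openstack role assignment list --project <project> --names -f json. The column names Role and User are assumed from the client's table headers, and the trustee role name depends on the Trilio configuration; "trustee-role" in the test data is a placeholder.

```python
from typing import Dict, List


def trustee_users(assignments: List[Dict[str, str]], trustee_role: str) -> List[str]:
    """Users holding trustee_role, from 'role assignment list -f json' rows.

    Rows for group assignments have an empty "User" column and are skipped.
    """
    users = {row["User"] for row in assignments
             if row.get("Role") == trustee_role and row.get("User")}
    return sorted(users)
```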
Now that all information has been gathered, the workload can be reassigned to the target project.
After the workload has been assigned to the new project, it is recommended to verify that the workload is managed by the Target Trilio and is assigned to the right project and user.
This runbook will continue on the CLI only path.
To be able to do the necessary selective restore a few pieces of information about the snapshot to be restored are required. The following process will provide all necessary information.
List all Snapshots of the workload to identify the snapshot to restore
Get the Snapshot details with network details for the desired Snapshot
Get the Snapshot details with disk details for the desired Snapshot
The selective restore uses a restore.json file for the CLI command. This restore.json file needs to be adjusted according to the desired restore.
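When several restores are prepared, editing restore.json by hand is error-prone. The sketch below loads the file, applies a list of (key path, value) changes and writes it back. The key names in the test data are purely hypothetical; the real structure has to be taken from the restore.json template used for the CLI command.

```python
import json
from typing import Any, Iterable, Sequence, Tuple, Union

Key = Union[str, int]


def set_path(doc: Any, path: Sequence[Key], value: Any) -> None:
    """Set doc[path[0]]...[path[-1]] = value (dict keys or list indexes)."""
    target = doc
    for key in path[:-1]:
        target = target[key]  # a KeyError/IndexError signals a wrong path
    target[path[-1]] = value


def edit_restore_file(filename: str,
                      changes: Iterable[Tuple[Sequence[Key], Any]]) -> None:
    """Apply all changes to a JSON file in place, keeping it human readable."""
    with open(filename) as fh:
        doc = json.load(fh)
    for path, value in changes:
        set_path(doc, path, value)
    with open(filename, "w") as fh:
        json.dump(doc, fh, indent=2)
        fh.write("\n")
```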
To do the actual restore use the following command:
To verify the success of the restore from a Trilio perspective, check the restore status.
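Restores can take a long time, so checking the status in a loop is convenient. The following Python sketch polls a status callable until a final state is reached. The callable (for example a wrapper around the restore status command), the status names available and error, and the timing values are assumptions to adapt to what the environment actually reports.

```python
import time
from typing import Callable, Iterable


def wait_for_restore(get_status: Callable[[], str],
                     success: Iterable[str] = ("available",),
                     failure: Iterable[str] = ("error",),
                     interval: float = 30.0, timeout: float = 4 * 3600.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> str:
    """Poll get_status() until it returns a success or failure status."""
    ok_states, failed_states = set(success), set(failure)
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in ok_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"restore finished with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"restore still {status!r} after {timeout:.0f}s")
        sleep(interval)
```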
After the Disaster Recovery Process has been completed successfully, it is recommended to bring the TVM installation back into its original state so that it is ready for the next DR process.
Delete the workload that was restored.
The Trilio database follows the Openstack convention of not deleting database entries when the cloud object is deleted. Any Workload, Snapshot, or Restore that gets deleted is only marked as deleted.
To make the Trilio installation ready for another disaster recovery, the database entries of the restored Workloads must be deleted completely.
Trilio provides and maintains a script to safely delete workload entries and all connected entities from the Trilio database.
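The difference between marking a row as deleted and removing it can be illustrated with an in-memory SQLite table. The table and column names are a simplified, hypothetical stand-in rather than the Trilio schema, and the sketch is no replacement for the Trilio-provided script, which also removes the connected entities.

```python
import sqlite3
from typing import Tuple


def soft_vs_hard_delete() -> Tuple[int, int, int]:
    """Return (visible rows, stored rows) after a soft delete and the stored
    rows after a hard delete, for a single demo workload."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE workloads (id TEXT PRIMARY KEY, "
               "deleted INTEGER NOT NULL DEFAULT 0, deleted_at TEXT)")
    db.execute("INSERT INTO workloads (id) VALUES ('wl-1')")

    # Soft delete, as in the OpenStack convention: the row is only flagged.
    db.execute("UPDATE workloads SET deleted = 1, deleted_at = datetime('now') "
               "WHERE id = 'wl-1'")
    visible = db.execute("SELECT COUNT(*) FROM workloads WHERE deleted = 0").fetchone()[0]
    stored = db.execute("SELECT COUNT(*) FROM workloads").fetchone()[0]

    # Hard delete: the row is removed from the table.
    db.execute("DELETE FROM workloads WHERE id = 'wl-1'")
    after_purge = db.execute("SELECT COUNT(*) FROM workloads").fetchone()[0]
    db.close()
    return visible, stored, after_purge
```

The soft-deleted workload is invisible to normal queries but still stored, which is why a dedicated clean-up is needed.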
After all restores for the target project have been completed, it is recommended to remove the admin user from the project again.
This scenario covers the Disaster Recovery of a full cloud. It is assumed that the source cloud is down or completely lost. To perform the disaster recovery, the following high-level process needs to be followed:
Reconfigure the Target Trilio installation
Make the right Mount-Paths available
Reassign the Workload
Restore the Workload
Reconfigure the Target Trilio installation back to the original one
Clean up
Before the Disaster Recovery Process can start, the backups to be restored must be made available to the Trilio installation. The following steps need to be done to completely reconfigure the Trilio installation.
During the reconfiguration process, all backups of the Target Region are on hold, and it is not recommended to create new Backup Jobs until the Disaster Recovery Process has finished and the original Trilio configuration has been restored.
Edit the workloadmgr.conf
Look for the line defining the NFS mounts
Add NFS B2 to it as an entry in the comma-separated list. A space after the comma is not necessary, but can be set.
Write and close the workloadmgr.conf
Restart the wlm-workloads service
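The edit of the NFS list can also be scripted. This minimal Python sketch adds a share to, or removes it from, the comma-separated value in the configuration text while leaving all other lines untouched. The option name vault_storage_nfs_export is an assumption that must be checked against the line found in the file, and the NFS B2 share string in the test data is a placeholder.

```python
import re

NFS_OPTION = "vault_storage_nfs_export"  # assumed option name, verify in the file


def update_nfs_shares(conf_text: str, share: str, add: bool = True,
                      option: str = NFS_OPTION) -> str:
    """Add share to (or remove it from) the comma-separated NFS option."""
    # [ \t]* instead of \s* so the match never runs into the next line.
    pattern = re.compile(rf"^([ \t]*{re.escape(option)}[ \t]*=[ \t]*)(.*)$",
                         re.MULTILINE)
    match = pattern.search(conf_text)
    if match is None:
        raise KeyError(f"option {option!r} not found")
    shares = [s.strip() for s in match.group(2).split(",") if s.strip()]
    if add and share not in shares:
        shares.append(share)
    elif not add:
        shares = [s for s in shares if s != share]
    new_line = match.group(1) + ",".join(shares)
    return conf_text[:match.start()] + new_line + conf_text[match.end():]
```

Calling the same function with add=False reverts the change when the original configuration is restored.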
Trilio integrates natively with the Openstack deployment tools. When using Red Hat director or Juju charms, it is recommended to adapt the environment files for these orchestrators and update the Datamovers through them.
To add NFS B2 to the Trilio Datamovers manually, the tvault-contego.conf file needs to be edited and the service restarted.
Edit the tvault-contego.conf
Look for the line defining the NFS mounts
Add NFS B2 to it as an entry in the comma-separated list. A space after the comma is not necessary, but can be set.
Write and close the tvault-contego.conf
Restart the tvault-contego service
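Because workloadmgr.conf and tvault-contego.conf are edited separately, comparing the configured shares helps catch a share that was added to only one of them. This sketch uses configparser and assumes that the NFS option is named vault_storage_nfs_export and sits in the [DEFAULT] section; both are assumptions to verify.

```python
import configparser
from typing import Set, Tuple

NFS_OPTION = "vault_storage_nfs_export"  # assumed option name


def configured_shares(conf_text: str, option: str = NFS_OPTION) -> Set[str]:
    """Shares listed in the option of the [DEFAULT] section (assumed location)."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    value = parser.defaults().get(option, "")
    return {s.strip() for s in value.split(",") if s.strip()}


def share_mismatch(wlm_text: str, contego_text: str) -> Tuple[Set[str], Set[str]]:
    """Return (shares only in workloadmgr.conf, shares only in tvault-contego.conf)."""
    wlm = configured_shares(wlm_text)
    contego = configured_shares(contego_text)
    return wlm - contego, contego - wlm
```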
Trilio backups use qcow2 backing files, which make every incremental backup a synthetic full backup. These backing files can be made visible using the qemu-img tool.
The MTAuMTAuMi4yMDovdXBzdHJlYW0= part of the backing file path is the base64 hash value, which a Trilio installation calculates for each configured NFS-Share.
This hash value is calculated from the provided NFS-Share path in the form <NFS_IP>:/<path>; MTAuMTAuMi4yMDovdXBzdHJlYW0=, for instance, decodes to 10.10.2.20:/upstream. If even one character differs between two NFS-Share paths, a different hash value, and therefore a different mount path, is generated.
Workloads that have been moved between NFS-Shares require that their incremental backups can follow the same path as on their original Source Cloud. To achieve this, the original mount path must be created on all compute nodes of the Target Cloud.
Afterwards, a bind mount is used to make the workload data accessible through both the old and the new mount path. The following example shows how to identify the necessary mount points and create the bind mount.
The hash values can be calculated with the base64 tool available in any Linux distribution.
Based on the identified base64 hash values, the following paths are required on each Compute node.
/var/triliovault-mounts/MTAuMTAuMi4yMDovdXBzdHJlYW1fc291cmNl
and
/var/triliovault-mounts/MTAuMjAuMy4yMjovdXBzdHJlYW1fdGFyZ2V0
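On a node where several shares are already mounted, decoding the directory names under /var/triliovault-mounts shows which NFS share each directory belongs to, which helps to confirm the paths above. A minimal sketch; directory names that are not valid base64 are reported as such.

```python
import base64
import binascii
from pathlib import Path
from typing import Dict


def decode_name(name: str) -> str:
    """Decode one mount directory name back into its NFS share string."""
    try:
        return base64.b64decode(name, validate=True).decode()
    except (binascii.Error, UnicodeDecodeError):
        return "<not base64>"


def decode_mount_dirs(mount_root: str = "/var/triliovault-mounts") -> Dict[str, str]:
    """Map every directory below mount_root to the share it encodes."""
    root = Path(mount_root)
    if not root.is_dir():
        return {}
    return {entry.name: decode_name(entry.name)
            for entry in sorted(root.iterdir()) if entry.is_dir()}
```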
In the scenario of this runbook, the workload comes from the "NFS A1" NFS-Share, which means the mount path of that NFS-Share needs to be created and bind-mounted on the Target Cloud.
To keep the mount after a reboot, it is recommended to edit the fstab of all compute nodes accordingly.
Trilio workloads have clear ownership. When a workload is moved to a different cloud it is necessary to change the ownership. The ownership can only be changed by Openstack administrators.
To fulfill the required tasks, a user with the admin role is used. This user is used until the workload has been restored. Therefore, this user must be given access to the desired Target Project on the Target Cloud.
Each Trilio installation maintains a database of the workloads known to it. Workloads that are not maintained by a specific Trilio installation are, from the perspective of that installation, orphaned workloads. An orphaned workload is a workload accessible on the NFS-Share that is not assigned to any existing project in the Cloud the Trilio installation is protecting.
The identified orphaned workloads need to be assigned to their new projects. The following provides the list of all projects visible to the admin user in the target_domain.
To allow project owners to work with the workloads as well, the workloads are assigned to a user with the backup trustee role who exists in the target project.
Now that all information has been gathered, the workload can be reassigned to the target project.
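In a full cloud recovery there are usually many orphaned workloads from many source projects. A minimal Python sketch for planning the reassignment: given the source project of each orphaned workload and a mapping from source projects to their designated target projects (both prepared by the administrator and assumed here), it groups the workloads per target project and lists those without a mapping.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def plan_reassignment(orphans: Dict[str, str], project_map: Dict[str, str]
                      ) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group orphaned workloads by target project.

    orphans maps workload ID -> source project ID;
    project_map maps source project ID -> target project ID.
    """
    plan: Dict[str, List[str]] = defaultdict(list)
    unmapped: List[str] = []
    for workload_id, source_project in sorted(orphans.items()):
        target = project_map.get(source_project)
        if target is None:
            unmapped.append(workload_id)  # needs a decision before reassignment
        else:
            plan[target].append(workload_id)
    return dict(plan), unmapped
```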
After the workload has been assigned to the new project, it is recommended to verify that the workload is managed by the Target Trilio and is assigned to the right project and user.
This runbook will continue on the CLI only path.
To be able to do the necessary selective restore a few pieces of information about the snapshot to be restored are required. The following process will provide all necessary information.
List all Snapshots of the workload to identify the snapshot to restore
Get the Snapshot details with network details for the desired Snapshot
Get the Snapshot details with disk details for the desired Snapshot
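When many workloads are restored at once, picking the newest usable snapshot per workload can be automated. The sketch expects the snapshot list as dictionaries; the keys id, workload_id, status and created_at and the status value available are assumptions about the list output, and created_at must be an ISO 8601 timestamp that datetime.fromisoformat accepts.

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple


def latest_snapshots(snapshots: Iterable[dict],
                     status: str = "available") -> Dict[str, str]:
    """Return workload ID -> ID of its newest snapshot with the given status."""
    newest: Dict[str, Tuple[datetime, str]] = {}
    for snap in snapshots:
        if snap.get("status") != status:
            continue  # skip failed or still running snapshots
        created = datetime.fromisoformat(snap["created_at"])
        workload_id = snap["workload_id"]
        if workload_id not in newest or created > newest[workload_id][0]:
            newest[workload_id] = (created, snap["id"])
    return {wid: snap_id for wid, (_, snap_id) in newest.items()}
```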
The selective restore uses a restore.json file for the CLI command. This restore.json file needs to be adjusted according to the desired restore.
To do the actual restore use the following command:
To verify the success of the restore from a Trilio perspective, check the restore status.
After the Disaster Recovery Process has finished, the Trilio installation must be returned to its original configuration. The following steps need to be done to completely reconfigure the Trilio installation.
During the reconfiguration process, all backups of the Target Region are on hold, and it is not recommended to create new Backup Jobs until the original Trilio configuration has been restored.
Edit the workloadmgr.conf
Look for the line defining the NFS mounts
Delete NFS B2 from the comma-separated list
Write and close the workloadmgr.conf
Restart the wlm-workloads service
Trilio integrates natively with the Openstack deployment tools. When using Red Hat director or Juju charms, it is recommended to adapt the environment files for these orchestrators and update the Datamovers through them.
To remove NFS B2 from the Trilio Datamovers manually, the tvault-contego.conf file needs to be edited and the service restarted.
Edit the tvault-contego.conf
Look for the line defining the NFS mounts
Delete NFS B2 from the comma-separated list
Write and close the tvault-contego.conf
Restart the tvault-contego service
After the Disaster Recovery Process has been completed successfully and the Trilio installation has been reconfigured to its original state, it is recommended to perform the following additional steps to be ready for the next Disaster Recovery process.
The Trilio database follows the Openstack convention of not deleting database entries when the cloud object is deleted. Any Workload, Snapshot, or Restore that gets deleted is only marked as deleted.
To make the Trilio installation ready for another disaster recovery, the database entries of the restored Workloads must be deleted completely.
Trilio provides and maintains a script to safely delete workload entries and all connected entities from the Trilio database.
After all restores for the target project have been completed, it is recommended to remove the admin user from the project again.
The reassigned workload can be restored using Horizon following the procedure described .
This script can be found here:
To add NFS-Vol2 to the Trilio Appliance cluster, Trilio can either be reconfigured to use both NFS Volumes, or the configuration file can be edited and all services restarted. This procedure describes how to edit the configuration file and restart the services. This needs to be repeated on every Trilio Appliance.