This runbook will demonstrate how to set up Disaster Recovery with Trilio for a given scenario.
The chosen scenario is following an actively used Trilio customer environment.
Scenario
There are two Openstack clouds available "Openstack Cloud A" and Openstack Cloud B".
"Openstack Cloud B" is the Disaster Recovery restore point of "Openstack Cloud A" and vice versa.
Both clouds have an independent Trilio installation integrated.
These Trilio installations are writing their Backups to NFS targets.
"Trilio A" is writing to "NFS A1" and "Trilio B" is writing to "NFS B1".
The NFS Volumes used are getting synced against another NFS Volume on the other side.
"NFS A1" is syncing with "NFS B2" and "NFS B1" is syncing with "NFS A2".
The syncing process is set up independently from Trilio and will always favor the newer dataset.
This scenario will cover the Disaster Recovery of a single Workload and a complete Cloud. All processes are done be the Openstack administrator.
Prerequisites for the Disaster Recovery process
This runbook will assume that the following is true:
"Openstack Cloud A" and "Openstack Cloud B" both have an active Trilio installation with a valid license
"Openstack Cloud A" and "Openstack Cloud B" have free resources to host additional VMs
"Openstack Cloud A" and "Openstack Cloud B" have Tenants/Projects available that are the designated restore points for Tenant/Projects of the other side
Access to a user with the admin role permissions on domain level
One of the Openstack clouds is down/lost
For ease of writing will this runbook assume from here on, that "Openstack Cloud A" is down and the Workloads are getting restored into "Openstack Cloud B".
In the case of the usage of shared Tenant networks, beyond the floating IP, the following additional requirement is needed:
All Tenant Networks, Routers, Ports, Floating IPs, and DNS Zones are created
Disaster Recovery of a single Workload
A single Workload can do a Disaster Recovery in this Scenario, while both Clouds are still active. To do so the following high-level process needs to be followed:
Copy the Workload directories to the configured NFS Volume
Make the right Mount-Paths available
Reassign the Workload
Restore the Workload
Clean up
Copy the Workload directories to the configured NFS Volume
This process only shows how to get a Workload from "Openstack Cloud A" to "Openstack Cloud B". The vice versa process is similar.
As only a single Workload is to be recovered it is more efficient to copy the data of that single Workload over to the "NFS B1" Volume, which is used by "Trilio B".
Mount "NFS B2" Volume to a Trilio VM
It is recommended to use the Trilio VM as a connector between both NFS Volumes, as the nova user is available on the Trilio VM.
# mount <NFS B2-IP/NFS B2-FQDN>:/<VOL-Path> /mnt
Identify the Workload on the "NFS B2" Volume
Trilio Workloads are identified by their ID und which they are stored on the Backup Target. See below example:
workload_ac9cae9b-5e1b-4899-930c-6aa0600a2105
In the case that the Workload ID is not known can available Metadata inside the Workload directories be used to identify the correct Workload.
/…/workload_<id>/workload_db <<< Contains User ID and Project ID of Workload owner
/…/workload_<id>/workload_vms_db <<< Contains VM IDs and VM Names of all VMs actively protected be the Workload
Copy the Workload
The identified workload needs to be copied with all subdirectories and files. Afterward, it is necessary to adjust the ownership to nova:nova with the right permissions.
Trilio backups are using qcow2 backing files, which make every incremental backup a full synthetic backup.
These backing files can be made visible using the qemu-img tool.
#qemu-img info bd57ec9b-c4ac-4a37-a4fd-5c9aa002c778
image: bd57ec9b-c4ac-4a37-a4fd-5c9aa002c778
file format: qcow2
virtual size: 1.0G (1073741824 bytes)
disk size: 516K
cluster_size: 65536
backing file: /var/triliovault-mounts/MTAuMTAuMi4yMDovdXBzdHJlYW0=/workload_ac9cae9b-5e1b-4899-930c-6aa0600a2105/snapshot_1415095d-c047-400b-8b05-c88e57011263/vm_id_38b620f1-24ae-41d7-b0ab-85ffc2d7958b/vm_res_id_d4ab3431-5ce3-4a8f-a90b-07606e2ffa33_vda/7c39eb6a-6e42-418e-8690-b6368ecaa7bb
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
The MTAuMTAuMi4yMDovdXBzdHJlYW0= part of the backing file path is the base64 hash value, which will be calculated upon the configuration of a Trilio installation for each provided NFS-Share.
This hash value is calculated based on the provided NFS-Share path: <NFS_IP>/<path>
If even one character in the NFS-Share path is different between the provided NFS-Share paths a completely different hash value is generated.
Workloads, that have been moved between NFS-Shares, require that their incremental backups can follow the same path as on their original Source Cloud. To achieve this it is necessary to create the mount path on all compute nodes of the Target Cloud.
Afterwards a mount bind is used to make the workloads data accessible over the old and the new mount path. The following example shows the process of how to successfully identify the necessary mount points and create the mount bind.
Identify the base64 hash values
The used hash values can be calculated using the base64 tool in any Linux distribution.
In the scenario of this runbook is the workload coming from the NFS_A1 NFS-Share, which means the mount path of that NFS-Share needs to be created and bound to the Target Cloud.
Trilio workloads have clear ownership. When a workload is moved to a different cloud it is necessary to change the ownership. The ownership can only be changed by Openstack administrators.
Add admin-user to required domains and projects
To fulfill the required tasks an admin role user is used.
This user will be used until the workload has been restored.
Therefore, it is necessary to provide this user access to the desired Target Project on the Target Cloud.
Discover orphaned Workloads from NFS-Storage of Target Cloud
Each Trilio installation maintains a database of workloads that are known to the Trilio installation. Workloads that are not maintained by a specific Trilio installation, are from the perspective of that installation, orphaned workloads.
An orphaned workload is a workload accessible on the NFS-Share, that is not assigned to any existing project in the Cloud the Trilio installation is protecting.
# workloadmgr workload-get-orphaned-workloads-list --migrate_cloud True
+------------+--------------------------------------+----------------------------------+----------------------------------+
| Name | ID | Project ID | User ID |
+------------+--------------------------------------+----------------------------------+----------------------------------+
| Workload_1 | 6639525d-736a-40c5-8133-5caaddaaa8e9 | 4224d3acfd394cc08228cc8072861a35 | 329880dedb4cd357579a3279835f392 |
| Workload_2 | 904e72f7-27bb-4235-9b31-13a636eb9c95 | 637a9ce3fd0d404cabf1a776696c9c04 | 329880dedb4cd357579a3279835f392 |
+------------+--------------------------------------+----------------------------------+----------------------------------+
List available projects on Target Cloud in the Target Domain
The identified orphaned workloads need to be assigned to their new projects. The following provides the list of all available projects viewable by the used admin-user in the target_domain.
# openstack project list --domain <target_domain>
+----------------------------------+----------+
| ID | Name |
+----------------------------------+----------+
| 01fca51462a44bfa821130dce9baac1a | project1 |
| 33b4db1099ff4a65a4c1f69a14f932ee | project2 |
| 9139e694eb984a4a979b5ae8feb955af | project3 |
+----------------------------------+----------+
List available users on the Target Cloud in the Target Project that have the right backup trustee role
To allow project owners to work with the workloads as well will they get assigned to a user with the backup trustee role that is existing in the target project.
Now that all informations have been gathered the workload can be reassigned to the target project.
# workloadmgr workload-reassign-workloads --new_tenant_id {target_project_id} --user_id {target_user_id} --workload_ids {workload_id} --migrate_cloud True
+-----------+--------------------------------------+----------------------------------+----------------------------------+
| Name | ID | Project ID | User ID |
+-----------+--------------------------------------+----------------------------------+----------------------------------+
| project1 | 904e72f7-27bb-4235-9b31-13a636eb9c95 | 4f2a91274ce9491481db795dcb10b04f | 3e05cac47338425d827193ba374749cc |
+-----------+--------------------------------------+----------------------------------+----------------------------------+
Verify the workload is available at the desired target_project
After the workload has been assigned to the new project it is recommended to verify the workload is managed by the Target Trilio and is assigned to the right project and user.
The reassigned workload can be restored using Horizon following the procedure described here.
This runbook will continue on the CLI only path.
Prepare the selective restore by getting the snapshot information
To be able to do the necessary selective restore a few pieces of information about the snapshot to be restored are required. The following process will provide all necessary information.
List all Snapshots of the workload to restore to identify the snapshot to restore
# workloadmgr snapshot-list --workload_id ac9cae9b-5e1b-4899-930c-6aa0600a2105 --all True
+----------------------------+--------------+--------------------------------------+--------------------------------------+---------------+-----------+-----------+
| Created At | Name | ID | Workload ID | Snapshot Type | Status | Host |
+----------------------------+--------------+--------------------------------------+--------------------------------------+---------------+-----------+-----------+
| 2019-11-02T02:30:02.000000 | jobscheduler | f5b8c3fd-c289-487d-9d50-fe27a6561d78 | ac9cae9b-5e1b-4899-930c-6aa0600a2105 | full | available | Upstream2 |
| 2019-11-03T02:30:02.000000 | jobscheduler | 7e39e544-537d-4417-853d-11463e7396f9 | ac9cae9b-5e1b-4899-930c-6aa0600a2105 | incremental | available | Upstream2 |
| 2019-11-04T02:30:02.000000 | jobscheduler | 0c086f3f-fa5d-425f-b07e-a1adcdcafea9 | ac9cae9b-5e1b-4899-930c-6aa0600a2105 | incremental | available | Upstream2 |
+----------------------------+--------------+--------------------------------------+--------------------------------------+---------------+-----------+-----------+
Get Snapshot Details with network details for the desired snapshot
Prepare the selective restore by creating the restore.json file
The selective restore is using a restore.json file for the CLI command. This restore.json file needs to be adjusted according to the desired restore.
{
u'description':u'<description of the restore>',
u'oneclickrestore':False,
u'restore_type':u'selective',
u'type':u'openstack',
u'name':u'<name of the restore>'
u'openstack':{
u'instances':[
{
u'name':u'<name instance 1>',
u'availability_zone':u'<AZ instance 1>',
u'nics':[ #####Leave empty for network topology restore
],
u'vdisks':[
{
u'id':u'<old disk id>',
u'new_volume_type':u'<new volume type name>',
u'availability_zone':u'<new cinder volume AZ>'
}
],
u'flavor':{
u'ram':<RAM in MB>,
u'ephemeral':<GB of ephemeral disk>,
u'vcpus':<# vCPUs>,
u'swap':u'<GB of Swap disk>',
u'disk':<GB of boot disk>,
u'id':u'<id of the flavor to use>'
},
u'include':<True/False>,
u'id':u'<old id of the instance>'
} #####Repeat for each instance in the snapshot
],
u'restore_topology':<True/False>,
u'networks_mapping':{
u'networks':[ #####Leave empty for network topology restore
]
}
}
}
Run the selective restore
To do the actual restore use the following command:
After the Desaster Recovery Process has been successfully completed it is recommended to bring the TVM installation back into its original state to be ready for the next DR process.
Delete the workload
Delete the workload that got restored.
# workloadmgr workload-delete <workload_id>
Remove the database entry
The Trilio database is following the Openstack standard of not deleting any database entries upon deletion of the cloud object. Any Workload, Snapshot or Restore, which gets deleted, is marked as deleted only.
To allow the Trilio installation to be ready for another disaster recovery it is necessary to completely delete the entries of the Workloads, which have been restored.
Trilio does provide and maintain a script to safely delete workload entries and all connected entities from the Trilio database.
This Scenario will cover the Disaster Recovery of a full cloud. It is assumed that the source cloud is down or lost completly. To do the disaster recovery the following high-level process needs to be followed:
Reconfigure the Target Trilio installation
Make the right Mount-Paths available
Reassign the Workload
Restore the Workload
Reconfigure the Target Trilio installation back to the original one
Clean up
Reconfigure the Target Trilio installation
Before the Desaster Recovery Process can start is it necessary to make the backups to be restored available for the Trilio installation.
The following steps need to be done to completely reconfigure the Trilio installation.
During the reconfiguration process will all backups of the Target Region be on hold and it is not recommended to create new Backup Jobs until the Desaster Recovery Process has finished and the original Trilio configuration has been restored.
Add NFS B2 to the Trilio Appliance Cluster
To add the NFS-Vol2 to the Trilio Appliance cluster the Trilio can either be fully reconfigured to use both NFS Volumes or it is possible to edit the configuration file and then restart all services. This procedure describes how to edit the conf file and restart the services. This needs to be repeated on every Trilio Appliance.
Trilio is integrating natively into the Openstack deployment tools. When using the Red Hat director or JuJu charms it is recommended to adapt the environment files for these orchestrators and update the Datamovers through them.
To add the NFS B2 to the Trilio Datamovers manually the tvault-contego.conf file needs to be edited and the service restarted.
Trilio backups are using qcow2 backing files, which make every incremental backup a full synthetic backup.
These backing files can be made visible using the qemu-img tool.
#qemu-img info bd57ec9b-c4ac-4a37-a4fd-5c9aa002c778
image: bd57ec9b-c4ac-4a37-a4fd-5c9aa002c778
file format: qcow2
virtual size: 1.0G (1073741824 bytes)
disk size: 516K
cluster_size: 65536
backing file: /var/triliovault-mounts/MTAuMTAuMi4yMDovdXBzdHJlYW0=/workload_ac9cae9b-5e1b-4899-930c-6aa0600a2105/snapshot_1415095d-c047-400b-8b05-c88e57011263/vm_id_38b620f1-24ae-41d7-b0ab-85ffc2d7958b/vm_res_id_d4ab3431-5ce3-4a8f-a90b-07606e2ffa33_vda/7c39eb6a-6e42-418e-8690-b6368ecaa7bb
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
The MTAuMTAuMi4yMDovdXBzdHJlYW0= part of the backing file path is the base64 hash value, which will be calculated upon the configuration of a Trilio installation for each provided NFS-Share.
This hash value is calculated based on the provided NFS-Share path: <NFS_IP>/<path>
If even one character in the NFS-Share path is different between the provided NFS-Share paths a completely different hash value is generated.
Workloads, that have been moved between NFS-Shares, require that their incremental backups can follow the same path as on their original Source Cloud. To achieve this it is necessary to create the mount path on all compute nodes of the Target Cloud.
Afterwards a mount bind is used to make the workloads data accessible over the old and the new mount path. The following example shows the process of how to successfully identify the necessary mount points and create the mount bind.
Identify the base64 hash values
The used hash values can be calculated using the base64 tool in any Linux distribution.
In the scenario of this runbook is the workload coming from the NFS_A1 NFS-Share, which means the mount path of that NFS-Share needs to be created and bound to the Target Cloud.
Trilio workloads have clear ownership. When a workload is moved to a different cloud it is necessary to change the ownership. The ownership can only be changed by Openstack administrators.
Add admin-user to required domains and projects
To fulfill the required tasks an admin role user is used.
This user will be used until the workload has been restored.
Therefore, it is necessary to provide this user access to the desired Target Project on the Target Cloud.
Discover orphaned Workloads from NFS-Storage of Target Cloud
Each Trilio installation maintains a database of workloads that are known to the Trilio installation. Workloads that are not maintained by a specific Trilio installation, are from the perspective of that installation, orphaned workloads.
An orphaned workload is a workload accessible on the NFS-Share, that is not assigned to any existing project in the Cloud the Trilio installation is protecting.
# workloadmgr workload-get-orphaned-workloads-list --migrate_cloud True
+------------+--------------------------------------+----------------------------------+----------------------------------+
| Name | ID | Project ID | User ID |
+------------+--------------------------------------+----------------------------------+----------------------------------+
| Workload_1 | 6639525d-736a-40c5-8133-5caaddaaa8e9 | 4224d3acfd394cc08228cc8072861a35 | 329880dedb4cd357579a3279835f392 |
| Workload_2 | 904e72f7-27bb-4235-9b31-13a636eb9c95 | 637a9ce3fd0d404cabf1a776696c9c04 | 329880dedb4cd357579a3279835f392 |
+------------+--------------------------------------+----------------------------------+----------------------------------+
List available projects on Target Cloud in the Target Domain
The identified orphaned workloads need to be assigned to their new projects. The following provides the list of all available projects viewable by the used admin-user in the target_domain.
# openstack project list --domain <target_domain>
+----------------------------------+----------+
| ID | Name |
+----------------------------------+----------+
| 01fca51462a44bfa821130dce9baac1a | project1 |
| 33b4db1099ff4a65a4c1f69a14f932ee | project2 |
| 9139e694eb984a4a979b5ae8feb955af | project3 |
+----------------------------------+----------+
List available users on the Target Cloud in the Target Project that have the right backup trustee role
To allow project owners to work with the workloads as well will they get assigned to a user with the backup trustee role that is existing in the target project.
Now that all informations have been gathered the workload can be reassigned to the target project.
# workloadmgr workload-reassign-workloads --new_tenant_id {target_project_id} --user_id {target_user_id} --workload_ids {workload_id} --migrate_cloud True
+-----------+--------------------------------------+----------------------------------+----------------------------------+
| Name | ID | Project ID | User ID |
+-----------+--------------------------------------+----------------------------------+----------------------------------+
| project1 | 904e72f7-27bb-4235-9b31-13a636eb9c95 | 4f2a91274ce9491481db795dcb10b04f | 3e05cac47338425d827193ba374749cc |
+-----------+--------------------------------------+----------------------------------+----------------------------------+
Verify the workload is available at the desired target_project
After the workload has been assigned to the new project it is recommended to verify the workload is managed by the Target Trilio and is assigned to the right project and user.
The reassigned workload can be restored using Horizon following the procedure described here.
This runbook will continue on the CLI only path.
Prepare the selective restore by getting the snapshot information
To be able to do the necessary selective restore a few pieces of information about the snapshot to be restored are required. The following process will provide all necessary information.
List all Snapshots of the workload to restore to identify the snapshot to restore
# workloadmgr snapshot-list --workload_id ac9cae9b-5e1b-4899-930c-6aa0600a2105 --all True
+----------------------------+--------------+--------------------------------------+--------------------------------------+---------------+-----------+-----------+
| Created At | Name | ID | Workload ID | Snapshot Type | Status | Host |
+----------------------------+--------------+--------------------------------------+--------------------------------------+---------------+-----------+-----------+
| 2019-11-02T02:30:02.000000 | jobscheduler | f5b8c3fd-c289-487d-9d50-fe27a6561d78 | ac9cae9b-5e1b-4899-930c-6aa0600a2105 | full | available | Upstream2 |
| 2019-11-03T02:30:02.000000 | jobscheduler | 7e39e544-537d-4417-853d-11463e7396f9 | ac9cae9b-5e1b-4899-930c-6aa0600a2105 | incremental | available | Upstream2 |
| 2019-11-04T02:30:02.000000 | jobscheduler | 0c086f3f-fa5d-425f-b07e-a1adcdcafea9 | ac9cae9b-5e1b-4899-930c-6aa0600a2105 | incremental | available | Upstream2 |
+----------------------------+--------------+--------------------------------------+--------------------------------------+---------------+-----------+-----------+
Get Snapshot Details with network details for the desired snapshot
Prepare the selective restore by creating the restore.json file
The selective restore is using a restore.json file for the CLI command. This restore.json file needs to be adjusted according to the desired restore.
{
u'description':u'<description of the restore>',
u'oneclickrestore':False,
u'restore_type':u'selective',
u'type':u'openstack',
u'name':u'<name of the restore>'
u'openstack':{
u'instances':[
{
u'name':u'<name instance 1>',
u'availability_zone':u'<AZ instance 1>',
u'nics':[ #####Leave empty for network topology restore
],
u'vdisks':[
{
u'id':u'<old disk id>',
u'new_volume_type':u'<new volume type name>',
u'availability_zone':u'<new cinder volume AZ>'
}
],
u'flavor':{
u'ram':<RAM in MB>,
u'ephemeral':<GB of ephemeral disk>,
u'vcpus':<# vCPUs>,
u'swap':u'<GB of Swap disk>',
u'disk':<GB of boot disk>,
u'id':u'<id of the flavor to use>'
},
u'include':<True/False>,
u'id':u'<old id of the instance>'
} #####Repeat for each instance in the snapshot
],
u'restore_topology':<True/False>,
u'networks_mapping':{
u'networks':[ #####Leave empty for network topology restore
]
}
}
}
Run the selective restore
To do the actual restore use the following command:
Reconfigure the Target Trilio installation back to the original one
After the Desaster Recovery Process has finished it is necessary to return the Trilio installation to its original configuration.
The following steps need to be done to completely reconfigure the Trilio installation.
During the reconfiguration process will all backups of the Target Region be on hold and it is not recommended to create new Backup Jobs until the Desaster Recovery Process has finished and the original Trilio configuration has been restored.
Delete NFS B2 to the Trilio Appliance Cluster
To add the NFS-Vol2 to the Trilio Appliance cluster the Trilio can either be fully reconfigured to use both NFS Volumes or it is possible to edit the configuration file and then restart all services. This procedure describes how to edit the conf file and restart the services. This needs to be repeated on every Trilio Appliance.
Trilio is integrating natively into the Openstack deployment tools. When using the Red Hat director or JuJu charms it is recommended to adapt the environment files for these orchestrators and update the Datamovers through them.
To add the NFS B2 to the Trilio Datamovers manually the tvault-contego.conf file needs to be edited and the service restarted.
After the Desaster Recovery Process has been successfully completed and the Trilio installation reconfigured to its original state, it is recommended to do the following additional steps to be ready for the next Disaster Recovery process.
Remove the database entry
The Trilio database is following the Openstack standard of not deleting any database entries upon deletion of the cloud object. Any Workload, Snapshot or Restore, which gets deleted, is marked as deleted only.
To allow the Trilio installation to be ready for another disaster recovery it is necessary to completely delete the entries of the Workloads, which have been restored.
Trilio does provide and maintain a script to safely delete workload entries and all connected entities from the Trilio database.
Trilio Workloads are designed to allow a Desaster Recovery without the need to backup the Trilio database.
As long as the Trilio Workloads are existing on the Backup Target Storage and a Trilio installation has access to them, it is possible to restore the Workloads.
This procedure is designed to be applicable to all Openstack installations using Trilio. It is to be used as a starting point to develop the exact Desaster Recovery process of a specific environment.
In case that instead of noticing the users, the workloads shall be restored is it necessary to have an User in each Project, that has the necessary privileges to restore.
Mount-paths
Trilio incremental Snapshots involve a backing file to the prior backup taken, which makes every Trilio incremental backup a synthetic full backup.
Trilio is using qcow2 backing files for this feature:
qemu-img info 85b645c5-c1ea-4628-b5d8-1faea0e9d549
image: 85b645c5-c1ea-4628-b5d8-1faea0e9d549
file format: qcow2
virtual size: 1.0G (1073741824 bytes)
disk size: 21M
cluster_size: 65536
backing file: /var/triliovault-mounts/MTAuMTAuMi4yMDovdXBzdHJlYW0=/workload_3c2fbee5-ad90-4448-b009-5047bcffc2ea/snapshot_f4874ed7-fe85-4d7d-b22b-082a2e068010/vm_id_9894f013-77dd-4514-8e65-818f4ae91d1f/vm_res_id_9ae3a6e7-dffe-4424-badc-bc4de1a18b40_vda/a6289269-3e72-4085-adca-e228ba656984
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
As can be seen in the example is the backing file an absolute path, which makes it necessary, that this path exists so the backing files can be accessed.
Trilio is using the base64 hashing algorithm for the NFS mount-paths, to allow the configuration of multiple NFS Volumes at the same time. The hash value is calculated using the provided NFS path.
When the path of the backing file is not available on the Trilio appliance and Compute nodes, will the restores of incremental backups fail.
The tested and recommended method to make the backing files available is creating the required directory path and using mount --bind to make the path available for the backups.
#mount --bind <mount-path1> <mount-path2>
Running the mount --bind command will make the necessary path available until the next reboot. If it is required to have access to the path beyond a reboot is it necessary to edit the fstab.