Trilio 3.0 Release Notes

Trilio release 3.0 introduces new features and capabilities, including support for S3 storage targets, backup and restore of a tenant's network topology, expanded lifecycle cloud management support with Red Hat director, and more.

Support for S3 storage targets

Since the introduction of Amazon S3, object storage has quickly become the storage of choice for cloud platforms. It offers highly reliable, massively scalable storage on inexpensive hardware and is widely used for archival, backup and disaster recovery, web hosting, documentation and other use cases. Trilio combines Linux's Filesystem in Userspace (FUSE) with patent-pending processing to optimize data handling on the object store. As a result, Trilio maintains the same functionality with an S3 backup target as with an NFS backup target, including the following (see the example after this list):

  • Incremental forever

  • Snapshot retention policy with automatic retirement

  • Synthetic full, mountable snapshots

  • Efficient restores with minimal staging-area requirements

  • A scalable solution that grows linearly with the number of compute nodes, without adding performance or data-bandwidth bottlenecks
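The FUSE-based data path is internal to Trilio, but before configuring an S3 backup target it can be useful to confirm that the bucket is reachable and writable with the credentials you intend to use. The following minimal sketch uses boto3; the endpoint URL, bucket name and credentials are hypothetical placeholders, not values from this release.

    # Minimal connectivity check for a prospective S3 backup target.
    # All endpoint/bucket/credential values are hypothetical placeholders.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example.com",     # omit for Amazon S3 itself
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )
    bucket = "trilio-backups"                      # hypothetical bucket name

    try:
        s3.head_bucket(Bucket=bucket)              # bucket exists and is reachable
        s3.put_object(Bucket=bucket, Key="connectivity-check", Body=b"ok")
        s3.delete_object(Bucket=bucket, Key="connectivity-check")
        print("S3 target looks usable")
    except ClientError as exc:
        print("S3 target check failed:", exc)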

Backup and restore of tenant’s private network topology

Another milestone in release 3.0 is the ability to protect a tenant's network space. With this, Trilio helps tenants recover their entire network topology, including:

  • Networks

  • Subnets

  • Routers

  • Static Routes

  • Ports

  • Floating IPs

Taking advantage of this additional backup could not be simpler, as tenants have nothing to do: the entire tenant network topology is automatically included in every snapshot of every workload. This ensures the data is there when needed, eliminates the risk of human error in configuring yet another protection aspect, and keeps things simple.

For recovery, tenants can use a point-in-time snapshot from any workload. A new option under Selective Restore restores the network topology. Trilio recreates the entire tenant network topology from scratch, exactly as it was at the time of backup: it defines the private networks with their subnets, recreates the routers, adds the correct interfaces to each router, and adds static routes to the routers where applicable. An important consideration when restoring tenant networks is that the public network they connect to may well have changed; this is always the case in a disaster recovery scenario. For that reason, Trilio stops short of connecting the new private networks to the public one, leaving this last step to the tenant (a sketch of this step follows the notes below). Note:

  • To eliminate conflicts, the tenant’s space must have no networking components defined. The restore will fail if any conflict is found, and the network will be reinstated to what it was prior to the attempted restore.

  • As always, network topology restore is fully available programmatically as well as through the GUI.
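As a rough illustration of that final manual step, the sketch below reattaches a restored tenant router to the external network using the OpenStack SDK. The cloud name, external network name ("public") and router name are hypothetical placeholders; adapt them to your environment.

    # Attach a restored tenant router to the current external (public) network.
    # Names below ("mycloud", "public", "restored-router") are placeholders.
    import openstack

    conn = openstack.connect(cloud="mycloud")          # cloud entry from clouds.yaml

    ext_net = conn.network.find_network("public", ignore_missing=False)
    router = conn.network.find_router("restored-router", ignore_missing=False)

    # Point the restored router's gateway at the current external network.
    conn.network.update_router(
        router,
        external_gateway_info={"network_id": ext_net.id},
    )
    print("Router", router.name, "now uses external network", ext_net.name)

The same step can also be performed from Horizon or the OpenStack CLI.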

New High Availability Cluster Architecture with an easier-than-ever Configurator

Starting with release 3.0, Trilio is deployed using a built-in high availability (HA) cluster architecture that supports either a single node or a three-node cluster. The three-node cluster is the recommended best practice for fault tolerance and load balancing. The deployment is HA-ready even with a single node, allowing expansion to three nodes at a later time. For that reason, Trilio requires an additional IP for the cluster even in a single-node deployment. The cluster IP (also known as the virtual IP, or VIP) is used to manage the HA cluster and to register the Trilio service endpoint in the Keystone service catalog. The Trilio installation and deployment process handles all the necessary software (e.g. HAProxy), so users do not have to manage it on their own.

The TVM nodes cannot be installed as VMs under the same OpenStack cloud being protected. They need to run outside of OpenStack on one or more independent KVM hosts. Ideally, these KVM hosts would be managed as a virtualized infrastructure using oVirt/RHV, virt-manager or other management tools.

Configuration GUI

The centralized deployment feature is accompanied by a new and improved GUI featuring a Grafana-based dashboard, easy-to-view-and-modify configuration details, and Ansible output views with collapsible levels of detail.
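Because the Trilio endpoint is registered in the Keystone catalog against the cluster VIP, a quick post-deployment check is to confirm that the catalog entry really points at that address. The sketch below uses the OpenStack SDK; the service type ("workloads") and the VIP value are assumptions for illustration and may differ in your deployment.

    # Verify that the Trilio service endpoint in the Keystone catalog points
    # at the cluster VIP. Service type and VIP are illustrative assumptions.
    import openstack

    conn = openstack.connect(cloud="mycloud")      # cloud entry from clouds.yaml
    expected_vip = "192.168.1.50"                  # hypothetical cluster VIP

    for service in conn.identity.services():
        if service.type == "workloads":            # assumed Trilio service type
            for endpoint in conn.identity.endpoints(service_id=service.id):
                status = "OK" if expected_vip in endpoint.url else "MISMATCH"
                print(endpoint.interface, endpoint.url, status)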

Expanded lifecycle cloud management support

Red Hat director integration

Red Hat OpenStack Platform (RHOSP) director integration allows customers to deploy Trilio using the same lifecycle management tool they use for the cloud itself. The integration supports both a fresh overcloud deployment and an overcloud that is already deployed. Release 3.0 supports RHOSP long-lived version 10; support for long-lived version 13 is expected to follow soon.

Mirantis distribution with Debian packaging

Mirantis field personnel and customers looking to deploy Trilio 3.0 can now do so through familiar Ubuntu package management tools (APT).

Improved Ansible Automation

The Trilio configuration process has been completely rearchitected around Ansible. Ansible has grown in popularity in recent years as a preferred configuration management tool, and Trilio uses Ansible playbooks extensively to configure the Trilio cluster. Because Ansible modules are inherently idempotent, the Trilio configuration can be run any number of times to change or reconfigure the cluster.
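As a loose illustration of why re-running the configuration is safe, the sketch below shows the check-before-change pattern that idempotent configuration steps follow. It is a conceptual example only, not code taken from Trilio's playbooks; the file path and setting are placeholders.

    # Idempotent "check then apply" pattern, the property Ansible modules provide.
    # Purely illustrative; not taken from Trilio's playbooks.
    import os

    def ensure_line(path, line):
        """Ensure `line` exists in the file at `path`; return True if changed."""
        existing = []
        if os.path.exists(path):
            with open(path) as f:
                existing = f.read().splitlines()
        if line in existing:
            return False            # already in the desired state: do nothing
        with open(path, "a") as f:
            f.write(line + "\n")
        return True                 # change applied exactly once

    # Running this twice changes nothing the second time, which is why the
    # configuration can be re-run safely to change or repair the cluster.
    changed = ensure_line("/tmp/example.conf", "setting = value")
    print("changed" if changed else "already configured")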

Enhancement Requests

Release 3.0 includes the following requested enhancements:

  1. MCP-TV-RQ1: Passive provisioning of Keystone catalog records (eliminate the requirement for Admin privileges when managing endpoints in the Keystone catalog).
     Resolution: While registering API endpoints, TVault now checks whether the respective service and endpoints are already present and does not override them if they are. The requirement for Admin privileges has been eliminated (see the sketch after this list).

  2. MCP-TV-RQ2: APT packaging of Trilio extensions for OpenStack.
     Resolution: Debian packaging is now supported.

  3. MCP-TV-RQ3: REST API endpoint for Trilio Controller service configuration.
     Resolution: Configurator API documentation has been added to the deployment guide.
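As a rough sketch of the passive-provisioning behaviour described for MCP-TV-RQ1, the example below checks for an existing service and endpoint before creating them, using the OpenStack SDK. The service name, service type, URL and region are hypothetical placeholders, not Trilio's actual registration code.

    # Check-before-create pattern for Keystone service/endpoint registration.
    # Service name/type, URL and region below are illustrative assumptions.
    import openstack

    conn = openstack.connect(cloud="mycloud")

    svc_name, svc_type = "TrilioVaultWLM", "workloads"    # assumed values
    url = "https://192.168.1.50:8780/v1/%(tenant_id)s"    # assumed endpoint URL

    # Reuse the service if it is already in the catalog; otherwise create it.
    service = next((s for s in conn.identity.services() if s.type == svc_type), None)
    if service is None:
        service = conn.identity.create_service(name=svc_name, type=svc_type)

    # Only register a public endpoint if one is not already present.
    existing = list(conn.identity.endpoints(service_id=service.id, interface="public"))
    if not existing:
        conn.identity.create_endpoint(
            service_id=service.id,
            interface="public",
            url=url,
            region_id="RegionOne",                        # assumed region
        )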

Deprecated Functionality

  1. Swift target: With the introduction of S3 support, Swift as a backup target has been deprecated and is no longer supported. This is due to multiple performance challenges combined with declining demand for Swift-based systems.
     Alternative: NFS and S3.

Known Issues

This release contains the following known issues which are tracked for a future update.

  • If the wlm-workloads service is stopped on the primary node, a restore gets stuck: it remains in the restoring state, and if the wlm-workloads service is later restarted, the restore fails with the error "Restore did not finish successfully".
    Workaround: Restart the wlm-workloads service on that node.

  • The global job scheduler status fluctuates.
    Workaround:
      1. Change the "global_job_scheduler_override" parameter in workloadmgr.conf to True on all the nodes.
      2. Restart wlm-api on all nodes.

  • If the virtual IP switches over during snapshot creation, the snapshot remains in the executing state.
    Workaround:
      1. Restart RabbitMQ on the secondary nodes.
      2. Restart wlm-workloads on the secondary nodes.

  • Errors "OSError: [Errno 2] No such file or directory" may be observed during snapshot creation with an NFS backup target.
    Workaround:
      1. Append "lookupcache=none" to the "vault_storage_nfs_options" parameter in /etc/tvault-contego/tvault-contego.conf on the OpenStack compute nodes and in /etc/workloadmgr/workloadmgr.conf on the TVM nodes.
      2. Restart the tvault-contego service on all compute nodes and the wlm-api service on all TVM nodes.

  • On some browsers, the Grafana panel of the Configurator asks for security permissions.
    Workaround: Open https://virtualip:3001 in a new tab and add the SSL exception to get the dashboard working.

  • RabbitMQ data replication fails after the primary node goes into standby and reverts back to active mode.
    Workaround:
      1. Restart RabbitMQ on the secondary nodes.
      2. Restart wlm-workloads on the secondary nodes.

  • "Volume type mapping" is missing from Selective Restore when it is opened from the Restore tab in the UI.
    Workaround: The option is visible when Selective Restore is opened from the drop-down list under Project/Backups/Workloads/Snapshots.

  • TVault reconfiguration may fail after deleting an existing TVM node and adding a newly created TVM node. The following error is shown:
      fatal: [TVM_3]: FAILED! => {"changed": false, "msg": "Unable to restart service MySQL: Job for mariadb.service failed because a timeout was exceeded. See \"systemctl status mariadb.service\" and \"journalctl -xe\" for details.\n"}
    Workaround:
      1. Reinitialize the database from the UI.
      2. On all TVault nodes, run: rm /etc/galera_cluster_configured
      3. Reconfigure with valid values.

  • Galera may become inconsistent when reconfiguring without reinitializing the database.
    Workaround:
      1. Delete the file /etc/galera_cluster_configured from all three nodes.
      2. Re-initialize TVault.

  • A snapshot remains in the executing/uploading state if the wlm-workloads service is stopped on the node where the snapshot was scheduled. No error is shown.
    Workaround: Restart the wlm-workloads service on that node.

  • TVault reconfiguration might fail intermittently while configuring the Trilio cluster and cause the cluster to go into an inconsistent state.
    Workaround:
      1. Reinitialize the database from the UI.
      2. On all TVault nodes, run: rm /etc/galera_cluster_configured
      3. Reconfigure with valid values.

  • If a network port goes down on any node in a multi-node setup, the pacemaker service is stopped on that node. When the network port comes back up, the node fails to join the cluster.
    Workaround: Restart the pacemaker service on that node.

  • Network restore does not proceed from the UI if there is no network available on the setup.
    Workaround: Perform the network restore with the CLI.

  • After upgrading to the 3.0 release, email settings are not imported.
    Workaround: Manually configure the email settings.

  • When using Red Hat director with an existing Trilio deployment, the existing deployment must be cleaned up manually before the upgrade.
    Workaround:
      1. Uninstall all old tvault-contego-api, tvault-horizon-api and python-workloadmgclient pip packages from all controller nodes.
      2. Uninstall the tvault-contego extension pip package and clean the /home/tvault directory on all compute nodes.
      3. Make sure the /usr/lib/python2.7/site-packages/ directory does not contain any old egg-info directories for tvault packages on any overcloud node (compute and controller nodes).

  • In the Horizon UI, under the Backups Admin Nodes tab, a node may not be visible.
    Workaround: Log in to that node and restart the wlm-workloads service.
