In-Place Restore

Overview

In-Place Restore allows you to restore an application to a previous known-good state by performing a complete cleanup of existing resources before restoring them from a backup. Unlike a standard restore — which creates resources alongside existing ones — In-Place Restore first removes all current application resources (both metadata and data) and then recreates everything from the backup.

This gives you a clean, consistent application state that exactly matches the backup, with no leftover, drifted, or corrupted resources.

When to Use In-Place Restore

In-Place Restore is designed for scenarios where you need your application returned to an exact previous state:

  • Application drift — Your application has been modified (scaled, reconfigured, patched) and you want to roll it back to a known-good backup.

  • Corruption recovery — Application resources are in a broken or inconsistent state and need to be fully reset.

  • Environment reset — You want to return a development, staging, or test environment to a clean baseline captured in a backup.

  • Failed upgrade rollback — An upgrade went wrong and you want to restore the application exactly as it was before the upgrade.


How It Works

In-Place Restore integrates into Trilio's existing restore workflow. When you enable the cleanupConfig on a Restore CR, the following happens:

Step 1: Validation

Before anything is modified, Trilio runs a validation job that performs the following checks:

  • Ensures the target namespace is not a critical system namespace (see Blocked Namespaces)

  • Validates that data components (PVCs) in the backup are not owned by workloads that were created after the backup was taken (see Data Component Ownership)

  • Performs standard restore validations (dry-run resource checks, transformation validation, storage provisioner compatibility)

If any validation check fails, the restore is rejected and no resources are modified.

Step 2: Data Preparation

Trilio creates the data components (PVCs) in an internal install namespace and copies backup data into them. This step happens before any cleanup, so your existing application remains untouched until data is ready.

Step 3: Cleanup

Trilio determines the appropriate cleanup strategy automatically based on your backup type:

  • Namespace backups → Namespace-level cleanup (removes all resources in the namespace)

  • Application-scoped backups (Helm, Operator, Custom) → Application-level cleanup (removes only the application's resources)

During cleanup, Trilio performs these ordered actions:

1. Webhooks — Removed first so that admission webhooks cannot block subsequent deletions.

2. Workloads and custom resources — Deleted next, stopping controllers that could otherwise recreate resources during cleanup.

3. Infrastructure (atomic cleanup-and-restore) — Infrastructure resources (ServiceAccounts, Secrets, ConfigMaps, Services, RBAC) are deleted and immediately recreated from backup, so they are never left in a broken state.

4. Data resources — PVCs and PVs are cleaned up last.

All deletions use a two-phase approach:

  • First, a graceful deletion is attempted (standard Kubernetes delete).

  • If the resource is not deleted within the configured timeout (default: 60 seconds), Trilio automatically falls back to force deletion by removing finalizers and owner references.

This guarantees that cleanup always completes, even when resources are stuck.

Step 4: Restore

After cleanup, Trilio proceeds with its standard metadata restore process — recreating all application resources from the backup.

Step 5: Post-Restore Operations

Standard Trilio post-restore operations run as usual (unquiesce, add protection, cleanup restore artifacts).


Configuration

In-Place Restore is configured through the cleanupConfig field in the Restore CR spec.

Fields

  • cleanupConfig.enabled (boolean, required) — Set to true to enable In-Place Restore.

  • cleanupConfig.gracefulDeletionTimeoutSeconds (integer, optional, default 60) — Seconds to wait for graceful deletion before falling back to force deletion. Allowed range: 60–1800.

Basic Example
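
A minimal Restore CR with In-Place Restore enabled might look like the sketch below. Only the cleanupConfig fields are defined in this document; the apiVersion, kind, and source fields are illustrative assumptions — match them to your Trilio Restore CR schema.

```yaml
# Sketch only — apiVersion and source fields are assumptions;
# cleanupConfig is the field documented above.
apiVersion: triliovault.trilio.io/v1
kind: Restore
metadata:
  name: inplace-restore-basic
  namespace: my-app            # namespace targeted by the original backup
spec:
  source:
    type: Backup
    backup:
      name: my-app-backup      # the known-good backup to restore from
  cleanupConfig:
    enabled: true              # remove existing resources before restoring
```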

Example with Custom Timeout
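
If your application has long-running finalizers, the graceful-deletion window can be widened. The fragment below is a spec excerpt under the same illustrative assumptions as the basic example:

```yaml
spec:
  cleanupConfig:
    enabled: true
    # Wait up to 5 minutes per resource before force deletion.
    # Allowed range: 60–1800 seconds.
    gracefulDeletionTimeoutSeconds: 300
```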

Example with Transformations

You can combine In-Place Restore with resource transformations. For example, to change the storage class of PVCs during restore:
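
A sketch of that combination follows. The transformComponents structure shown here (transformName, groupVersionKind, jsonPatches) is an assumption for illustration — consult the Trilio transformation reference for the exact schema:

```yaml
# Sketch only — transformComponents field names are assumptions.
spec:
  cleanupConfig:
    enabled: true
  transformComponents:
    custom:
      - transformName: change-storage-class
        resources:
          groupVersionKind:
            group: ""
            version: v1
            kind: PersistentVolumeClaim
        jsonPatches:
          # RFC 6902 JSON Patch applied during restore
          - op: replace
            path: /spec/storageClassName
            value: premium-rwo        # hypothetical target storage class
```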

Cluster-Scoped Restore Example

For cluster-scoped restores (multi-namespace), cleanupConfig can be set at the global level or per-namespace in the RestoreConfig:
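
One way this could look is sketched below; apart from cleanupConfig itself, the ClusterRestore and RestoreConfig field names are assumptions to illustrate global-versus-per-namespace placement:

```yaml
# Sketch only — field names other than cleanupConfig are assumptions.
apiVersion: triliovault.trilio.io/v1
kind: ClusterRestore
metadata:
  name: multi-ns-inplace-restore
spec:
  cleanupConfig:            # global default applied to all namespaces
    enabled: true
  restoreConfig:
    - namespace: staging
      cleanupConfig:        # per-namespace override
        enabled: true
        gracefulDeletionTimeoutSeconds: 600
    - namespace: production
      cleanupConfig:
        enabled: false      # standard restore for this namespace
```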


Cleanup Strategies

Trilio automatically selects the cleanup strategy based on the backup type. You do not need to configure this — it is determined at runtime and recorded in the Restore CR status.

Namespace Cleanup

Used when the backup is a namespace-scoped backup.

Trilio cleans up all resources captured in the backup using a combination of application-specific handlers:

  • Helm releases are uninstalled using native helm uninstall

  • Custom and label-selected resources are cleaned up using categorical ordering

  • All remaining resources are cleaned up in a second pass

This approach ensures comprehensive cleanup even when a namespace contains a mix of Helm, Operator, and Custom applications.

Application Cleanup

Used when the backup is an application-scoped backup (Helm, Operator, or Custom).

Trilio cleans up only the specific application's resources as identified in the backup:

  • Helm — Native helm uninstall, followed by categorical cleanup of any remaining release resources, then data cleanup.

  • Operator — Custom Resources deleted first (concurrently, with retry logic), then application resources, then operator infrastructure, then data.

  • OLM Operator — Same as Operator, but operator infrastructure (Deployments and ServiceAccounts managed by OLM) is preserved to avoid impacting other namespaces.

  • Virtual Machine — VMPool → VM → DataVolume deletion in controller-safe order, then categorical cleanup for remaining resources.

  • Custom (Label/GVK) — VM resources first (if any), then Custom Resources, then categorical cleanup, then data.


Resource Handling

Resources That Are Automatically Excluded from Cleanup

Certain resources are never deleted during cleanup to prevent damage to shared cluster infrastructure:

  • CustomResourceDefinition (CRD) — Deleting a CRD removes all custom resources of that type across the entire cluster.

  • StorageClass — Cluster-scoped infrastructure shared by multiple applications.

  • OperatorGroup — Shared by multiple operators in the same namespace.

  • CatalogSource — Shared OLM infrastructure for operator catalogs.

  • Subscription — OLM-managed; deleting it disrupts operator lifecycle management.

  • ClusterServiceVersion (CSV) — OLM-managed operator definition; removing it impacts all consumers.

  • InstallPlan — OLM-managed; deleting it disrupts reconciliation.

  • Namespace — Deleting the namespace would destroy the entire restore scope.

These resources are automatically skipped during cleanup. If they still exist on the cluster after cleanup, they are also skipped during the restore phase.

Platform Default Services

The kubernetes and openshift default services are always skipped during cleanup. These services are managed by the Kubernetes and OpenShift control planes, and deleting them would disrupt communication to the API server.

Controller-Managed Shared Resources

Some resources are automatically recreated by Kubernetes or external controllers after deletion (for example, the kube-root-ca.crt ConfigMap or the default ServiceAccount). When Trilio deletes and immediately tries to recreate such a resource from backup, the controller may have already recreated it, causing a conflict.

Default behavior: The resource is skipped and a warning is recorded in the Restore CR status.

With patchIfAlreadyExists: true: The existing resource is patched with the backed-up state using a 3-way merge, ensuring it matches the backup while preserving any fields added by the controller.

For more details on patchIfAlreadyExists, see the Restore Flags Guide.


Two-Phase Deletion

Every resource deletion during cleanup follows a two-phase approach:

  1. Graceful deletion: A standard Kubernetes delete request is sent. Trilio waits up to the configured timeout (default: 60 seconds) for the resource to be removed.

  2. Force deletion: If the resource still exists after the timeout, Trilio removes all finalizers and owner references from the resource and deletes it again. This ensures cleanup completes even when resources are stuck due to finalizers, webhook issues, or controller conflicts.
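
Trilio performs the force-deletion fallback automatically, but it is conceptually similar to what you would do by hand for a stuck resource. The commands below are for illustration only (resource and namespace names are placeholders — do not run this against resources Trilio is actively cleaning up):

```shell
# Manual equivalent of force deletion: clearing finalizers
# lets a stuck delete complete.
kubectl patch pvc shared-data -n my-app --type=merge \
  -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc shared-data -n my-app --wait=false
```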

All force deletion actions are recorded as warnings in the Restore CR status (see Monitoring Cleanup Status), giving you full visibility into what happened.

Configuring the Timeout

The gracefulDeletionTimeoutSeconds field controls how long Trilio waits for graceful deletion before escalating to force deletion.

  • Default: 60 seconds

  • Minimum: 60 seconds

  • Maximum: 1800 seconds (30 minutes)

For most applications, the default of 60 seconds is sufficient. Increase this value if your application has resources with long-running finalizers or complex cleanup logic that needs more time.


Important Constraints

Blocked Namespaces

In-Place Restore with cleanup is not allowed in critical system namespaces. If you attempt to create a restore with cleanup enabled in any of these namespaces, the restore will be rejected during validation.

  • Kubernetes — kube-system, kube-public, kube-node-lease

  • Google Kubernetes Engine (GKE) — gke-system, gke-gmp-system, gke-managed-system, istio-system, config-management-system

  • Amazon EKS — amazon-cloudwatch, aws-observability, aws-load-balancer-controller, external-dns, karpenter

  • Azure AKS — gatekeeper-system, azure-arc, azure-monitor, calico-system, tigera-operator

  • OpenShift — openshift, openshift-apiserver, openshift-apiserver-operator, openshift-authentication, openshift-authentication-operator, openshift-cloud-controller-manager, openshift-cloud-network-config-controller, openshift-cluster-node-tuning-operator, openshift-cluster-storage-operator, openshift-config, openshift-config-managed, openshift-console, openshift-console-operator, openshift-controller-manager, openshift-controller-manager-operator, openshift-dns, openshift-dns-operator, openshift-etcd, openshift-image-registry, openshift-ingress, openshift-ingress-operator, openshift-kube-apiserver, openshift-kube-apiserver-operator, openshift-kube-controller-manager, openshift-kube-controller-manager-operator, openshift-kube-scheduler, openshift-kube-scheduler-operator, openshift-machine-api, openshift-machine-config-operator, openshift-monitoring, openshift-multus, openshift-network-diagnostics, openshift-network-operator, openshift-node, openshift-operator-lifecycle-manager, openshift-operators, openshift-ovn-kubernetes, openshift-service-ca, openshift-storage

Data Component Ownership Validation

When cleanup is enabled, Trilio validates the ownership of data components (PVCs) before proceeding. Specifically, for every PVC that exists both in the backup and on the cluster, Trilio checks the owner chain (the workload that mounts or owns the PVC). If any owner in the chain is a workload that was created after the backup was taken (i.e., it is not present in the backup), the restore will fail during validation.

This prevents Trilio from accidentally deleting PVCs that are now used by new workloads not covered by the backup.

Example: You took a backup of your namespace. Later, a new StatefulSet analytics-db was created that mounts an existing PVC shared-data (which is in the backup). If you attempt an In-Place Restore, validation will fail because analytics-db is not in the backup, and deleting shared-data would cause data loss for analytics-db.

Mutual Exclusivity with Resource Exclusions

In-Place Restore cannot be used together with resource exclusion selectors (excludeResources) in the same Restore CR. When cleanup is enabled, the entire backup scope is restored. If you need selective resource restoration, use a standard restore without cleanup.


Monitoring Cleanup Status

After a restore with cleanup completes (or fails), you can inspect the cleanup details in the Restore CR status.

Viewing Cleanup Status
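
Assuming the Restore CR kind is exposed as restore (adjust the resource name and namespace to your environment), the cleanup details can be inspected with kubectl:

```shell
# Full Restore CR, including status.cleanupStatus
kubectl get restore inplace-restore-basic -n my-app -o yaml

# Just the cleanup section
kubectl get restore inplace-restore-basic -n my-app \
  -o jsonpath='{.status.cleanupStatus}'
```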

Status Fields

The cleanupStatus section in the Restore CR status contains:

  • cleanupStrategy — The strategy used: Namespace or Application.

  • forceCleanupWarnings — List of resources that required force deletion, with details.

  • cleanupSummary.totalResources — Total number of resources processed during cleanup.

  • cleanupSummary.gracefulCleanups — Number of resources deleted gracefully.

  • cleanupSummary.forceCleanups — Number of resources that required force deletion.

  • startTime — When the cleanup phase started.

  • completionTime — When the cleanup phase completed.

Example Status
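
A status fragment for such a restore might look like this. The resource names, timestamps, warning reasons, and the exact shape of the forceCleanupWarnings entries are illustrative assumptions; the field names follow the Status Fields list above:

```yaml
status:
  cleanupStatus:
    cleanupStrategy: Namespace
    startTime: "2024-05-01T10:12:03Z"        # illustrative
    completionTime: "2024-05-01T10:14:41Z"   # illustrative
    cleanupSummary:
      totalResources: 25
      gracefulCleanups: 23
      forceCleanups: 2
    forceCleanupWarnings:                    # entry shape is an assumption
      - resource: PersistentVolumeClaim/shared-data
        reason: graceful deletion timed out; finalizers removed and force-deleted
      - resource: Deployment/legacy-worker
        reason: graceful deletion timed out; finalizers removed and force-deleted
```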

In this example, 25 resources were cleaned up. 23 were deleted gracefully, and 2 required force deletion. The two force-deleted resources are listed with their names, types, and reasons.


Application Type Details

Helm Applications

When cleaning up a Helm application, Trilio first attempts a native helm uninstall to remove the release. If the uninstall fails or some resources remain, Trilio falls back to deleting individual resources in dependency order. Because helm uninstall has already attempted graceful cleanup at the release level, the remaining individual resources are force-deleted directly, without waiting for the graceful timeout.

Data resources (PVCs and PVs) associated with the Helm release are cleaned up separately using the standard two-phase approach.

Operator Applications

Operator cleanup follows a structured five-phase approach:

1. Custom Resources — Deleted first using concurrent processing with retry logic. This prevents the operator controller from recreating resources during cleanup.

2. Application Resources — Resources owned by the Custom Resources are cleaned up in categorical order.

3. Helm Resources — If the operator was deployed via Helm, the Helm release is uninstalled.

4. Operator Resources — Core operator infrastructure (Deployments, ServiceAccounts, etc.) is cleaned up. This step is skipped for OLM operators (see below).

5. Data Resources — PVCs and PVs are cleaned up last.

OLM Operators

For operators managed by the Operator Lifecycle Manager (OLM), Trilio takes a conservative approach:

  • Cleaned: Custom Resources, application resources owned by CRs, and data resources

  • Preserved: OperatorGroup, CatalogSource, Subscription, ClusterServiceVersion, InstallPlan, and operator Deployments/ServiceAccounts

This is because OLM operators often serve multiple namespaces. Deleting operator infrastructure would impact all consumers of that operator, not just the application being restored.

Virtual Machine Applications

Virtual Machine resources are cleaned up in a specific order to prevent controllers from recreating resources:

1. VirtualMachinePool — Deleted first to prevent it from recreating VMs.

2. VirtualMachine — Deleted after the pool.

3. DataVolume — Deleted before PVCs.

Other VM-related resources (InstanceTypes, Preferences, ConfigMaps, Secrets) are handled through the standard categorical cleanup.


Best Practices

1. Take a fresh backup first — If you want the ability to roll forward, take a fresh backup before running In-Place Restore. Once cleanup runs, the current application state is gone.

2. Start with the default timeout — Keep the default of 60 seconds; only increase gracefulDeletionTimeoutSeconds if you know your application has resources with long-running finalizer logic.

3. Check cleanup status — After the restore, inspect the Restore CR and review any forceCleanupWarnings to understand which resources required force deletion and why.

4. Use patchIfAlreadyExists when needed — Set patchIfAlreadyExists: true if your application relies on controller-managed resources (like the default ServiceAccount or the kube-root-ca.crt ConfigMap) and you want them patched to match the backup state rather than left as-is.

5. Avoid system namespaces — Trilio blocks critical namespaces automatically, but you should also avoid running cleanup in namespaces that contain shared infrastructure used by other applications.

6. Be aware of new workloads — If workloads created in the namespace after the backup use PVCs from the backup, the restore will fail validation. Either remove the new workloads first, or take a new backup that includes them.

7. Combine with transformations if needed — In-Place Restore supports transformComponents for modifying resources during restore (for example, changing storage classes or image references).


Frequently Asked Questions

Q: Will In-Place Restore delete resources that were created after the backup?

A: No. Regardless of whether the backup is namespace-scoped or application-scoped, In-Place Restore only cleans up resources that are present in the backup AND currently exist on the cluster. Resources that were created after the backup was taken are not touched during cleanup.

Q: What happens if cleanup fails partway through?

A: Infrastructure resources (ServiceAccounts, Secrets, ConfigMaps, etc.) use an atomic approach — they are deleted and immediately recreated from backup, so they are never left in a deleted state. For other resources, the restore will record the error in its status. Resources that were already deleted will need to be manually restored or a new restore attempt can be made.

Q: Does cleanup delete my CRDs?

A: No. CRDs are always excluded from cleanup because deleting a CRD would remove all custom resources of that type across the entire cluster, potentially impacting other applications.

Q: Can I preview what will be cleaned up before running In-Place Restore?

A: The validation phase checks for potential issues (blocked namespaces, data component ownership) and will reject the restore if problems are found. However, there is no dry-run mode for cleanup itself. You can review the backup contents to understand what resources will be affected.

Q: What is the difference between cleanupOnFailure and cleanupConfig?

A: These are different features:

  • cleanupOnFailure — Cleans up partially restored resources when a restore operation fails, reverting the cluster to its pre-restore state.

  • cleanupConfig — Cleans up existing resources before restoring, enabling a complete application reset to the backup state. This is the In-Place Restore feature.

Q: Does In-Place Restore work with cluster-scoped (multi-namespace) restores?

A: Yes. You can configure cleanupConfig at the global level in ClusterRestore to apply cleanup to all namespaces, or set it per-namespace for granular control.
