# In-Place Restore

## Overview

In-Place Restore allows you to restore an application to a previous known-good state by performing a **complete cleanup** of existing resources before restoring them from a backup. Unlike a standard restore — which creates resources alongside existing ones — In-Place Restore first removes all current application resources (both metadata and data) and then recreates everything from the backup.

This gives you a clean, consistent application state that exactly matches the backup, with no leftover, drifted, or corrupted resources.

### When to Use In-Place Restore

In-Place Restore is designed for scenarios where you need your application returned to an exact previous state:

* **Application drift** — Your application has been modified (scaled, reconfigured, patched) and you want to roll it back to a known-good backup.
* **Corruption recovery** — Application resources are in a broken or inconsistent state and need to be fully reset.
* **Environment reset** — You want to return a development, staging, or test environment to a clean baseline captured in a backup.
* **Failed upgrade rollback** — An upgrade went wrong and you want to restore the application exactly as it was before the upgrade.

{% hint style="warning" %}
**Downtime expected**: In-Place Restore deletes all existing application resources before restoring them. Your application will be unavailable during the cleanup and restore phases. Plan accordingly and inform stakeholders before running an In-Place Restore on production workloads.
{% endhint %}

***

## How It Works

In-Place Restore integrates into Trilio's existing restore workflow. When you enable the `cleanupConfig` on a Restore CR, the following happens:

### Step 1: Validation

Before anything is modified, Trilio runs a validation job that performs the following checks:

* Ensures the target namespace is not a critical system namespace (see [Blocked Namespaces](#blocked-namespaces))
* Validates that data components (PVCs) in the backup are not owned by workloads that were created after the backup was taken (see [Data Component Ownership](#data-component-ownership-validation))
* Performs standard restore validations (dry-run resource checks, transformation validation, storage provisioner compatibility)

If any validation check fails, the restore is rejected and no resources are modified.

### Step 2: Data Preparation

Trilio creates the data components (PVCs) in an internal install namespace and copies backup data into them. This step happens before any cleanup, so your existing application remains untouched until data is ready.

### Step 3: Cleanup

Trilio determines the appropriate cleanup strategy automatically based on your backup type:

* **Namespace backups** → Namespace-level cleanup (removes all resources in the namespace)
* **Application-scoped backups** (Helm, Operator, Custom) → Application-level cleanup (removes only the application's resources)

During cleanup, Trilio performs these ordered actions:

{% stepper %}
{% step %}

### Webhook removal

Webhooks are removed first to prevent admission webhooks from blocking subsequent deletions.
{% endstep %}

{% step %}

### Workload and custom resource deletion

Workloads and custom resources are deleted next, stopping controllers that could recreate resources.
{% endstep %}

{% step %}

### Atomic infrastructure cleanup-and-restore

Infrastructure resources (ServiceAccounts, Secrets, ConfigMaps, Services, RBAC) are handled with an atomic cleanup-and-restore approach — each resource is deleted and immediately recreated from backup to avoid leaving infrastructure in a broken state.
{% endstep %}

{% step %}

### Data resource cleanup

Data resources (PVCs and PVs) are cleaned up last.
{% endstep %}
{% endstepper %}
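The ordered phases above can be sketched as a simple driver loop. This is a minimal illustration only; the category names, resource shapes, and the `run_cleanup` helper are assumptions for the sketch, not Trilio's actual implementation:

```python
# Illustrative sketch of the ordered cleanup phases described above.
# Category names and resource shapes are assumptions, not a real API.

CLEANUP_ORDER = [
    "webhooks",         # removed first so admission webhooks cannot block deletes
    "workloads",        # stop controllers that could recreate resources
    "custom_resources",
    "infrastructure",   # deleted and immediately recreated from backup (atomic)
    "data",             # PVCs and PVs last
]

def run_cleanup(resources_by_category):
    """Process resources in the documented order; return the action sequence."""
    actions = []
    for category in CLEANUP_ORDER:
        for name in resources_by_category.get(category, []):
            actions.append((category, name))
            if category == "infrastructure":
                # atomic cleanup-and-restore: recreate right after deleting
                actions.append((category, f"{name} (recreated from backup)"))
    return actions

sequence = run_cleanup({
    "webhooks": ["my-app-validating-webhook"],
    "workloads": ["my-app-deployment"],
    "infrastructure": ["my-app-config"],
    "data": ["my-app-pvc"],
})
```

The key property the ordering preserves: nothing that could block or undo a deletion (webhooks, controllers) outlives the resources it guards, and data is always the last thing touched.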

All deletions use a **two-phase approach**:

* First, a graceful deletion is attempted (standard Kubernetes delete).
* If the resource is not deleted within the configured timeout (default: 60 seconds), Trilio automatically falls back to **force deletion** by removing finalizers and owner references.

This guarantees that cleanup always completes, even when resources are stuck.

### Step 4: Restore

After cleanup, Trilio proceeds with its standard metadata restore process — recreating all application resources from the backup.

### Step 5: Post-Restore Operations

Standard Trilio post-restore operations run as usual (unquiesce, add protection, cleanup restore artifacts).

***

## Configuration

In-Place Restore is configured through the `cleanupConfig` field in the Restore CR spec.

### Fields

| Field                                          | Type      | Required | Default | Description                                                                                 |
| ---------------------------------------------- | --------- | -------- | ------- | ------------------------------------------------------------------------------------------- |
| `cleanupConfig.enabled`                        | `boolean` | Yes      | —       | Set to `true` to enable In-Place Restore                                                    |
| `cleanupConfig.gracefulDeletionTimeoutSeconds` | `integer` | No       | `60`    | Seconds to wait for graceful deletion before falling back to force deletion. Range: 60–1800 |

### Basic Example

{% code title="restore.yaml" %}

```yaml
apiVersion: triliovault.trilio.io/v1
kind: Restore
metadata:
  name: inplace-restore
  namespace: my-app
spec:
  source:
    type: Backup
    backup:
      name: my-app-daily-backup
      namespace: my-app
  cleanupConfig:
    enabled: true
```

{% endcode %}

### Example with Custom Timeout

{% code title="restore-custom-timeout.yaml" %}

```yaml
apiVersion: triliovault.trilio.io/v1
kind: Restore
metadata:
  name: inplace-restore-custom-timeout
  namespace: my-app
spec:
  source:
    type: Backup
    backup:
      name: my-app-daily-backup
      namespace: my-app
  cleanupConfig:
    enabled: true
    gracefulDeletionTimeoutSeconds: 300
```

{% endcode %}

### Example with Transformations

You can combine In-Place Restore with resource transformations. For example, to change the storage class of PVCs during restore:

{% code title="restore-with-transform.yaml" %}

```yaml
apiVersion: triliovault.trilio.io/v1
kind: Restore
metadata:
  name: inplace-restore-with-transform
  namespace: my-app
spec:
  source:
    type: Backup
    backup:
      name: my-app-daily-backup
      namespace: my-app
  cleanupConfig:
    enabled: true
  transformComponents:
    custom:
      - transformName: update-storage-class
        resources:
          groupVersionKind:
            group: ""
            version: v1
            kind: PersistentVolumeClaim
        jsonPatches:
          - op: replace
            path: "/spec/storageClassName"
            value: "fast-ssd"
```

{% endcode %}

### Cluster-Scoped Restore Example

For cluster-scoped restores (multi-namespace), `cleanupConfig` can be set at the global level or per-namespace in the `RestoreConfig`:

{% code title="cluster-restore.yaml" %}

```yaml
apiVersion: triliovault.trilio.io/v1
kind: ClusterRestore
metadata:
  name: cluster-inplace-restore
spec:
  source:
    type: Backup
    clusterBackup:
      name: cluster-backup-20250115
      namespace: trilio-system
  restoreConfig:
    cleanupConfig:
      enabled: true
      gracefulDeletionTimeoutSeconds: 120
```

{% endcode %}

***

## Cleanup Strategies

Trilio automatically selects the cleanup strategy based on the backup type. You do not need to configure this — it is determined at runtime and recorded in the Restore CR status.
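The selection rule is small enough to state as code. A minimal sketch of the mapping described in this section (the function name is an assumption):

```python
# Sketch of how the cleanup strategy follows from the backup type,
# per the rules described in this section.

def select_cleanup_strategy(backup_type: str) -> str:
    if backup_type == "Namespace":
        return "Namespace"       # namespace-level cleanup
    if backup_type in ("Helm", "Operator", "Custom"):
        return "Application"     # application-level cleanup
    raise ValueError(f"unknown backup type: {backup_type}")
```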

### Namespace Cleanup

Used when the backup is a **namespace-scoped backup**.

Trilio cleans up **all resources captured in the backup** using a combination of application-specific handlers:

* Helm releases are uninstalled using native `helm uninstall`
* Custom and label-selected resources are cleaned up using categorical ordering
* All remaining resources are cleaned up in a second pass

This approach ensures comprehensive cleanup even when a namespace contains a mix of Helm, Operator, and Custom applications.

### Application Cleanup

Used when the backup is an **application-scoped backup** (Helm, Operator, or Custom).

Trilio cleans up **only the specific application's resources** as identified in the backup:

| Application Type       | Cleanup Approach                                                                                                                             |
| ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| **Helm**               | Native `helm uninstall`, followed by categorical cleanup of any remaining release resources, then data cleanup                               |
| **Operator**           | Custom Resources deleted first (concurrent with retry logic), then application resources, then operator infrastructure, then data            |
| **OLM Operator**       | Same as Operator, but operator infrastructure (Deployments, ServiceAccounts managed by OLM) is preserved to avoid impacting other namespaces |
| **Virtual Machine**    | VMPool → VM → DataVolume deletion in controller-safe order, then categorical cleanup for remaining resources                                 |
| **Custom (Label/GVK)** | VM resources first (if any), then Custom Resources, then categorical cleanup, then data                                                      |

***

## Resource Handling

### Resources That Are Automatically Excluded from Cleanup

Certain resources are **never deleted** during cleanup to prevent damage to shared cluster infrastructure:

| Resource                       | Reason                                                                             |
| ------------------------------ | ---------------------------------------------------------------------------------- |
| CustomResourceDefinition (CRD) | Deleting a CRD removes all custom resources of that type across the entire cluster |
| StorageClass                   | Cluster-scoped infrastructure shared by multiple applications                      |
| OperatorGroup                  | Shared by multiple operators in the same namespace                                 |
| CatalogSource                  | Shared OLM infrastructure for operator catalogs                                    |
| Subscription                   | OLM-managed; deleting it disrupts operator lifecycle management                    |
| ClusterServiceVersion (CSV)    | OLM-managed operator definition; removing it impacts all consumers                 |
| InstallPlan                    | OLM-managed; deleting it disrupts reconciliation                                   |
| Namespace                      | Deleting the namespace would destroy the entire restore scope                      |

These resources are automatically skipped during cleanup. If they still exist on the cluster after cleanup, they are also skipped during the restore phase.

### Platform Default Services

The `kubernetes` and `openshift` default services are **always skipped** during cleanup. These services are managed by the Kubernetes and OpenShift control planes, and deleting them would disrupt communication to the API server.
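The two exclusion rules above (protected kinds and platform default services) amount to simple membership checks. A sketch, with helper names that are assumptions for illustration:

```python
# Sketch of the exclusion checks: kinds that are never deleted, plus the
# platform default services. Helper names are illustrative assumptions.

NEVER_DELETE_KINDS = {
    "CustomResourceDefinition", "StorageClass", "OperatorGroup",
    "CatalogSource", "Subscription", "ClusterServiceVersion",
    "InstallPlan", "Namespace",
}

PLATFORM_DEFAULT_SERVICES = {"kubernetes", "openshift"}

def should_skip_cleanup(kind: str) -> bool:
    """True for resource kinds that must never be deleted during cleanup."""
    return kind in NEVER_DELETE_KINDS

def is_platform_default_service(service_name: str) -> bool:
    """True for control-plane-managed services that are always skipped."""
    return service_name in PLATFORM_DEFAULT_SERVICES
```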

### Controller-Managed Shared Resources

Some resources are automatically recreated by Kubernetes or external controllers after deletion (for example, the `kube-root-ca.crt` ConfigMap or the `default` ServiceAccount). When Trilio deletes and immediately tries to recreate such a resource from backup, the controller may have already recreated it, causing a conflict.

**Default behavior**: The resource is skipped and a warning is recorded in the Restore CR status.

**With `patchIfAlreadyExists: true`**: The existing resource is patched with the backed-up state using a 3-way merge, ensuring it matches the backup while preserving any fields added by the controller.

For more details on `patchIfAlreadyExists`, see the [Restore Flags Guide](https://docs.trilio.io/kubernetes/getting-started/using-trilio/getting-started-with-management-console/index/restoring-backups/restore-flags-guide#patchifalreadyexists).
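The effect of the patch can be illustrated with a toy shallow merge: backed-up fields overwrite live values, while fields only the controller added survive. Real Kubernetes strategic merge patching is considerably more involved (nested objects, lists, patch directives, schemas), so treat this strictly as a sketch:

```python
# Toy illustration of the patchIfAlreadyExists behavior: backup fields win,
# controller-added fields are preserved. NOT the real strategic merge patch.

def patch_if_already_exists(live: dict, backup: dict) -> dict:
    merged = dict(live)      # start from the live (controller-recreated) object
    merged.update(backup)    # backed-up fields overwrite live values
    return merged

live = {"data": "controller-value", "kube-injected": "ca-bundle"}
backup = {"data": "backup-value"}
result = patch_if_already_exists(live, backup)
# "kube-injected" (added by the controller) survives; "data" matches the backup
```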

***

## Two-Phase Deletion

Every resource deletion during cleanup follows a two-phase approach:

1. **Graceful deletion**: A standard Kubernetes delete request is sent. Trilio waits up to the configured timeout (default: 60 seconds) for the resource to be removed.
2. **Force deletion**: If the resource still exists after the timeout, Trilio removes all finalizers and owner references from the resource and deletes it again. This ensures cleanup completes even when resources are stuck due to finalizers, webhook issues, or controller conflicts.

All force deletion actions are recorded as **warnings** in the Restore CR status (see [Monitoring Cleanup Status](#monitoring-cleanup-status)), giving you full visibility into what happened.
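The two-phase escalation can be sketched as a small loop. The client interface here (`delete`, `exists`, `strip_finalizers_and_owner_refs`) is a hypothetical stand-in for a Kubernetes client, not a real API:

```python
# Sketch of two-phase deletion: graceful delete with a timeout, then
# force deletion (strip finalizers / owner refs) as the fallback.
import time

def delete_resource(client, resource, timeout_seconds=60):
    client.delete(resource)                            # phase 1: graceful delete
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if not client.exists(resource):
            return "graceful"
        time.sleep(1)
    client.strip_finalizers_and_owner_refs(resource)   # phase 2: force
    client.delete(resource)
    return "force"                                     # recorded as a warning

class FakeClient:
    """Hypothetical stand-in for a Kubernetes client, for illustration."""
    def __init__(self, stuck=False):
        self.stuck = stuck      # simulates a resource held by a finalizer
        self.forced = False
    def delete(self, resource):
        pass
    def exists(self, resource):
        return self.stuck and not self.forced
    def strip_finalizers_and_owner_refs(self, resource):
        self.forced = True
```

With a well-behaved resource the first phase succeeds; with a stuck one, the timeout expires and the force path runs.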

### Configuring the Timeout

The `gracefulDeletionTimeoutSeconds` field controls how long Trilio waits for graceful deletion before escalating to force deletion.

* **Default**: 60 seconds
* **Minimum**: 60 seconds
* **Maximum**: 1800 seconds (30 minutes)

For most applications, the default of 60 seconds is sufficient. Increase this value if your application has resources with long-running finalizers or complex cleanup logic that needs more time.

***

## Important Constraints

### Blocked Namespaces

In-Place Restore with cleanup is **not allowed** in critical system namespaces. If you attempt to create a restore with cleanup enabled in any of these namespaces, the restore will be rejected during validation.

| Platform                           | Blocked Namespaces                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Kubernetes**                     | `kube-system`, `kube-public`, `kube-node-lease`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| **Google Kubernetes Engine (GKE)** | `gke-system`, `gke-gmp-system`, `gke-managed-system`, `istio-system`, `config-management-system`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| **Amazon EKS**                     | `amazon-cloudwatch`, `aws-observability`, `aws-load-balancer-controller`, `external-dns`, `karpenter`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| **Azure AKS**                      | `gatekeeper-system`, `azure-arc`, `azure-monitor`, `calico-system`, `tigera-operator`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| **OpenShift**                      | `openshift`, `openshift-apiserver`, `openshift-apiserver-operator`, `openshift-authentication`, `openshift-authentication-operator`, `openshift-cloud-controller-manager`, `openshift-cloud-network-config-controller`, `openshift-cluster-node-tuning-operator`, `openshift-cluster-storage-operator`, `openshift-config`, `openshift-config-managed`, `openshift-console`, `openshift-console-operator`, `openshift-controller-manager`, `openshift-controller-manager-operator`, `openshift-dns`, `openshift-dns-operator`, `openshift-etcd`, `openshift-image-registry`, `openshift-ingress`, `openshift-ingress-operator`, `openshift-kube-apiserver`, `openshift-kube-apiserver-operator`, `openshift-kube-controller-manager`, `openshift-kube-controller-manager-operator`, `openshift-kube-scheduler`, `openshift-kube-scheduler-operator`, `openshift-machine-api`, `openshift-machine-config-operator`, `openshift-monitoring`, `openshift-multus`, `openshift-network-diagnostics`, `openshift-network-operator`, `openshift-node`, `openshift-operator-lifecycle-manager`, `openshift-operators`, `openshift-ovn-kubernetes`, `openshift-service-ca`, `openshift-storage` |
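Conceptually, the check is a set lookup against the table above. A sketch using only a subset of the blocked namespaces (the function name is an assumption):

```python
# Sketch of the blocked-namespace validation. Only a subset of the
# namespaces from the table above is listed here, for brevity.

BLOCKED_NAMESPACES = {
    "kube-system", "kube-public", "kube-node-lease",   # Kubernetes
    "gke-system", "istio-system",                      # GKE (subset)
    "openshift-etcd", "openshift-monitoring",          # OpenShift (subset)
}

def namespace_allows_cleanup(namespace: str) -> bool:
    """False if in-place restore with cleanup must be rejected here."""
    return namespace not in BLOCKED_NAMESPACES
```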

### Data Component Ownership Validation

When cleanup is enabled, Trilio validates the ownership of data components (PVCs) before proceeding. Specifically, for every PVC that exists both in the backup and on the cluster, Trilio checks the owner chain (the workload that mounts or owns the PVC). If any owner in the chain is a workload that was created **after** the backup was taken (i.e., it is not present in the backup), the restore will **fail during validation**.

This prevents Trilio from accidentally deleting PVCs that are now used by new workloads not covered by the backup.

Example: You took a backup of your namespace. Later, a new StatefulSet `analytics-db` was created that mounts an existing PVC `shared-data` (which is in the backup). If you attempt an In-Place Restore, validation will fail because `analytics-db` is not in the backup, and deleting `shared-data` would cause data loss for `analytics-db`.
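The check in the example above can be sketched as follows. The data shapes (a set of backed-up workload names, a map from PVC to its current owners) are illustrative assumptions:

```python
# Sketch of the ownership validation: a PVC from the backup fails the check
# if any of its current owner workloads is absent from the backup.

def validate_pvc_ownership(backup_workloads, live_pvc_owners):
    """live_pvc_owners maps PVC name -> set of workloads that own/mount it."""
    errors = []
    for pvc, owners in live_pvc_owners.items():
        new_owners = owners - set(backup_workloads)
        if new_owners:
            errors.append(
                f"PVC {pvc!r} is used by workloads not in the backup: "
                f"{sorted(new_owners)}")
    return errors

# Mirrors the example: analytics-db was created after the backup was taken
errors = validate_pvc_ownership(
    backup_workloads={"web", "db"},
    live_pvc_owners={"shared-data": {"db", "analytics-db"}},
)
```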

### Mutual Exclusivity with Resource Exclusions

In-Place Restore **cannot be used together** with resource exclusion selectors (`excludeResources`) in the same Restore CR. When cleanup is enabled, the entire backup scope is restored. If you need selective resource restoration, use a standard restore without cleanup.

***

## Monitoring Cleanup Status

After a restore with cleanup completes (or fails), you can inspect the cleanup details in the Restore CR status.

### Viewing Cleanup Status

```bash
kubectl get restore <restore-name> -n <namespace> -o yaml
```

### Status Fields

The `cleanupStatus` section in the Restore CR status contains:

| Field                             | Description                                                  |
| --------------------------------- | ------------------------------------------------------------ |
| `cleanupStrategy`                 | The strategy used: `Namespace` or `Application`              |
| `forceCleanupWarnings`            | List of resources that required force deletion, with details |
| `cleanupSummary.totalResources`   | Total number of resources processed during cleanup           |
| `cleanupSummary.gracefulCleanups` | Number of resources deleted gracefully                       |
| `cleanupSummary.forceCleanups`    | Number of resources that required force deletion             |
| `startTime`                       | When the cleanup phase started                               |
| `completionTime`                  | When the cleanup phase completed                             |

### Example Status

{% code title="restore-status.yaml" %}

```yaml
status:
  cleanupStatus:
    cleanupStrategy: Namespace
    forceCleanupWarnings:
    - resourceName: "mysql-deployment"
      resourceType: "apps/v1/Deployment"
      reason: "Resource stuck with custom finalizer after graceful deletion timeout"
      timestamp: "2025-01-15T10:30:15Z"
    - resourceName: "shared-configmap"
      resourceType: "v1/ConfigMap"
      reason: "Shared resource force deleted — will be recreated from backup"
      timestamp: "2025-01-15T10:30:20Z"
    cleanupSummary:
      totalResources: 25
      gracefulCleanups: 23
      forceCleanups: 2
    startTime: "2025-01-15T10:28:00Z"
    completionTime: "2025-01-15T10:35:45Z"
```

{% endcode %}

In this example, 25 resources were cleaned up. 23 were deleted gracefully, and 2 required force deletion. The two force-deleted resources are listed with their names, types, and reasons.
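If you automate post-restore checks, the status fields lend themselves to a quick consistency pass. A sketch working off the example status above, with the status embedded as a plain dict (a real pipeline would read it from `kubectl get restore ... -o json`):

```python
# Sketch: summarize cleanupStatus and sanity-check the counts, using the
# example status from this section as embedded data.

cleanup_status = {
    "cleanupStrategy": "Namespace",
    "forceCleanupWarnings": [
        {"resourceName": "mysql-deployment", "resourceType": "apps/v1/Deployment"},
        {"resourceName": "shared-configmap", "resourceType": "v1/ConfigMap"},
    ],
    "cleanupSummary": {
        "totalResources": 25,
        "gracefulCleanups": 23,
        "forceCleanups": 2,
    },
}

summary = cleanup_status["cleanupSummary"]
# graceful + forced should account for every resource processed
assert summary["gracefulCleanups"] + summary["forceCleanups"] == summary["totalResources"]

forced = [w["resourceName"] for w in cleanup_status["forceCleanupWarnings"]]
```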

***

## Application Type Details

### Helm Applications

When cleaning up a Helm application, Trilio first attempts a native `helm uninstall` to remove the release. If the uninstall fails or some resources remain, Trilio falls back to deleting individual resources in dependency order. Because `helm uninstall` has already attempted graceful cleanup at the release level, these remaining resources are force-deleted directly, without waiting for the graceful timeout.

Data resources (PVCs and PVs) associated with the Helm release are cleaned up separately using the standard two-phase approach.

### Operator Applications

Operator cleanup follows a structured five-phase approach:

{% stepper %}
{% step %}

### Custom Resources

Deleted first using concurrent processing with retry logic. This prevents the operator controller from recreating resources during cleanup.
{% endstep %}

{% step %}

### Application Resources

Resources owned by the Custom Resources are cleaned up in categorical order.
{% endstep %}

{% step %}

### Helm Resources

If the operator was deployed via Helm, the Helm release is uninstalled.
{% endstep %}

{% step %}

### Operator Resources

Core operator infrastructure (Deployments, ServiceAccounts, etc.) is cleaned up. This step is **skipped for OLM operators** (see below).
{% endstep %}

{% step %}

### Data Resources

PVCs and PVs are cleaned up last.
{% endstep %}
{% endstepper %}

### OLM Operators

For operators managed by the Operator Lifecycle Manager (OLM), Trilio takes a conservative approach:

* **Cleaned**: Custom Resources, application resources owned by CRs, and data resources
* **Preserved**: OperatorGroup, CatalogSource, Subscription, ClusterServiceVersion, InstallPlan, and operator Deployments/ServiceAccounts

This is because OLM operators often serve multiple namespaces. Deleting operator infrastructure would impact all consumers of that operator, not just the application being restored.

### Virtual Machine Applications

Virtual Machine resources are cleaned up in a specific order to prevent controllers from recreating resources:

{% stepper %}
{% step %}

### VirtualMachinePool

Deleted first to prevent it from recreating VMs.
{% endstep %}

{% step %}

### VirtualMachine

Deleted after the pool.
{% endstep %}

{% step %}

### DataVolume

Deleted before PVCs.
{% endstep %}
{% endstepper %}

Other VM-related resources (InstanceTypes, Preferences, ConfigMaps, Secrets) are handled through the standard categorical cleanup.

***

## Best Practices

{% stepper %}
{% step %}

### Take a fresh backup first

Always take a fresh backup before running In-Place Restore if you want the ability to roll forward. Once cleanup runs, the current application state is gone.
{% endstep %}

{% step %}

### Start with default timeout

Start with the default timeout (60 seconds). Only increase `gracefulDeletionTimeoutSeconds` if you know your application has resources with long-running finalizer logic.
{% endstep %}

{% step %}

### Check cleanup status

Check the cleanup status after restore by inspecting the Restore CR. Review any `forceCleanupWarnings` to understand which resources required force deletion and why.
{% endstep %}

{% step %}

### Use patchIfAlreadyExists when needed

Use `patchIfAlreadyExists: true` if your application relies on controller-managed resources (like the `default` ServiceAccount or `kube-root-ca.crt` ConfigMap) and you want them patched to match the backup state rather than left as-is.
{% endstep %}

{% step %}

### Avoid system namespaces

Do not use In-Place Restore in system namespaces. Trilio blocks critical namespaces automatically, but you should also avoid running cleanup in namespaces that contain shared infrastructure used by other applications.
{% endstep %}

{% step %}

### Be aware of new workloads

Be aware of new workloads. If new workloads were created in the namespace after the backup was taken and they use PVCs from the backup, the restore will fail validation. Either remove the new workloads first, or take a new backup that includes them.
{% endstep %}

{% step %}

### Combine with transformations if needed

Combine with `transformComponents` when needed. In-Place Restore supports transformations for modifying resources during restore (for example, changing storage classes or image references).
{% endstep %}
{% endstepper %}

***

## Frequently Asked Questions

<details>

<summary>Q: Will In-Place Restore delete resources that were created after the backup?</summary>

A: No. Regardless of whether the backup is namespace-scoped or application-scoped, In-Place Restore **only cleans up resources that are present in the backup AND currently exist on the cluster**. Resources that were created after the backup was taken are not touched during cleanup.

</details>

<details>

<summary>Q: What happens if cleanup fails partway through?</summary>

A: Infrastructure resources (ServiceAccounts, Secrets, ConfigMaps, etc.) use an atomic approach: each is deleted and immediately recreated from backup, so they are never left in a deleted state. For other resources, the restore records the error in its status. Resources that were already deleted must be restored manually, or you can run a new restore attempt.

</details>

<details>

<summary>Q: Does cleanup delete my CRDs?</summary>

A: No. CRDs are always excluded from cleanup because deleting a CRD would remove all custom resources of that type across the entire cluster, potentially impacting other applications.

</details>

<details>

<summary>Q: Can I preview what will be cleaned up before running In-Place Restore?</summary>

A: The validation phase checks for potential issues (blocked namespaces, data component ownership) and will reject the restore if problems are found. However, there is no dry-run mode for cleanup itself. You can review the backup contents to understand what resources will be affected.

</details>

<details>

<summary>Q: What is the difference between <code>cleanupOnFailure</code> and <code>cleanupConfig</code>?</summary>

A: These are different features:

* `cleanupOnFailure` — Cleans up **partially restored** resources when a restore operation **fails**, reverting the cluster to its pre-restore state.
* `cleanupConfig` — Cleans up **existing** resources **before** restoring, enabling a complete application reset to the backup state. This is the In-Place Restore feature.

</details>

<details>

<summary>Q: Does In-Place Restore work with cluster-scoped (multi-namespace) restores?</summary>

A: Yes. You can configure `cleanupConfig` at the global level in `ClusterRestore` to apply cleanup to all namespaces, or set it per-namespace for granular control.

</details>
