Backup Retention Process

This section describes the Trilio for Kubernetes retention process for backup images.

A Kubernetes application may have multiple PVs, but the Trilio retention policy works at the PV level. This document discusses Trilio's implementation of the retention policy. Before we dive into the internals of the retention process, there are a few things to keep in mind about Trilio backup images:

  1. All backup images are QCOW2 images. Full backups are base images and incremental backups are overlay files. Each overlay file has a backing file reference to the previous backup. The number of overlay files depends on the number of incremental backups, and the latest overlay file represents the latest backup.

  2. Trilio supports forever incrementals. Incremental backups are efficient in both network bandwidth usage and data storage. Once a full backup is taken, the user does not need to take another full backup, which improves overall backup process efficiency.

  3. A synthetic full backup is a full backup image created at the backup target by combining one or more backup images. This feature avoids the need to take full backups. Trilio supports synthetic full backups, which improves backup process performance.

PV Retention Policy

This section describes the retention policy per PV. Let's discuss the retention policy in the following scenarios.

Backups to Retain: 3, Forever Incremental

Let's assume that the number of backups to retain is three (3) and the backup policy is set to "forever incremental". Four days of backups look as follows:

Day 1:

Day 2:

Day 3:

Day 4:

Backups to Retain: 3, Full Backups after Every Two Backups

Day 1:

Day 2:

Day 3:

Day 4:

Day 5:

In the two scenarios above, the retention policy is applied on a per-PV basis. For a complex application that has many more PVs, we may get into some interesting scenarios.

Whether to perform a full backup or an incremental backup is first determined by the backup policy chosen when the backup job is created. The first backup is always full, and the type of each subsequent backup depends on the backup policy. If a user chooses a full backup after every three backups, the backup types look like:

F ← I ← I F ← I ← I F ← I ← I

Where F stands for Full and I stands for Incremental.
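The sequence above can be reproduced with a small sketch. The function name and the chain_length parameter are hypothetical and not part of Trilio. The sketch assumes the policy value counts the backups in each chain (one full plus its incrementals), so a value of three yields the F I I pattern shown above.

```python
from typing import List, Optional


def backup_types(count: int, chain_length: Optional[int] = None) -> List[str]:
    """Return 'F' or 'I' for each of `count` consecutive backups of one PV.

    chain_length is a hypothetical parameter: the number of backups in each
    chain (one full plus its incrementals). None means forever incremental.
    """
    if chain_length is not None and chain_length < 1:
        raise ValueError("chain_length must be at least 1")
    types = []
    for index in range(count):
        # The very first backup is always full; later fulls follow the policy.
        starts_chain = index == 0 or (
            chain_length is not None and index % chain_length == 0
        )
        types.append("F" if starts_chain else "I")
    return types


print(" ".join(backup_types(9, chain_length=3)))  # F I I F I I F I I
print(" ".join(backup_types(4)))                  # F I I I
```

With chain_length left as None, only the first backup is full, which corresponds to the forever incremental policy.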

However, if a new PV is added to an application between two backups, the backup for the newly added PV must be a full backup, while the backups for all existing PVs are incremental. The retention policy for each PV still follows the algorithm above.

Whether to perform a full backup or an incremental backup also depends on whether a CSI snapshot exists for the given PV. For a newly added PV, Trilio creates the first CSI snapshot, which ensures a full backup. The same check also applies if a user deletes the Trilio-generated CSI snapshot for a given PV. In this case, the next backup for that PV must be a full backup.
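The per-PV decision described above can be sketched as follows. This is a minimal illustration, not Trilio's implementation: the function name, its parameters, and the use of a plain dictionary to track CSI snapshots are all assumptions made for the example.

```python
from typing import Dict, Optional, Set


def pv_backup_types(
    policy_type: str,
    current_pvs: Set[str],
    csi_snapshots: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Decide 'full' or 'incremental' for each PV in the next backup.

    policy_type is what the backup policy asks for in this run.
    csi_snapshots (hypothetical structure) maps a PV name to the CSI
    snapshot kept from its previous backup; a missing entry or None
    means no such snapshot exists.
    """
    decisions = {}
    for pv in sorted(current_pvs):
        has_snapshot = csi_snapshots.get(pv) is not None
        if policy_type == "full" or not has_snapshot:
            # New PV, deleted CSI snapshot, or a policy-driven full backup.
            decisions[pv] = "full"
        else:
            decisions[pv] = "incremental"
    return decisions


# pv1 has a previous snapshot, pv2's snapshot was deleted, pv3 is new.
snapshots = {"pv1": "snap-pv1-day1", "pv2": None}
print(pv_backup_types("incremental", {"pv1", "pv2", "pv3"}, snapshots))
```

In this example only pv1 is backed up incrementally; pv2 and pv3 fall back to a full backup because no CSI snapshot is available for them.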

Retention Policy Pseudo Code

A backup in this discussion is the application backup that Trilio takes. A PV snapshot is what CSI creates. We also assume that the backup CRD has the following fields. In practice it may have many more, but the retention algorithm relies on these fields.

TrilioBackup:
  name: # name of the backup
  size: # size of the backup in bytes. This includes the backup size of all PVs. Editable by Trilio
  backup_type: # full or incremental. Only editable by Trilio
  pvs: # list of PVs
    pv1: # PV1
      type: # type of backup, full or incremental. Only editable by Trilio
      size: # size of backup. Only editable by Trilio
      backup_location: # location of the backup file
      pv_snapshot: # CSI snapshot of the PV that corresponds to this backup

Assume that the latest backup is day4 and the number of backups to retain is three (3). Iterate through the backups and identify those beyond the three (3) most recent. Backups in an error state are not counted toward the backups to retain. The oldest backup to retain is backup_to_retain. Any backups older than backup_to_retain should be merged with backup_to_retain. Let this list be backups_to_merge, in time-sorted order. Backups in the list are merged with the top of the list.
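The selection step above can be sketched in a few lines. The Backup class, its status values, and the function name are hypothetical. The sketch also assumes that backups_to_merge is ordered newest first, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Backup:
    name: str
    status: str = "available"  # hypothetical values: "available" or "error"


def split_for_retention(
    backups: List[Backup], retain: int
) -> Tuple[Optional[Backup], List[Backup]]:
    """Return (backup_to_retain, backups_to_merge).

    `backups` is ordered oldest to newest. backups_to_merge is returned
    newest first, which is an assumption about the intended order.
    """
    if retain < 1:
        raise ValueError("retain must be at least 1")
    # Errored backups do not count toward the backups to retain.
    good = [b for b in backups if b.status != "error"]
    if len(good) <= retain:
        return None, []  # nothing exceeds the retention count yet
    newest_first = good[::-1]
    return newest_first[retain - 1], newest_first[retain:]


backups = [Backup("day1"), Backup("day2"), Backup("day3"), Backup("day4")]
keep, merge = split_for_retention(backups, retain=3)
print(keep.name, [b.name for b in merge])  # day2 ['day1']
```

With four good backups and a retention count of three, day2 is the oldest backup to retain and day1 is the only backup to merge.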

Assume that there are five (5) backups and the backup retention policy is set to three (3). The following diagram represents backup_to_retain and backups_to_merge.

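# delete_backup becomes True once the full backup of the chain is reached;
# every backup older than that full backup is deleted outright
delete_backup = backup_to_retain is full

backups_to_commit = [backup_to_retain]

# assumes backups_to_merge is ordered from newest to oldest
for backup in backups_to_merge:
    if delete_backup:
        delete backup
    else:
        if backup is full:
            delete_backup = True
        backups_to_commit.append(backup)

# commit each overlay into its backing file, newest first;
# full (base) images have no backing file and are skipped
for backup in backups_to_commit:
    for pv in backup:
        if pv is incremental backup:
            qemu-img commit pv_disk_image

# the consolidated full image takes the place of the retained backup's image
for pv in backup_to_retain:
    for backup in backups_to_commit:
        for pv1 in backup:
            if pv1.id is not pv.id:
                continue
            if pv1 is full backup:
                rename pv1 disk image to pv disk image
                break

mark backup_to_retain as full backup
update the backup size to full backup size

for pv in backup_to_retain:
    mark pv backup image as full
    update pv size to full backup image size

# remove the merged backups, but keep backup_to_retain itself
for backup in backups_to_commit:
    if backup is not backup_to_retain:
        delete backup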
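To make the flow of the pseudo code easier to follow, here is a runnable sketch that only plans the work and executes nothing. The class names, the plan_retention function, and the file paths are hypothetical. The sketch assumes that backups_to_merge is ordered newest first, that only incremental PV images are committed (a base image has no backing file), and that backup_to_retain itself is kept.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PVBackup:
    type: str             # "full" or "incremental"
    backup_location: str  # path of the QCOW2 image


@dataclass
class Backup:
    name: str
    backup_type: str      # "full" or "incremental"
    pvs: Dict[str, PVBackup] = field(default_factory=dict)


def plan_retention(
    backup_to_retain: Backup, backups_to_merge: List[Backup]
) -> Tuple[List[List[str]], List[str]]:
    """Return (commands, names of backups that become obsolete)."""
    chain = [backup_to_retain]
    older_chains = []
    reached_full = backup_to_retain.backup_type == "full"
    for backup in backups_to_merge:  # newest first
        if reached_full:
            older_chains.append(backup)  # belongs to an older chain
        else:
            chain.append(backup)
            reached_full = backup.backup_type == "full"

    commands = []
    # Commit overlays newest first so each one lands in its backing file.
    for backup in chain:
        for pv in backup.pvs.values():
            if pv.type == "incremental":
                commands.append(["qemu-img", "commit", pv.backup_location])
    # Move each consolidated base image to the retained backup's location.
    for pv_name, pv in backup_to_retain.pvs.items():
        for backup in chain:
            older = backup.pvs.get(pv_name)
            if older is not None and older.type == "full":
                if older.backup_location != pv.backup_location:
                    commands.append(
                        ["mv", older.backup_location, pv.backup_location]
                    )
                break
    obsolete = [b.name for b in chain[1:] + older_chains]
    return commands, obsolete


day1 = Backup("day1", "full", {"pv1": PVBackup("full", "day1/pv1.qcow2")})
day2 = Backup("day2", "incremental",
              {"pv1": PVBackup("incremental", "day2/pv1.qcow2")})
day3 = Backup("day3", "incremental",
              {"pv1": PVBackup("incremental", "day3/pv1.qcow2")})
commands, obsolete = plan_retention(day3, [day2, day1])
for command in commands:
    print(" ".join(command))
print(obsolete)
```

For the three-backup chain in the example, the plan commits the day3 and day2 overlays in turn, moves the consolidated day1 image into the day3 location, and reports day2 and day1 as obsolete.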

Delete a Backup

Deleting backups can be a bit tricky. When a backup is deleted, Trilio generally does not try to delete the corresponding images on the backup media. When the retention policy is next applied, the retention algorithm consolidates the backup images.

Consider the following scenario:

The above application has three PVs, which were added at different times: PV1 has four (4) backups, PV2 has three (3) backups, and PV3 has two (2) backups.

Now let's say a user decides to delete the day1 backup. Trilio marks the corresponding backup object as deleted but does not delete the underlying backup image, because deleting the day1 backup image of PV1 would break the chain and make the day2 backup unusable. Instead, the chain is left unchanged during the backup delete operation. The next time the retention algorithm runs, with the retention policy set to four (4) for example, it commits the day2 overlay file into the day1 image and then renames the day1 backup image to day2.

If a user chooses to delete the day4 backup, then Trilio deletes the backup images from the backup media.
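The two delete cases above can be summarized in a short sketch. It assumes that the deciding factor is whether a newer backup in the same chain still depends on the image, which is an interpretation of the day1 and day4 examples rather than a documented rule. The function name and the returned action strings are hypothetical.

```python
from typing import List


def delete_action(chain: List[str], backup: str) -> str:
    """Decide how to handle a delete request for one backup in a PV chain.

    `chain` lists backup names from oldest to newest. Only the newest
    backup has no overlay depending on it, so only its images can be
    removed right away (assumption based on the day1 and day4 examples).
    """
    if backup not in chain:
        raise ValueError(f"{backup} is not part of the chain")
    is_latest = chain.index(backup) == len(chain) - 1
    return "delete_images" if is_latest else "mark_deleted"


chain = ["day1", "day2", "day3", "day4"]
print(delete_action(chain, "day1"))  # mark_deleted
print(delete_action(chain, "day4"))  # delete_images
```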

Errored Backups

If a backup of the application fails, then its backup images should not appear in the backup image chain of any PV. The backup of an application may fail for various reasons and at various points in the backup process. In the above example, one of the PV backups failed to upload its data to the backup media. In that case, none of the PV backup chains should contain any backup images from this backup. Furthermore, the CSI snapshot of each PV from the last known good backup should be preserved, and any CSI snapshots that were created for the current backup job should be cleaned up (deleted). When the next backup is scheduled, incremental backups are generated relative to the PV CSI snapshots of the last known good backup job.

Retention for Immutable Backups

Immutable backups handle retention differently than standard Trilio backups. Since these backups cannot be altered or modified, the retention process surrounding them is unique. Each full backup has a maximum number of incremental backups, with the expiration of any backup corresponding to the expiration of the last backup in the entire chain.

When determining the retention period for a full backup, consider both the applied schedule policy and the maximum number of incremental backups per full backup (MaxIncrBackupsPerFullBackup). By taking into account these factors, you can calculate when the next full backup will occur and the expiration of the previous backup chain. Essentially, the retention period for all backups within the chain will be the same and equal to the expiration of the last incremental backup.
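As a rough illustration of this calculation, the sketch below assumes a fixed interval between scheduled backups and a retention period measured from each backup's creation time. Both are simplifying assumptions, and the function and parameter names are hypothetical; only MaxIncrBackupsPerFullBackup comes from the text above.

```python
from datetime import datetime, timedelta
from typing import Tuple


def immutable_chain_times(
    full_backup_time: datetime,
    schedule_interval: timedelta,
    max_incr_backups_per_full_backup: int,
    retention_period: timedelta,
) -> Tuple[datetime, datetime]:
    """Return (time of the next full backup, expiration of the whole chain).

    Assumes one backup per schedule_interval and that every backup in the
    chain expires together with the last incremental backup.
    """
    last_incremental_time = (
        full_backup_time + max_incr_backups_per_full_backup * schedule_interval
    )
    chain_expiration = last_incremental_time + retention_period
    next_full_time = last_incremental_time + schedule_interval
    return next_full_time, chain_expiration


next_full, expiration = immutable_chain_times(
    full_backup_time=datetime(2024, 1, 1),
    schedule_interval=timedelta(days=1),
    max_incr_backups_per_full_backup=6,
    retention_period=timedelta(days=7),
)
print(next_full)   # 2024-01-08 00:00:00
print(expiration)  # 2024-01-14 00:00:00
```

In this example, a daily schedule with six incrementals per full backup starts a new chain on January 8, and the entire first chain expires on January 14.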

The retention job will not remove backups from the target storage as it would with standard backups; it will only delete the backup Custom Resource (CR) from the cluster once the expiration date is reached. Deletion of the actual backups from the target storage will be managed by the retention period set on the S3 bucket.