A Kubernetes application may have multiple PVs, but the TrilioVault retention policy works at the PV level. This document discusses TrilioVault's implementation of the retention policy. Before we dive into the internals of the retention process, there are a few things to keep in mind about TrilioVault backup images:
All backup images are QCOW2 images. Full backups are base images and incremental backups are overlay files. Each overlay file has a backing-file reference to the previous backup. The number of overlay files depends on the number of incremental backups, with the latest overlay file representing the latest backup.
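To make the chain structure concrete, here is a minimal Python sketch of a base image with two overlays. The `Image` class, the `restore_chain` function, and the file names are hypothetical and exist only for illustration; they are not part of TrilioVault or QEMU.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Image:
    """Toy model of a QCOW2 backup image (not a TrilioVault type)."""
    name: str
    backing: Optional["Image"] = None  # None means a full (base) image


def restore_chain(image: Optional[Image]) -> List[str]:
    """List the files needed to read `image`, newest first."""
    chain = []
    while image is not None:
        chain.append(image.name)
        image = image.backing
    return chain


day1 = Image("day1.qcow2")                # full backup, base image
day2 = Image("day2.qcow2", backing=day1)  # incremental, overlay on day1
day3 = Image("day3.qcow2", backing=day2)  # incremental, overlay on day2

print(restore_chain(day3))  # ['day3.qcow2', 'day2.qcow2', 'day1.qcow2']
```

The point of the sketch is that the latest overlay is only readable if every image behind it is still present, which is why the retention process merges images instead of simply removing old ones.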
TrilioVault supports forever-incremental backups. Incremental backups are efficient in terms of both network bandwidth and data storage. Once a full backup is taken, the user does not need to take another full backup, which improves overall backup efficiency.
A synthetic full backup is a full backup image created at the backup target by combining one or more existing backup images. This avoids the need to take further full backups. TrilioVault supports synthetic full backups, which improves backup performance.
The following section describes the retention policy per PV. Let's discuss the retention policy in the following scenarios.
Let's assume that the number of snapshots to retain is three (3) and the backup policy is set to "forever incremental". Four days of backups look as follows:
In the last two scenarios, the retention policy is implemented on a PV basis. For a complex application that has many more PVs, we may get into some interesting scenarios.
Whether to perform a full backup or an incremental backup is first determined by the backup policy chosen when the backup job is created. The first backup is always full, and the type of each subsequent backup depends on the backup policy. If a user chooses a full backup after every three backups, the backup types look like:
F ← I ← I F ← I ← I F ← I ← I
Here F stands for Full and I stands for Incremental.
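The sequence above can be reproduced with a short sketch. The function name `backup_types` and its `full_every` parameter are hypothetical; `full_every=None` is used here to stand for the "forever incremental" policy.

```python
from typing import List, Optional


def backup_types(count: int, full_every: Optional[int] = None) -> List[str]:
    """Return 'F' or 'I' for each of `count` consecutive backups.

    full_every=None models "forever incremental": only the first backup is full.
    """
    types = []
    for i in range(count):
        if i == 0 or (full_every and i % full_every == 0):
            types.append("F")  # first backup, or the policy calls for a full
        else:
            types.append("I")
    return types


print(" ".join(backup_types(9, full_every=3)))  # F I I F I I F I I
print(" ".join(backup_types(4)))                # F I I I
```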
However, if a new PV is added to an application between two backups, the backup of the newly added PV must be a full backup, while the backups of all existing PVs remain incremental. The retention policy for each PV still follows the algorithm above.
Whether to perform a full backup or an incremental backup also depends on whether a CSI snapshot exists for the given PV. For newly added PVs, TrilioVault creates the first CSI snapshot to ensure a full backup. The same check also covers the case where a user deletes the TrilioVault-generated CSI snapshot for a given PV. In that case the next backup of the PV must be a full backup.
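The per-PV decision described above can be summarized in a few lines. This is only a sketch of the rules as stated in this section; the function and parameter names are hypothetical and do not correspond to TrilioVault code.

```python
def pv_backup_type(policy_type: str,
                   has_previous_backup: bool,
                   has_csi_snapshot: bool) -> str:
    """Decide the backup type for one PV.

    policy_type is what the backup policy asks for on this run
    ("full" or "incremental").
    """
    # A newly added PV has no earlier backup to build on.
    if not has_previous_backup:
        return "full"
    # Without the previous CSI snapshot (for example, the user deleted it),
    # there is nothing to compute an incremental against.
    if not has_csi_snapshot:
        return "full"
    return policy_type


# Existing PV, snapshot present: follow the policy.
print(pv_backup_type("incremental", True, True))    # incremental
# PV added between two backups: forced full.
print(pv_backup_type("incremental", False, False))  # full
# CSI snapshot deleted by the user: forced full.
print(pv_backup_type("incremental", True, False))   # full
```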
In this discussion, a backup is the application backup that TrilioVault takes, whereas a PV snapshot is what CSI produces. We also assume that the backup CRD has the following fields. In practice it may have many more, but the retention algorithm relies on these.
    TrilioBackup:
      name:                  # name of the snapshot
      size:                  # size of the snapshot in bytes. This includes all PVs backup size. Editable by TrilioVault
      backup_type:           # full or incremental. Only editable by TrilioVault
      pvs:                   # list of PVs
        pv1:                 # PV1
          type:              # type of backup, full or incremental. Only editable by TrilioVault
          size:              # size of backup. Only editable by TrilioVault
          backup_location:   # location of the backup file
          pv_snapshot:       # CSI snapshot of the PV that corresponds to this backup
Assume that the latest backup ID is day4 and the number of backups to retain is three (3):
Iterate through the backups and identify those beyond the three (3) to retain. Backups in an error state are not counted toward the backups to retain. The oldest backup to retain is backup_to_retain. Any backups older than backup_to_retain should be merged with backup_to_retain. Let this list be backups_to_merge, in time-sorted order. Backups in the list are merged with the top of the list.
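The selection step can be sketched as follows. The function name, the dictionary keys `name` and `status`, and the status value `"error"` are hypothetical. The sketch also assumes that backups_to_merge is ordered newest first, which the text does not state explicitly, and it simply ignores failed backups.

```python
from typing import List, Optional, Tuple


def split_for_retention(backups: List[dict],
                        retain_count: int) -> Tuple[Optional[dict], List[dict]]:
    """Pick backup_to_retain and backups_to_merge.

    `backups` is ordered oldest first. Failed backups are skipped, so they
    do not count toward `retain_count`.
    """
    good = [b for b in backups if b["status"] != "error"]
    if retain_count < 1 or len(good) <= retain_count:
        return None, []  # nothing exceeds the retention count yet
    backup_to_retain = good[-retain_count]   # oldest backup that is kept
    older = good[:-retain_count]             # everything before it
    backups_to_merge = list(reversed(older)) # assumed order: newest first
    return backup_to_retain, backups_to_merge


days = [{"name": f"day{i}", "status": "ok"} for i in range(1, 5)]
keep, merge = split_for_retention(days, 3)
print(keep["name"], [b["name"] for b in merge])  # day2 ['day1']
```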
Assume that there are five (5) backups and the backup retention policy is set to three (3). The following diagram represents
    # Once a full backup has been seen, every older backup can simply be deleted.
    delete_backup = backup_to_retain is full
    backups_to_commit = [backup_to_retain]
    for backup in backups_to_merge:
        if delete_backup:
            delete backup
        else:
            if backup is full:
                delete_backup = True
            backups_to_commit.append(backup)

    # Merge each overlay into its backing file.
    for backup in backups_to_commit:
        for pv in backup:
            qemu-img commit pv_disk_image

    # The merged data now lives in the full image; give it the retained backup's name.
    for pv in backup_to_retain:
        for backup in backups_to_commit:
            for pv1 in backup:
                if pv1.id is not pv.id:
                    continue
                if pv1 is full backup:
                    rename pv1 disk image to pv disk image
                    break

    mark backup_to_retain as full backup
    update the backup size to full backup size
    for pv in backup_to_retain:
        mark pv backup image as full
        update pv size to full backup image size

    # Remove the backups that were merged; backup_to_retain itself is kept.
    for backup in backups_to_commit:
        if backup is not backup_to_retain:
            delete backup
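The first loop of the pseudocode, which decides which backups are merged and which can be deleted outright, can be written as runnable Python. This sketch only produces a plan and does not touch any images. The function name `plan_merge` and the dictionary keys are hypothetical, and backups_to_merge is assumed to be ordered newest first.

```python
from typing import List, Tuple


def plan_merge(backup_to_retain: dict,
               backups_to_merge: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split older backups into those to commit and those to delete.

    Walking from newest to oldest, backups are collected until a full
    backup is reached; anything older than that full is not needed.
    """
    delete_backup = backup_to_retain["type"] == "full"
    to_commit = [backup_to_retain]
    to_delete = []
    for backup in backups_to_merge:
        if delete_backup:
            to_delete.append(backup)
        else:
            if backup["type"] == "full":
                delete_backup = True  # older backups are no longer needed
            to_commit.append(backup)
    return to_commit, to_delete


# Forever incremental, five backups, retain three: day3 is backup_to_retain.
day1 = {"name": "day1", "type": "full"}
day2 = {"name": "day2", "type": "incremental"}
day3 = {"name": "day3", "type": "incremental"}
commit, delete = plan_merge(day3, [day2, day1])
print([b["name"] for b in commit])  # ['day3', 'day2', 'day1']
print([b["name"] for b in delete])  # []
```

In the forever-incremental case every older backup ends up in the commit list, because the only full image is the oldest one. With periodic full backups, anything older than the nearest full lands in the delete list instead.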
Deleting backups can be a bit tricky. When a backup is deleted, TrilioVault generally does not try to delete the corresponding images on the backup media. When the retention policy is engaged, the retention algorithm consolidates the backup images.
Assume the following scenario:
The above application has three PVs. The PVs were added at different times: PV1 has four (4) backups, PV2 has three (3) backups, and PV3 has two (2) backups.
Now let's say a user decides to delete the day1 backup. We mark the corresponding TrilioVault backup object as deleted but do not delete the underlying backup image, because deleting the day1 image of PV1 would break the chain and make the day2 backup unusable. Instead, we leave the chain unchanged during the backup delete operation. The next time the retention algorithm runs, with the retention policy set to four (4) for example, it commits the day2 overlay file into the day1 image and then renames the day1 image to day2.
If a user chooses to delete the day4 backup, which is the latest in the chain, then TrilioVault deletes its backup images from the backup media.
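The two delete cases above can be sketched as a single check. This generalizes from the day1 and day4 examples and assumes that only the latest backup has no later backup depending on it; the function name and the `deleted` flag are hypothetical.

```python
from typing import List


def request_delete(backups: List[dict], name: str) -> bool:
    """Handle a user delete request.

    `backups` is ordered oldest first. Returns True when the backup images
    can be removed from the backup media right away, False when the backup
    is only marked as deleted and left for the retention algorithm.
    """
    names = [b["name"] for b in backups]
    index = names.index(name)  # raises ValueError for an unknown backup
    if index == len(backups) - 1:
        # Latest backup: no later overlay refers to it.
        del backups[index]
        return True
    # Older backup: later overlays still need its image, so keep the chain.
    backups[index]["deleted"] = True
    return False


chain = [{"name": f"day{i}", "deleted": False} for i in range(1, 5)]
print(request_delete(chain, "day1"))  # False
print(request_delete(chain, "day4"))  # True
```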
If a backup of the application fails, then that backup's images should not appear in the backup image chain of any PV. The backup of an application may fail for various reasons and at various points of the backup process. In the above example, the backup of one of the PVs failed to upload its data to the backup media. In that case, none of the PVs' backup chains should contain any backup images from this backup. Furthermore, the CSI snapshot of each PV from the last known good backup should be preserved, and any CSI snapshots that were created for the current backup job should be cleaned up (deleted). When the next backup is scheduled, incremental backups are generated relative to the PV CSI snapshots of the last known good backup job.
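The cleanup rules for a failed backup can be sketched with an in-memory model. The data layout (`chain`, `snapshot`, `pending_snapshot`) and the function name are hypothetical and only illustrate the rules stated above.

```python
from typing import Dict, List


def rollback_failed_backup(pvs: Dict[str, dict], failed_name: str) -> List[str]:
    """Undo a failed application backup across all of its PVs.

    Each PV entry holds:
      chain            - image names already on the backup media
      snapshot         - CSI snapshot of the last known good backup
      pending_snapshot - CSI snapshot created for the current job, if any
    Returns the CSI snapshots that should be cleaned up.
    """
    snapshots_to_remove = []
    for pv in pvs.values():
        # Drop any image this job managed to upload before failing.
        pv["chain"] = [img for img in pv["chain"] if img != failed_name]
        # Discard the snapshot taken for the failed job.
        pending = pv.pop("pending_snapshot", None)
        if pending is not None:
            snapshots_to_remove.append(pending)
        # pv["snapshot"] is left untouched so that the next incremental
        # backup is taken relative to the last known good backup.
    return snapshots_to_remove


pvs = {
    "pv1": {"chain": ["day1", "day2", "day3"], "snapshot": "pv1-snap-day2",
            "pending_snapshot": "pv1-snap-day3"},  # upload succeeded
    "pv2": {"chain": ["day1", "day2"], "snapshot": "pv2-snap-day2",
            "pending_snapshot": "pv2-snap-day3"},  # upload failed
}
print(rollback_failed_backup(pvs, "day3"))  # ['pv1-snap-day3', 'pv2-snap-day3']
print(pvs["pv1"]["chain"])                  # ['day1', 'day2']
```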