Retention Process
This section describes Trilio for Kubernetes retention process for Backup/Snapshot
A Kubernetes application may have multiple PVs, but Trilio's retention policy works at the PV level. This document describes Trilio's implementation of the retention policy. Before diving into the internals of the retention process, keep a few things in mind about Trilio backup images:
All backup images are QCOW2 images. Full backups are base images, and incremental backups are overlay files. Each overlay file has a backing-file reference to the previous backup. The number of overlay files depends on the number of incremental backups, and the latest overlay file represents the latest backup.
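To make the chain structure concrete, here is a minimal Python model of one PV's backup images. It assumes each image is identified only by a name and an optional backing reference; the class `BackupImage` and the function `chain_for` are hypothetical names for illustration, not Trilio or QEMU APIs. Reading a backup means walking its backing references down to the full base image.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BackupImage:
    """Hypothetical model of one QCOW2 backup image of a single PV."""
    name: str
    backing: Optional["BackupImage"] = None  # None means a full (base) image

    @property
    def is_full(self) -> bool:
        return self.backing is None


def chain_for(image: BackupImage) -> List[str]:
    """Return the images needed to read `image`, from the base up to `image`."""
    names: List[str] = []
    current: Optional[BackupImage] = image
    while current is not None:
        names.append(current.name)
        current = current.backing
    return list(reversed(names))


# Day 1 is a full backup; Days 2 to 4 are forever-incremental overlays.
day1 = BackupImage("day1")
day2 = BackupImage("day2", backing=day1)
day3 = BackupImage("day3", backing=day2)
day4 = BackupImage("day4", backing=day3)

print(chain_for(day4))  # ['day1', 'day2', 'day3', 'day4']
```

Reading the latest backup needs every file in its chain, which is why an older image cannot simply be removed while newer overlays still reference it.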
Trilio supports forever incrementals. Incremental backups are efficient in terms of both network bandwidth and data storage. Once a full backup is taken, the user does not need to take another full backup, which improves the overall efficiency of the backup process.
A synthetic full backup is a full backup image created at the backup target by combining one or more existing backup images. This avoids the need to take new full backups. Trilio supports synthetic full backups, which improves backup performance.
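The idea behind a synthetic full can be sketched by flattening a base image and its overlays in order. This sketch does not touch real QCOW2 files: it assumes a dictionary of block index to block data as a stand-in for image contents and ignores QCOW2 details such as discarded clusters. `synthesize_full` is a hypothetical name.

```python
from typing import Dict, List

Block = Dict[int, bytes]  # stand-in for image contents: block index -> data


def synthesize_full(base: Block, overlays: List[Block]) -> Block:
    """Flatten a base image and its overlays (oldest first) into one full image.

    Each overlay holds only the blocks written since the previous backup, so a
    newer overlay takes precedence over older data for the same block.
    """
    full = dict(base)  # copy so the original base image is left untouched
    for overlay in overlays:
        full.update(overlay)
    return full


base = {0: b"A", 1: b"B", 2: b"C"}   # full backup
incr1 = {1: b"B1"}                   # block 1 changed
incr2 = {2: b"C2", 3: b"D2"}         # block 2 changed, block 3 added

print(synthesize_full(base, [incr1, incr2]))
# {0: b'A', 1: b'B1', 2: b'C2', 3: b'D2'}
```

The result holds the same data as reading the newest overlay through its chain, but as a single self-contained image.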
The following section describes the retention policy per PV. Let's discuss the retention policy in the following scenarios.
Let's assume that the number of backups to retain is three (3) and the backup policy is set to "forever incremental". The backups over four days look as follows:
[Figure: the PV's backup images on Day 1, Day 2, Day 3, and Day 4]
[Figure: the PV's backup images on Day 1, Day 2, Day 3, Day 4, and Day 5]
In the last two scenarios, the retention policy is applied on a per-PV basis. For a complex application with many more PVs, some interesting scenarios can arise.
Whether to perform a full or an incremental backup is first determined by the backup policy chosen when the backup job is created. The first backup is always full, and the type of each subsequent backup depends on the backup policy. If a user chooses a full backup after every three backups, the backup types look like this:
F ← I ← I F ← I ← I F ← I ← I
where F denotes Full and I denotes Incremental.
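The pattern above can be reproduced with a small function. This is a sketch under the assumption that the policy is expressed as "start a new chain every N backups"; the function `backup_types` and its parameter `full_every` are hypothetical and not Trilio policy fields. Passing `full_every=None` models forever incremental.

```python
from typing import List, Optional


def backup_types(count: int, full_every: Optional[int] = None) -> List[str]:
    """Return the type ("F" or "I") of `count` consecutive backups of one PV.

    The first backup is always full. With full_every=None the policy is
    forever incremental; otherwise a new chain starts with a full backup
    every `full_every` backups.
    """
    if full_every is not None and full_every < 1:
        raise ValueError("full_every must be a positive integer or None")
    types: List[str] = []
    for i in range(count):
        if i == 0 or (full_every is not None and i % full_every == 0):
            types.append("F")
        else:
            types.append("I")
    return types


print(" ".join(backup_types(9, full_every=3)))  # F I I F I I F I I
print(" ".join(backup_types(4)))                # F I I I
```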
However, if a new PV is added to an application between two backups, the backup of the newly added PV must be a full backup, while the backups of all existing PVs remain incremental. The retention policy for each PV still follows the above algorithm.
Whether to perform a full or an incremental backup also depends on whether a CSI snapshot exists for the given PV. For a newly added PV, Trilio creates the first CSI snapshot to ensure a full backup. The same check also applies if a user deletes the Trilio-generated CSI snapshot for a given PV: in that case, the next backup of that PV must be a full backup.
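The per-PV decision described above can be sketched as follows. This assumes the set of PVs that still have a Trilio-generated CSI snapshot from the last good backup is already known, and that the backup policy has been reduced to a single "full backup due" flag; `plan_pv_backups` and its parameters are hypothetical.

```python
from typing import Dict, List, Set


def plan_pv_backups(
    pvs: List[str],
    pvs_with_snapshot: Set[str],
    policy_wants_full: bool,
) -> Dict[str, str]:
    """Decide the backup type of each PV for one application backup.

    A PV gets a full backup when the policy calls for one, or when it has no
    CSI snapshot to compare against (a newly added PV, or a snapshot the user
    deleted). Every other PV gets an incremental backup.
    """
    plan: Dict[str, str] = {}
    for pv in pvs:
        if policy_wants_full or pv not in pvs_with_snapshot:
            plan[pv] = "full"
        else:
            plan[pv] = "incremental"
    return plan


# pv3 was added since the last backup, so it has no CSI snapshot yet.
print(plan_pv_backups(["pv1", "pv2", "pv3"], {"pv1", "pv2"}, policy_wants_full=False))
# {'pv1': 'incremental', 'pv2': 'incremental', 'pv3': 'full'}
```

Checking for the snapshot per PV, rather than per application, is what lets existing PVs stay incremental while a new PV starts its own chain.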
In this discussion, a backup is the application backup that Trilio takes, and a PV snapshot is what CSI creates. We also assume that the backup CRD has the following fields. In practice it may have many more, but the retention algorithm relies on these fields.
Assuming that the latest backup ID is day4 and the number of backups to retain is three (3):
Iterate through the backups and identify those beyond the three (3) most recent. Backups in an error state are not counted toward the backups to retain. The oldest backup to retain is backup_to_retain. Any backups older than backup_to_retain should be merged with backup_to_retain. Let this list of older backups, in time-sorted order, be backups_to_merge. Backups in the list are merged with the top of the list.
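The selection step can be sketched in Python. The list of backup CRD fields is not reproduced in this section, so the `Backup` fields below (`name`, `created`, `status`) and the status values `"Available"` and `"Error"` are hypothetical stand-ins, not the actual CRD schema; `select_for_retention` is likewise a hypothetical name.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Backup:
    # Hypothetical fields for illustration; not the actual backup CRD schema.
    name: str
    created: int  # creation time or sequence number; larger means newer
    status: str   # hypothetical values: "Available" or "Error"


def select_for_retention(
    backups: List[Backup], retain: int
) -> Tuple[Optional[Backup], List[Backup]]:
    """Return (backup_to_retain, backups_to_merge) for one PV."""
    if retain < 1:
        raise ValueError("retain must be at least 1")
    # Backups in an error state are not counted toward the backups to retain.
    good = sorted((b for b in backups if b.status != "Error"), key=lambda b: b.created)
    if len(good) <= retain:
        return None, []  # within the limit: nothing to merge
    kept = good[-retain:]
    # The oldest kept backup is backup_to_retain; older ones are merged into it.
    return kept[0], good[:-retain]


history = [
    Backup("day1", 1, "Available"),
    Backup("day2", 2, "Available"),
    Backup("day3", 3, "Error"),
    Backup("day4", 4, "Available"),
    Backup("day5", 5, "Available"),
    Backup("day6", 6, "Available"),
]
to_retain, to_merge = select_for_retention(history, retain=3)
print(to_retain.name, [b.name for b in to_merge])  # day4 ['day1', 'day2']
```

Because the failed day3 backup is skipped, the three retained backups are day4, day5, and day6, not day3 through day5.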
Assume there are five (5) backups and the retention policy is set to three (3). The following diagram shows backup_to_retain and backups_to_merge.
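One way to picture the merge is the commit-then-rename approach this section describes for deleted backups: each overlay up to backup_to_retain is committed into the base image, oldest first, and the result takes the name backup_to_retain. The sketch again uses a dictionary of blocks as a stand-in for QCOW2 images; it is an assumption for illustration, not Trilio's on-disk implementation, and `merge_into_retained` and `read_latest` are hypothetical names.

```python
from typing import Dict, List, Tuple

Block = Dict[int, bytes]          # stand-in for image contents: block index -> data
Chain = List[Tuple[str, Block]]   # one PV's images, oldest (base) to newest


def read_latest(chain: Chain) -> Block:
    """Data seen through the newest image: newer blocks override older ones."""
    data: Block = {}
    for _, image in chain:
        data.update(image)
    return data


def merge_into_retained(chain: Chain, backup_to_retain: str) -> Chain:
    """Consolidate every image older than backup_to_retain into it.

    The overlays up to and including backup_to_retain are committed into the
    base image oldest first, and the result is renamed to backup_to_retain.
    Newer overlays are left untouched.
    """
    cut = [name for name, _ in chain].index(backup_to_retain)
    merged = dict(chain[0][1])           # start from the base (full) image
    for _, overlay in chain[1 : cut + 1]:
        merged.update(overlay)           # commit the overlay into the base
    return [(backup_to_retain, merged)] + chain[cut + 1 :]


# Five backups, retention set to three: backup_to_retain is day3 and
# backups_to_merge is [day1, day2].
chain: Chain = [
    ("day1", {0: b"a", 1: b"b"}),
    ("day2", {1: b"b2"}),
    ("day3", {2: b"c3"}),
    ("day4", {0: b"a4"}),
    ("day5", {3: b"d5"}),
]
merged = merge_into_retained(chain, "day3")
print([name for name, _ in merged])                # ['day3', 'day4', 'day5']
print(read_latest(merged) == read_latest(chain))   # True
```

After the merge, day3 is a self-contained base image, and the latest backup still reads exactly the same data as before.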
Deleting backups can be a bit tricky. When a backup is deleted, Trilio generally does not try to delete the corresponding images from the backup media. Instead, the retention algorithm consolidates the backup images when the retention policy is engaged.
Consider the following scenario:
The above application has three PVs. The PVs were added at different times: PV1 has four (4) backups, PV2 has three (3) backups, and PV3 has two (2) backups.
Now suppose a user decides to delete the Day 1 backup. We mark the corresponding Trilio backup object as deleted but do not delete the underlying backup image, because deleting the Day 1 backup image of PV1 would break the chain and make the Day 2 backup unusable. Instead, we leave the chain unchanged during the backup delete operation. When the retention algorithm next runs, for example with the retention policy set to four (4), it commits the Day 2 overlay file into Day 1 and then renames the Day 1 backup to Day 2.
If a user chooses to delete the Day 4 backup, Trilio deletes its backup images from the backup media, because no later backup depends on them.
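The two delete cases can be sketched for a single PV. This generalizes the Day 1 and Day 4 examples into an assumed rule, "only the newest image has nothing layered on it, so only that image is removed immediately", which is an interpretation rather than a documented Trilio rule. `PVChain` and `delete_backup` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PVChain:
    """Hypothetical view of one PV's backup images, oldest (base) to newest."""
    images: List[str]
    marked_deleted: Set[str] = field(default_factory=set)

    def delete_backup(self, name: str) -> str:
        if name not in self.images:
            raise KeyError(name)
        if name == self.images[-1]:
            # Nothing is layered on the newest image, so it can be removed now.
            self.images.pop()
            return "image deleted"
        # Older images back newer overlays: only mark the backup as deleted and
        # leave the image for the next retention run to consolidate.
        self.marked_deleted.add(name)
        return "marked deleted"


pv1 = PVChain(["day1", "day2", "day3", "day4"])
print(pv1.delete_backup("day1"))  # marked deleted
print(pv1.delete_backup("day4"))  # image deleted
print(pv1.images)                 # ['day1', 'day2', 'day3']
```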
If a backup of the application fails, that backup's images should not appear in the backup image chain of any PV. An application backup may fail for various reasons and at various points in the backup process. In the above example, the backup of one PV failed to upload its data to the backup media. In that case, none of the PVs' backup chains should contain any backup images from this backup. Furthermore, the CSI snapshot of each PV from the last known good backup should be preserved, and any CSI snapshots created for the current backup job should be cleaned up (deleted). When the next backup is scheduled, incremental backups are generated relative to the PV CSI snapshots of the last known good backup.
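The all-or-nothing rule above can be sketched as a completion step that runs after every PV upload has finished. The data structures (plain dictionaries keyed by PV name), the snapshot names, and `finish_backup` itself are hypothetical; the sketch only decides which chains to extend and which CSI snapshots to clean up, and does not call any Kubernetes API.

```python
from typing import Dict, List


def finish_backup(
    upload_ok: Dict[str, bool],
    chains: Dict[str, List[str]],
    backup_name: str,
    new_snapshots: Dict[str, str],
    last_good_snapshots: Dict[str, str],
) -> List[str]:
    """Complete one application backup all-or-nothing; return snapshots to clean up.

    upload_ok: PV name -> whether that PV's data reached the backup media.
    chains: PV name -> backup names already linked into that PV's image chain.
    new_snapshots: CSI snapshots created by this backup job, per PV.
    last_good_snapshots: CSI snapshots of the last known good backup, per PV.
    """
    if upload_ok and all(upload_ok.values()):
        for pv in upload_ok:
            chains.setdefault(pv, []).append(backup_name)
        # The next incremental backup is taken relative to these snapshots.
        last_good_snapshots.update(new_snapshots)
        return []
    # Some PV failed: link nothing into any chain, keep the last good
    # snapshots, and clean up every snapshot this job created.
    return list(new_snapshots.values())


chains = {"pv1": ["day1"], "pv2": ["day1"]}
last_good = {"pv1": "snap-pv1-day1", "pv2": "snap-pv2-day1"}
cleanup = finish_backup(
    {"pv1": True, "pv2": False},
    chains,
    "day2",
    {"pv1": "snap-pv1-day2", "pv2": "snap-pv2-day2"},
    last_good,
)
print(cleanup)  # ['snap-pv1-day2', 'snap-pv2-day2']
print(chains)   # {'pv1': ['day1'], 'pv2': ['day1']}
```

Even though pv1 uploaded successfully, its day2 image is not linked into its chain, so every PV's next incremental backup starts from the same last known good snapshot.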
Immutable backups handle retention differently than standard Trilio backups. Because these backups cannot be altered or modified, the retention process for them is unique. Each full backup has a maximum number of incremental backups, and every backup in a chain expires when the last backup in that chain expires.
When determining the retention period for a full backup, consider both the applied schedule policy and the maximum number of incremental backups per full backup (MaxIncrBackupsPerFullBackup). Together, these factors determine when the next full backup will occur and when the previous backup chain expires. In effect, all backups within a chain share the same retention period, which is equal to the expiration of the last incremental backup.
The retention job will not remove backups from the target storage as it would with standard backups; it will only delete the backup Custom Resource (CR) from the cluster once the expiration date is reached. Deletion of the actual backups from the target storage will be managed by the retention period set on the S3 bucket.
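The arithmetic can be sketched as follows. The sketch assumes backups run exactly once per schedule interval, that MaxIncrBackupsPerFullBackup incrementals follow each full backup, and that a backup expires a fixed retention duration after it is taken; the daily schedule, the value 6, and the 30-day retention are example values, and all function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def next_full_at(full_at: datetime, interval: timedelta, max_incr: int) -> datetime:
    """The next full backup follows the last of `max_incr` incrementals."""
    return full_at + interval * (max_incr + 1)


def chain_expiration(
    full_at: datetime, interval: timedelta, max_incr: int, retention: timedelta
) -> datetime:
    """Shared expiration of every backup in one immutable chain."""
    last_incremental = full_at + interval * max_incr
    return last_incremental + retention


def crs_to_delete(expirations: Dict[str, datetime], now: datetime) -> List[str]:
    """Backup CRs whose expiration has been reached.

    Only the CRs are removed; the data on the target storage is left to the
    retention period configured on the S3 bucket.
    """
    return sorted(name for name, expires in expirations.items() if expires <= now)


full = datetime(2024, 1, 1)
daily = timedelta(days=1)
expiry = chain_expiration(full, daily, max_incr=6, retention=timedelta(days=30))
print(next_full_at(full, daily, max_incr=6))  # 2024-01-08 00:00:00
print(expiry)                                 # 2024-02-06 00:00:00

# One full backup plus six incrementals, all sharing the chain's expiration.
chain_crs = {f"backup-{i}": expiry for i in range(7)}
print(crs_to_delete(chain_crs, datetime(2024, 2, 5)))  # []
```

Because the whole chain shares one expiration, the full backup's CR is never removed while an incremental that depends on it is still within retention.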
For Snapshots, T4K maintains the VolumeSnapshots of the PersistentVolumes in the cluster and keeps only the backup of the application resources on the target. Snapshots are also always full snapshots, so the retention process never needs to merge them; T4K simply deletes the Snapshots that are not to be retained.
[Figure: Snapshots on Day 1, Day 2, and Day 3]
Let's say Snapshots start on Monday, and the user wants to retain Monday's Snapshot as the weekly Snapshot plus the latest 2 Snapshots at any point in time.
[Figure: Snapshots retained on Day 1, Day 2, Day 3, and Day 4 with a Monday weekly Snapshot and the latest 2 Snapshots]
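Because Snapshots are always full, retention reduces to choosing which Snapshots to delete. The sketch below identifies Snapshots by date and expresses the policy as a count of latest Snapshots plus a count of weekly (Monday) Snapshots; these parameter names and the start date are assumptions for illustration, not T4K policy fields. Setting `weekly=0` gives the plain "keep the latest N" case.

```python
from datetime import date, timedelta
from typing import List


def snapshots_to_delete(snapshots: List[date], latest: int, weekly: int) -> List[date]:
    """Return the Snapshots that fall outside the retention policy.

    Keeps the `latest` newest Snapshots plus the `weekly` newest Monday
    Snapshots. Snapshots are full, so nothing is merged; the rest are deleted.
    """
    ordered = sorted(snapshots)
    keep = set(ordered[-latest:]) if latest > 0 else set()
    mondays = [d for d in ordered if d.weekday() == 0]  # Monday == 0
    if weekly > 0:
        keep.update(mondays[-weekly:])
    return [d for d in ordered if d not in keep]


# Snapshots start on Monday 2024-01-01; keep one weekly and the latest two.
existing: List[date] = []
for day in range(4):
    existing.append(date(2024, 1, 1) + timedelta(days=day))
    deleted = snapshots_to_delete(existing, latest=2, weekly=1)
    existing = [d for d in existing if d not in deleted]
    print(f"Day {day + 1}:", [d.isoformat() for d in existing])
```

On Day 4 the Tuesday Snapshot is the only one deleted: Monday's Snapshot is kept as the weekly, and Wednesday and Thursday are the latest two.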