This troubleshooting guide describes the different phases of a backup and recovery process and which logs to check when troubleshooting issues manually.
Troubleshooting Guide
Troubleshooting the Trilio for Kubernetes (T4K) application is no different from troubleshooting any other Kubernetes application. Your best friend is kubectl for Kubernetes and oc for OpenShift. The commands are the same for both tools.
Successful Deployment
Listing the Pods in the namespace where T4K is installed shows the T4K Pods in a successful deployment. The Control Plane Pod hosts controllers including Target, BackupPlan, Backup and Restore. The Executor Pod includes the job controllers that the Backup and Restore controllers create.
It helps to know the different phases of backup and restore operations and where to find the logs for each phase.
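As a minimal sketch, the Pods can be listed with a standard kubectl call. The helper name `list_t4k_pods` is hypothetical, and the namespace is an assumption: pass whichever namespace T4K was installed into on your cluster.

```shell
# List the Pods in the namespace that T4K was installed into.
# The namespace is supplied by the caller; it is not fixed by this guide.
list_t4k_pods() {
  kubectl get pods --namespace "${1:?usage: list_t4k_pods <namespace>}"
}
```

On OpenShift, the same arguments work with oc in place of kubectl.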
Broadly, the backup operation has the following phases: MetaSnapshot, HookTargetIdentification, Quiesce, ImageBackup, DataSnapshot, Unquiesce, DataUpload, MetadataUpload, Retention and Cleanup.
Similarly, the restore operation has the following phases: TargetValidation, Validation, PrimitiveMetadataRestore, DataRestore, DataOwnerUpdate, Unquiesce, MetadataRestore, RestoreCleanup, AddProtection, ImageRestore and HookTargetIdentification.
If a backup or restore fails during any of these phases, first make sure that all the other T4K and cluster workloads are running properly and that the CSI snapshot controller is working properly.
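One way to run this health check quickly is to filter the Pod listing for anything that is not Running or Completed. The sketch below is illustrative only: `unhealthy_pods` is a hypothetical helper name, and it relies on the standard `kubectl get pods` table layout with a STATUS column in the header.

```shell
# Read `kubectl get pods` output on stdin and print every row whose STATUS
# is neither Running nor Completed. The STATUS column is located from the
# header, so the filter works with and without a NAMESPACE column.
unhealthy_pods() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "STATUS") col = i; next }
    col && $col != "Running" && $col != "Completed" { print }
  '
}
```

For example, `kubectl get pods --all-namespaces | unhealthy_pods` covers T4K, the CSI snapshot controller and the rest of the cluster in one pass. A Pod that is Running but not Ready is not flagged by this filter, so the READY column is still worth a look.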
To troubleshoot a backup or restore issue, start by listing the backups with the following command.
See the STATUS column for more details.
master $ kubectl get backup
NAME          BACKUPPLAN        BACKUP TYPE   STATUS       DATA SIZE   CREATION TIME          START TIME             END TIME               PERCENTAGE COMPLETED   BACKUP SCOPE   DURATION
demo-backup   demo-backupplan   Full          InProgress   7077888     2023-10-17T06:30:05Z   2023-10-17T06:30:05Z                          20                     Namespace      1m13s
master $ kubectl get backup
NAME          BACKUPPLAN        BACKUP TYPE   STATUS       DATA SIZE   CREATION TIME          START TIME             END TIME               PERCENTAGE COMPLETED   BACKUP SCOPE   DURATION
demo-backup   demo-backupplan   Full          Failed       7077888     2023-10-17T23:00:03Z   2023-10-17T23:00:04Z   2023-10-17T23:02:24Z   31                     Namespace      1m13s
The phase at which the failure occurred can be found in the status reported by the above command. If the status doesn't give a clear reason for the failure, check the logs of the pods for the phase of the backup or restore that failed; these pods are generally in an Error state.
If there are no pods in an Error state either, and none of the above steps help, check the T4K control plane logs, which can be collected using the log collector tool mentioned below.
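The steps above can be sketched as a few small shell helpers. The helper names are hypothetical. `failed_backups` assumes the `kubectl get backup` column layout shown earlier, and `failed_pods` assumes that pods displayed as Error are in the Failed phase, which is the usual case but not guaranteed.

```shell
# Print the names of backups whose STATUS is Failed.
# Reads `kubectl get backup` output on stdin. Assumes the column layout shown
# above: after the header, NAME is field 1 and STATUS is field 4.
failed_backups() {
  awk 'NR > 1 && $4 == "Failed" { print $1 }'
}

# Show the details Kubernetes holds for one backup, including its status.
describe_backup() {
  kubectl describe backup "${1:?usage: describe_backup <backup-name>}"
}

# List pods in the Failed phase in the given namespace (assumption: pods
# that kubectl displays as Error are normally in this phase).
failed_pods() {
  kubectl get pods --namespace "${1:?usage: failed_pods <namespace>}" \
    --field-selector=status.phase=Failed
}
```

A typical sequence is `kubectl get backup | failed_backups`, then `describe_backup demo-backup`, and finally `kubectl logs <pod-name> --all-containers` for each pod that `failed_pods` lists.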
Log collector
Refer to the Log Collection page to collect the logs, then send them to the Trilio team for further analysis of the issue.