Troubleshooting Guide

The troubleshooting guide describes the different phases of a backup and recovery process and which logs to check when troubleshooting issues manually.

Troubleshooting the Trilio for Kubernetes (T4K) application is no different from troubleshooting any other Kubernetes application. Your best friends are kubectl for Kubernetes and oc for OpenShift; the commands are the same for both tools.

Successful Deployment

The following command lists the T4K Pods in a successful deployment. The Control Plane Pod hosts the controllers, including Target, BackupPlan, Backup and Restore. The Executor Pod includes the job controllers that the backup and restore controllers create.

$ kubectl get pods -A | grep trilio
trilio-system                                      k8s-triliovault-admission-webhook-59bf44976-bvm4v                                          1/1     Running     0               25h
trilio-system                                      k8s-triliovault-control-plane-5769c9c965-k2jd6                                             2/2     Running     0               25h
trilio-system                                      k8s-triliovault-dex-586bcc8f9-td8gq                                                        1/1     Running     0               25h
trilio-system                                      k8s-triliovault-exporter-77dc69f795-l8crn                                                  1/1     Running     0               25h
trilio-system                                      k8s-triliovault-operator-5cbc888d4c-7ddg5                                                  1/1     Running     0               7m48s
trilio-system                                      k8s-triliovault-resource-cleaner-28293630-nzmqz                                            0/1     Completed   0               16m
trilio-system                                      k8s-triliovault-web-678c48864b-9gnjj                                                       1/1     Running     0               25h
trilio-system                                      k8s-triliovault-web-backend-d4dbddb4f-wsjrz                                                1/1     Running     0               25h

Also make sure that the other artifacts of the T4K deployment, such as its CustomResourceDefinitions (CRDs), are in good shape.

$ oc get crds | grep trilio
backupplans.triliovault.trilio.io                                 2023-10-16T07:00:21Z
backups.triliovault.trilio.io                                     2023-10-16T07:00:21Z
clusterbackupplans.triliovault.trilio.io                          2023-10-16T07:00:21Z
clusterbackups.triliovault.trilio.io                              2023-10-16T07:00:21Z
clusterrestores.triliovault.trilio.io                             2023-10-16T07:00:21Z
consistentsets.triliovault.trilio.io                              2023-10-16T07:00:21Z
continuousrestoreplans.triliovault.trilio.io                      2023-10-16T07:00:22Z
hooks.triliovault.trilio.io                                       2023-10-16T07:00:22Z
licenses.triliovault.trilio.io                                    2023-10-16T07:00:22Z
policies.triliovault.trilio.io                                    2023-10-16T07:00:22Z
restores.triliovault.trilio.io                                    2023-10-16T07:00:22Z
targets.triliovault.trilio.io                                     2023-10-16T07:00:22Z
triliovaultmanagers.triliovault.trilio.io                         2023-10-16T06:57:15Z
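
To check for these CRDs in one step instead of reading the list by eye, a small shell function like the one below could help. It is a sketch, not part of T4K: the function name check_t4k_crds is hypothetical, and the CRD list is copied from the output above, so it may differ between T4K versions. On OpenShift, replace kubectl with oc.

```shell
# Hypothetical helper: report any expected T4K CRD that is missing.
# The CRD list mirrors the output above and may vary by T4K version.
check_t4k_crds() {
  local missing=0 crd
  for crd in backupplans backups clusterbackupplans clusterbackups clusterrestores \
             consistentsets continuousrestoreplans hooks licenses policies \
             restores targets triliovaultmanagers; do
    # "kubectl get crd <name>" exits non-zero when the CRD does not exist.
    if ! kubectl get crd "${crd}.triliovault.trilio.io" >/dev/null 2>&1; then
      echo "missing CRD: ${crd}.triliovault.trilio.io"
      missing=1
    fi
  done
  # Exit status 0 means every CRD in the list was found.
  return "$missing"
}
```

Run check_t4k_crds after sourcing the function; no output and exit status 0 mean all listed CRDs are present.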

Troubleshooting through Logs

It helps to know the different phases of backup and restore operations and where to find the logs for each phase of an operation.

Broadly, a backup operation has the following phases: MetaSnapshot, HookTargetIdentification, Quiesce, ImageBackup, DataSnapshot, Unquiesce, DataUpload, MetadataUpload, Retention and Cleanup.

Similarly, a restore operation has the following phases: TargetValidation, Validation, PrimitiveMetadataRestore, DataRestore, DataOwnerUpdate, Unquiesce, MetadataRestore, RestoreCleanup, AddProtection, ImageRestore and HookTargetIdentification.

If a backup or restore fails during any of these phases, first make sure that all the other T4K workloads and the cluster itself are running properly, and that the CSI snapshot controller is working.
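
A quick way to run these preliminary checks from a shell is sketched below. The function name is hypothetical. It assumes T4K runs in the trilio-system namespace (as in the Pod listing earlier) and that the CSI external snapshot controller Pod has snapshot-controller in its name, which depends on your distribution. It prints T4K Pods whose status is neither Running nor Completed, the snapshot controller Pods, and the VolumeSnapshotClasses.

```shell
# Hypothetical helper: basic health checks before debugging a failed operation.
check_t4k_health() {
  local ns="${1:-trilio-system}"
  # T4K Pods whose STATUS is not Running or Completed (empty output is healthy).
  # Note: a Pod that is Running but not fully Ready (for example 1/2) is not flagged.
  kubectl get pods -n "$ns" --no-headers | grep -v -E 'Running|Completed'
  # CSI snapshot controller Pods; the name pattern is an assumption.
  kubectl get pods -A --no-headers | grep -i snapshot-controller
  # Fails with an error if the snapshot.storage.k8s.io CRDs are not installed.
  kubectl get volumesnapshotclasses
}
```

Pass a namespace argument (for example check_t4k_health my-t4k-namespace) if T4K is installed elsewhere.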

To troubleshoot a backup or restore issue, start by listing the backups (or restores) and check the BACKUP STATUS column for more details.
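
For example, the hypothetical helper below lists the T4K Backup and Restore resources in one namespace. It uses the fully qualified resource names from the CRD list (backups.triliovault.trilio.io and restores.triliovault.trilio.io) so that kubectl get is not confused by other tools that also define a backups resource. The function name is an assumption for illustration; on OpenShift, replace kubectl with oc.

```shell
# Hypothetical helper: list T4K backups and restores in a namespace.
list_t4k_operations() {
  local ns="$1"
  # Fully qualified names avoid clashes with other "backups"/"restores" CRDs.
  kubectl get backups.triliovault.trilio.io -n "$ns"
  kubectl get restores.triliovault.trilio.io -n "$ns"
}
```

Usage: list_t4k_operations my-app-namespace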

The phase in which the failure occurred is recorded in the status of the Backup or Restore resource. If the status does not give a clear reason for the failure, check the logs of the pods for the phase that failed; these pods are generally in an Error state.
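
A minimal sketch of these two steps is shown below. It assumes the job pods for the failed phase run in the same namespace as the Backup resource, and relies on the fact that pods reported as Error by kubectl get pods usually have the phase Failed. The function name is hypothetical; for a restore, substitute restores.triliovault.trilio.io.

```shell
# Hypothetical helper: show a backup's status, then logs of Failed pods.
inspect_failed_backup() {
  local ns="$1" backup="$2" pods pod
  # The Status section shows the phase in which the backup stopped.
  kubectl describe backups.triliovault.trilio.io "$backup" -n "$ns"
  # Pods in the Failed phase (displayed as Error by "kubectl get pods").
  pods="$(kubectl get pods -n "$ns" --field-selector=status.phase=Failed -o name)"
  for pod in $pods; do
    printf '=== logs for %s ===\n' "$pod"
    kubectl logs -n "$ns" "$pod" --all-containers=true
  done
}
```

Usage: inspect_failed_backup my-app-namespace my-backup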

If no pods are in an Error state and none of the above steps help, check the T4K control plane logs, which you can collect using the log collector tool described below.
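
For a quick look before running the log collector, the control plane logs can also be read directly with kubectl. The sketch below assumes the Deployment is named k8s-triliovault-control-plane (inferred from the Pod name in the listing earlier) in the trilio-system namespace; the function name is hypothetical. Because that Pod runs two containers (2/2), it requests logs from all of them.

```shell
# Hypothetical helper: read T4K control plane logs from every container.
t4k_control_plane_logs() {
  local ns="${1:-trilio-system}"
  # --timestamps helps correlate log lines with backup/restore events.
  kubectl logs -n "$ns" deployment/k8s-triliovault-control-plane --all-containers=true --timestamps
}
```

Adding a flag such as --since=1h to the kubectl logs call would limit the output to recent activity.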

Log collector

Refer to the Log Collection page to collect the logs, then send them to the Trilio team for further analysis of the issue.
