Troubleshooting Guide
This troubleshooting guide describes the different phases of a backup and recovery process and which logs to check when troubleshooting issues manually.
Troubleshooting the Trilio for Kubernetes (T4K) application is no different from troubleshooting any other Kubernetes application. Your best friend is kubectl for Kubernetes and oc for OpenShift. The commands are the same for both tools.
Successful Deployment
The following command lists the T4K Pods in a successful deployment. The Control Plane Pod hosts the controllers, including Target, BackupPlan, Backup and Restore. The Executor Pod includes the job controllers that the Backup and Restore controllers create.
$ kubectl get pods -A | grep trilio
trilio-system k8s-triliovault-admission-webhook-59bf44976-bvm4v 1/1 Running 0 25h
trilio-system k8s-triliovault-control-plane-5769c9c965-k2jd6 2/2 Running 0 25h
trilio-system k8s-triliovault-dex-586bcc8f9-td8gq 1/1 Running 0 25h
trilio-system k8s-triliovault-exporter-77dc69f795-l8crn 1/1 Running 0 25h
trilio-system k8s-triliovault-operator-5cbc888d4c-7ddg5 1/1 Running 0 7m48s
trilio-system k8s-triliovault-resource-cleaner-28293630-nzmqz 0/1 Completed 0 16m
trilio-system k8s-triliovault-web-678c48864b-9gnjj 1/1 Running 0 25h
trilio-system k8s-triliovault-web-backend-d4dbddb4f-wsjrz 1/1 Running 0 25h
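The health check above can be narrowed to the Pods that need attention. The following is a minimal sketch, assuming the six-column layout that `kubectl get pods -A` prints in the listing above (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). The function name `t4k_unhealthy_pods` is hypothetical and not part of T4K.

```shell
# Hypothetical helper (not part of T4K): print T4K Pods that need attention.
# Reads `kubectl get pods -A` output on stdin.
t4k_unhealthy_pods() {
  awk '
    /trilio/ {
      split($3, ready, "/")            # READY column, e.g. "2/2"
      status = $4                      # STATUS column
      if (status == "Completed") next  # finished job Pods are expected
      if (status != "Running" || ready[1] != ready[2])
        print $2, $3, status
    }
  '
}

# Usage (requires cluster access):
#   kubectl get pods -A | t4k_unhealthy_pods
```

An empty result means every T4K Pod is either Running with all containers ready, or Completed.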
Also make sure that the other artifacts of the Trilio deployment, such as its Custom Resource Definitions (CRDs), are in good shape.
$ oc get crds | grep trilio
backupplans.triliovault.trilio.io 2023-10-16T07:00:21Z
backups.triliovault.trilio.io 2023-10-16T07:00:21Z
clusterbackupplans.triliovault.trilio.io 2023-10-16T07:00:21Z
clusterbackups.triliovault.trilio.io 2023-10-16T07:00:21Z
clusterrestores.triliovault.trilio.io 2023-10-16T07:00:21Z
consistentsets.triliovault.trilio.io 2023-10-16T07:00:21Z
continuousrestoreplans.triliovault.trilio.io 2023-10-16T07:00:22Z
hooks.triliovault.trilio.io 2023-10-16T07:00:22Z
licenses.triliovault.trilio.io 2023-10-16T07:00:22Z
policies.triliovault.trilio.io 2023-10-16T07:00:22Z
restores.triliovault.trilio.io 2023-10-16T07:00:22Z
targets.triliovault.trilio.io 2023-10-16T07:00:22Z
triliovaultmanagers.triliovault.trilio.io 2023-10-16T06:57:15Z
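The CRD listing can also be compared against an expected set. The sketch below assumes the thirteen CRD names shown in the output above; other T4K versions may ship a different set, so treat the list as an assumption to adjust. The function name `t4k_missing_crds` is hypothetical.

```shell
# Hypothetical helper (not part of T4K): list expected T4K CRDs that are missing.
# Reads `kubectl get crds` (or `oc get crds`) output on stdin.
# The expected names are copied from the listing above and may vary by version.
t4k_missing_crds() {
  expected="backupplans backups clusterbackupplans clusterbackups
            clusterrestores consistentsets continuousrestoreplans hooks
            licenses policies restores targets triliovaultmanagers"
  present=$(awk '{ print $1 }')        # first column holds the CRD name
  for kind in $expected; do
    crd="${kind}.triliovault.trilio.io"
    # -x: whole-line match, -F: literal string (dots are not wildcards)
    printf '%s\n' "$present" | grep -qxF "$crd" || echo "$crd"
  done
}

# Usage (requires cluster access):
#   kubectl get crds | t4k_missing_crds
```

No output means every expected CRD was found.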
Troubleshooting through Logs
It helps to know the phases of backup and restore operations and where to find the logs for each phase.
Broadly, a backup operation has the following phases: MetaSnapshot, HookTargetIdentification, Quiesce, ImageBackup, DataSnapshot, Unquiesce, DataUpload, MetadataUpload, Retention and Cleanup.
Similarly, a restore operation has the following phases: TargetValidation, Validation, PrimitiveMetadataRestore, DataRestore, DataOwnerUpdate, Unquiesce, MetadataRestore, RestoreCleanup, AddProtection, ImageRestore and HookTargetIdentification.
If a backup or restore fails during any of these phases, first make sure that all other T4K and cluster workloads are running properly and that the CSI snapshot controller is working properly.
To troubleshoot a backup or restore issue, start by listing the backups with the following command. See the STATUS column for more details.
master $ kubectl get backup
NAME BACKUPPLAN BACKUP TYPE STATUS DATA SIZE CREATION TIME START TIME END TIME PERCENTAGE COMPLETED BACKUP SCOPE DURATION
demo-backup demo-backupplan Full InProgress 7077888 2023-10-17T06:30:05Z 2023-10-17T06:30:05Z 20 Namespace 1m13s
master $ kubectl get backup
NAME BACKUPPLAN BACKUP TYPE STATUS DATA SIZE CREATION TIME START TIME END TIME PERCENTAGE COMPLETED BACKUP SCOPE DURATION
demo-backup demo-backupplan Full Failed 7077888 2023-10-17T23:00:03Z 2023-10-17T23:00:04Z 2023-10-17T23:02:24Z 31 Namespace 1m13s
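When a namespace holds many backups, the failed ones can be picked out of the listing. This is a minimal sketch, assuming the column order shown above (NAME, BACKUPPLAN, BACKUP TYPE, STATUS) and a single-word backup type. With `-A` an extra NAMESPACE column shifts the fields, so the field numbers would need adjusting. The function name `t4k_failed_backups` is hypothetical.

```shell
# Hypothetical helper (not part of T4K): print names of backups in Failed status.
# Reads `kubectl get backup` output, including its header line, on stdin.
t4k_failed_backups() {
  # NR > 1 skips the header; field 4 is the STATUS value in each data row
  awk 'NR > 1 && $4 == "Failed" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get backup | t4k_failed_backups
```

Each name printed can then be passed to `kubectl describe backup` for details.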
master $ kubectl describe backup demo-backup
Name: demo-backup
Namespace: default
Labels: app.kubernetes.io/managed-by=k8s-triliovault-ui
app.kubernetes.io/name=k8s-triliovault
app.kubernetes.io/part-of=k8s-triliovault
Annotations: triliovault.trilio.io/creator: system:serviceaccount:default:k8s-triliovault
triliovault.trilio.io/instance-id: 3c23759c-c6bc-4431-a948-c8a9b83a8d2a
triliovault.trilio.io/updater:
[{"username":"system:serviceaccount:default:k8s-triliovault","lastUpdatedTimestamp":"2023-10-17T23:00:04.048729325Z"}]
API Version: triliovault.trilio.io/v1
Kind: Backup
Metadata:
Creation Timestamp: 2023-10-17T23:00:03Z
Finalizers:
backup-cleanup-finalizer
Generation: 1
Resource Version: 29232936
UID: 8185000b-9253-40f8-8744-552c62893fc3
Spec:
Backup Plan:
API Version: triliovault.trilio.io/v1
Kind: BackupPlan
Name: demo-backupplan
Namespace: default
Resource Version: 28572742
UID: 984a46f2-d225-4008-8a11-469644a0d837
Type: Full
Status:
Backup Scope: Namespace
Completion Timestamp: 2023-10-17T23:02:24Z
Condition:
Phase: MetaSnapshot
Reason: MetaSnapshot InProgress
Status: InProgress
Timestamp: 2023-10-17T23:00:04Z
Phase: MetaSnapshot
Reason: MetaSnapshot Completed
Status: Completed
Timestamp: 2023-10-17T23:01:17Z
Phase: HookTargetIdentification
Reason: HookTargetIdentification Failed
Status: Failed
Timestamp: 2023-10-17T23:01:17Z
Phase: MetadataUpload
Reason: MetadataUpload InProgress
Status: InProgress
Timestamp: 2023-10-17T23:01:17Z
Phase: MetadataUpload
Reason: MetadataUpload Completed
Status: Completed
Timestamp: 2023-10-17T23:02:24Z
Duration: 2m20s
Encryption Enabled: false
Expiration Timestamp: 2023-10-22T23:00:00Z
Location: 984a46f2-d225-4008-8a11-469644a0d837/8185000b-9253-40f8-8744-552c62893fc3
Metadata Size: 6557696
Percentage Completion: 31
Phase: MetadataUpload
Phase Status: Completed
Size: 6557696
Snapshot:
Custom:
Resources:
Group Version Kind:
Kind: ConfigMap
Version: v1
Objects:
kube-root-ca.crt
Group Version Kind:
Kind: ServiceAccount
Version: v1
Objects:
default
Start Timestamp: 2023-10-17T23:00:04Z
Stats:
Hook Exists: true
Target Info:
Target:
API Version: triliovault.trilio.io/v1
Kind: Target
Name: demo-target
Namespace: default
Resource Version: 28338697
UID: 15a18f3f-9bfb-4c81-9407-738a7cc484ca
Type: NFS
Vendor: Other
Status: Failed
Type: Full
Events: <none>
The phase in which the failure occurred can be found in the Status section of the output of the above command. If the status does not give a clear reason for the failure, check the logs of the pods for the backup or restore phase that failed; these pods are generally in an Error state.
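In long describe output, the failed phase can be hard to spot. The sketch below extracts it, assuming each entry under Condition prints Phase, Reason, Status and Timestamp in that order, as in the output above. The function name `t4k_failed_phases` is hypothetical.

```shell
# Hypothetical helper (not part of T4K): print each failed phase and its timestamp.
# Reads `kubectl describe backup <name>` output on stdin.
t4k_failed_phases() {
  awk '
    # A condition entry starts at "Phase:" and is confirmed by a "Reason:" line.
    $1 == "Phase:"               { phase = $2; has_reason = 0; next }
    $1 == "Reason:" && phase     { has_reason = 1; next }
    # Only a Status line inside a condition entry counts; this skips the
    # top-level "Status: Failed" summary line.
    $1 == "Status:" && has_reason { if ($2 == "Failed") failed = phase; next }
    $1 == "Timestamp:" {
      if (failed != "") print failed, $2
      phase = ""; failed = ""; has_reason = 0
    }
  '
}

# Usage (requires cluster access):
#   kubectl describe backup demo-backup | t4k_failed_phases
```

For the demo-backup output above, this prints `HookTargetIdentification 2023-10-17T23:01:17Z`.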
If there are no such pods in an Error state either, and none of the above steps help, check the T4K control plane logs, which can be collected with the log collector tool described below.
Log collector
Refer to the Log Collection page to collect the logs and send them to the Trilio team for further analysis of the issue.