Troubleshooting Guide

The troubleshooting guide describes the different phases of a backup and recovery process and which logs to check when troubleshooting issues manually.

Troubleshooting the Trilio for Kubernetes (T4K) application is no different from troubleshooting any other Kubernetes application. Your main tools are kubectl for Kubernetes and oc for OpenShift. The commands are the same for both tools.

Successful Deployment

The following command lists the T4K Pods in a successful deployment. The Control Plane Pod hosts the controllers, including Target, BackupPlan, Backup and Restore. The Executor Pod includes the job controllers that the backup and restore controllers create.

$ kubectl get pods -A | grep trilio
trilio-system                                      k8s-triliovault-admission-webhook-59bf44976-bvm4v                                          1/1     Running     0               25h
trilio-system                                      k8s-triliovault-control-plane-5769c9c965-k2jd6                                             2/2     Running     0               25h
trilio-system                                      k8s-triliovault-dex-586bcc8f9-td8gq                                                        1/1     Running     0               25h
trilio-system                                      k8s-triliovault-exporter-77dc69f795-l8crn                                                  1/1     Running     0               25h
trilio-system                                      k8s-triliovault-operator-5cbc888d4c-7ddg5                                                  1/1     Running     0               7m48s
trilio-system                                      k8s-triliovault-resource-cleaner-28293630-nzmqz                                            0/1     Completed   0               16m
trilio-system                                      k8s-triliovault-web-678c48864b-9gnjj                                                       1/1     Running     0               25h
trilio-system                                      k8s-triliovault-web-backend-d4dbddb4f-wsjrz                                                1/1     Running     0               25h
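
The check above can be scripted so that only problem pods are printed. The following is a minimal sketch, not part of T4K: the function name unhealthy_t4k_pods is hypothetical, and it assumes the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS). It treats Completed pods, such as the resource cleaner job, as healthy.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# T4K pods that are neither Completed nor fully ready and Running.
# Assumed columns: NAMESPACE NAME READY STATUS ...
unhealthy_t4k_pods() {
  awk '
    $2 ~ /triliovault/ {
      if ($4 == "Completed") next          # finished job pods are expected
      split($3, ready, "/")                # READY has the form ready/total
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2, $3, $4
    }'
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | unhealthy_t4k_pods
```

An empty result means every T4K pod in the listing is either Running with all containers ready or Completed.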

Make sure the other artifacts of the Trilio deployment, such as the custom resource definitions (CRDs), are also in good shape.

$ oc get crds | grep trilio
backupplans.triliovault.trilio.io                                 2023-10-16T07:00:21Z
backups.triliovault.trilio.io                                     2023-10-16T07:00:21Z
clusterbackupplans.triliovault.trilio.io                          2023-10-16T07:00:21Z
clusterbackups.triliovault.trilio.io                              2023-10-16T07:00:21Z
clusterrestores.triliovault.trilio.io                             2023-10-16T07:00:21Z
consistentsets.triliovault.trilio.io                              2023-10-16T07:00:21Z
continuousrestoreplans.triliovault.trilio.io                      2023-10-16T07:00:22Z
hooks.triliovault.trilio.io                                       2023-10-16T07:00:22Z
licenses.triliovault.trilio.io                                    2023-10-16T07:00:22Z
policies.triliovault.trilio.io                                    2023-10-16T07:00:22Z
restores.triliovault.trilio.io                                    2023-10-16T07:00:22Z
targets.triliovault.trilio.io                                     2023-10-16T07:00:22Z
triliovaultmanagers.triliovault.trilio.io                         2023-10-16T06:57:15Z
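
To compare a cluster against the listing above, a small script can report which CRDs are missing. This is a sketch under stated assumptions: the function name missing_t4k_crds and the variable EXPECTED_T4K_CRDS are hypothetical, and the expected names are copied from the sample output above, so the list may differ for other T4K versions.

```shell
#!/bin/sh
# CRD short names copied from the sample listing (an assumption for
# other T4K versions, which may ship a different set).
EXPECTED_T4K_CRDS="backupplans backups clusterbackupplans clusterbackups clusterrestores consistentsets continuousrestoreplans hooks licenses policies restores targets triliovaultmanagers"

# Hypothetical helper: reads `kubectl get crds` output on stdin and prints
# every expected T4K CRD that does not appear in the first column.
missing_t4k_crds() {
  present=$(awk '{ print $1 }')
  for name in $EXPECTED_T4K_CRDS; do
    crd="$name.triliovault.trilio.io"
    printf '%s\n' "$present" | grep -qxF "$crd" || echo "$crd"
  done
}

# Usage (requires cluster access):
#   kubectl get crds --no-headers | missing_t4k_crds
```

No output means all of the CRDs from the listing are installed.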

Troubleshooting through Logs

It helps to know the different phases of the backup and restore operations and where to find the logs for each phase.

Broadly, the backup operation has the following phases: MetaSnapshot, HookTargetIdentification, Quiesce, ImageBackup, DataSnapshot, Unquiesce, DataUpload, MetadataUpload, Retention and Cleanup.

Similarly, the restore operation has the following phases: TargetValidation, Validation, PrimitiveMetadataRestore, DataRestore, DataOwnerUpdate, Unquiesce, MetadataRestore, RestoreCleanup, AddProtection, ImageRestore and HookTargetIdentification.

If a backup or restore fails during any of these phases, first make sure that all the other T4K and cluster workloads are running properly and that the CSI snapshot controller is working properly.

To troubleshoot a backup or restore issue, start by displaying the backups with the following commands.

See the STATUS column for the state of each backup, and the kubectl describe output for more details.

master $ kubectl get backup
NAME               BACKUPPLAN            BACKUP TYPE   STATUS       DATA SIZE   CREATION TIME          START TIME             END TIME           PERCENTAGE COMPLETED   BACKUP SCOPE   DURATION
demo-backup        demo-backupplan       Full          InProgress   7077888     2023-10-17T06:30:05Z   2023-10-17T06:30:05Z                      20                     Namespace      1m13s
master $ kubectl get backup
NAME               BACKUPPLAN            BACKUP TYPE   STATUS       DATA SIZE   CREATION TIME          START TIME             END TIME               PERCENTAGE COMPLETED   BACKUP SCOPE   DURATION
demo-backup        demo-backupplan       Full          Failed       7077888     2023-10-17T23:00:03Z   2023-10-17T23:00:04Z   2023-10-17T23:02:24Z   31                     Namespace      1m13s
master $ kubectl describe backup demo-backup
Name:         demo-backup
Namespace:    default
Labels:       app.kubernetes.io/managed-by=k8s-triliovault-ui
              app.kubernetes.io/name=k8s-triliovault
              app.kubernetes.io/part-of=k8s-triliovault
Annotations:  triliovault.trilio.io/creator: system:serviceaccount:default:k8s-triliovault
              triliovault.trilio.io/instance-id: 3c23759c-c6bc-4431-a948-c8a9b83a8d2a
              triliovault.trilio.io/updater:
                [{"username":"system:serviceaccount:default:k8s-triliovault","lastUpdatedTimestamp":"2023-10-17T23:00:04.048729325Z"}]
API Version:  triliovault.trilio.io/v1
Kind:         Backup
Metadata:
  Creation Timestamp:  2023-10-17T23:00:03Z
  Finalizers:
    backup-cleanup-finalizer
  Generation:  1
  Resource Version:        29232936
  UID:                     8185000b-9253-40f8-8744-552c62893fc3
Spec:
  Backup Plan:
    API Version:       triliovault.trilio.io/v1
    Kind:              BackupPlan
    Name:              demo-backupplan
    Namespace:         default
    Resource Version:  28572742
    UID:               984a46f2-d225-4008-8a11-469644a0d837
  Type:                Full
Status:
  Backup Scope:          Namespace
  Completion Timestamp:  2023-10-17T23:02:24Z
  Condition:
    Phase:                MetaSnapshot
    Reason:               MetaSnapshot InProgress
    Status:               InProgress
    Timestamp:            2023-10-17T23:00:04Z
    Phase:                MetaSnapshot
    Reason:               MetaSnapshot Completed
    Status:               Completed
    Timestamp:            2023-10-17T23:01:17Z
    Phase:                HookTargetIdentification
    Reason:               HookTargetIdentification Failed
    Status:               Failed
    Timestamp:            2023-10-17T23:01:17Z
    Phase:                MetadataUpload
    Reason:               MetadataUpload InProgress
    Status:               InProgress
    Timestamp:            2023-10-17T23:01:17Z
    Phase:                MetadataUpload
    Reason:               MetadataUpload Completed
    Status:               Completed
    Timestamp:            2023-10-17T23:02:24Z
  Duration:               2m20s
  Encryption Enabled:     false
  Expiration Timestamp:   2023-10-22T23:00:00Z
  Location:               984a46f2-d225-4008-8a11-469644a0d837/8185000b-9253-40f8-8744-552c62893fc3
  Metadata Size:          6557696
  Percentage Completion:  31
  Phase:                  MetadataUpload
  Phase Status:           Completed
  Size:                   6557696
  Snapshot:
    Custom:
      Resources:
        Group Version Kind:
          Kind:     ConfigMap
          Version:  v1
        Objects:
          kube-root-ca.crt
        Group Version Kind:
          Kind:     ServiceAccount
          Version:  v1
        Objects:
          default
  Start Timestamp:  2023-10-17T23:00:04Z
  Stats:
    Hook Exists:  true
    Target Info:
      Target:
        API Version:       triliovault.trilio.io/v1
        Kind:              Target
        Name:              demo-target
        Namespace:         default
        Resource Version:  28338697
        UID:               15a18f3f-9bfb-4c81-9407-738a7cc484ca
      Type:                NFS
      Vendor:              Other
  Status:                  Failed
  Type:                    Full
Events:                    <none>
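
In a long describe output, the failed phase is easy to overlook. The sketch below pulls the failed entries out of the Condition list; the function name failed_backup_phases is hypothetical, and it assumes the two-space indentation and the Phase, Reason, Status field order shown in the sample output above.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl describe backup <name>` output on stdin
# and prints "<phase>: <reason>" for each Condition entry whose Status is Failed.
# Assumes the layout of the sample output (Condition block indented by two
# spaces, entries by four, with Phase and Reason preceding Status).
failed_backup_phases() {
  awk '
    /^  Condition:/      { in_cond = 1; next }   # start of the condition list
    in_cond && /^  [^ ]/ { in_cond = 0 }         # next two-space field ends it
    in_cond && $1 == "Phase:"  { phase = $2 }
    in_cond && $1 == "Reason:" { reason = $0; sub(/^ *Reason: */, "", reason) }
    in_cond && $1 == "Status:" && $2 == "Failed" { print phase ": " reason }
  '
}

# Usage (requires cluster access):
#   kubectl describe backup demo-backup | failed_backup_phases
```

For the sample output above, this would print the HookTargetIdentification entry, which is the only condition with a Failed status.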

The phase at which the failure occurred can be found in the Status section of the output of the above command. If the status doesn't give a clear reason for the failure, check the logs of the pods for the phase of the backup or restore that failed; these pods are generally in an Error state.
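
Finding the pods in an Error state and fetching their logs can be sketched as follows. This is an illustrative helper, not a T4K tool: the function name error_pod_log_commands is hypothetical, the namespace to inspect is an assumption you supply (the source does not say where the failed pods run), and the script only prints the kubectl logs commands rather than running them.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods -n <ns> --no-headers` output on
# stdin (assumed columns: NAME READY STATUS ...) and prints one `kubectl logs`
# command per pod whose STATUS is Error.
error_pod_log_commands() {
  ns=$1   # namespace the listing was taken from (supplied by the caller)
  awk -v ns="$ns" '
    $3 == "Error" { print "kubectl logs -n " ns " " $1 " --all-containers=true" }
  '
}

# Usage (requires cluster access; "default" is only an example namespace):
#   kubectl get pods -n default --no-headers | error_pod_log_commands default
```

Printing the commands instead of executing them lets you review which pods are selected before collecting any logs.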

If there are no such pods in an Error state either, and none of the above steps help, check the T4K control plane logs, which can be collected using the log collector tool mentioned below.

Log collector

Refer to the Log Collection page to collect the logs, and send them to the Trilio team for further analysis of the issue.