Troubleshooting Guide

This troubleshooting guide describes the different phases of the backup and recovery process and which logs to check when troubleshooting issues manually.

Troubleshooting the TrilioVault for Kubernetes (TVK) application is no different from troubleshooting any other Kubernetes application. Your main tools are kubectl for Kubernetes and oc for OpenShift. The commands in this guide work the same way with both tools.

Successful Deployment

The following command lists the TVK Pods in a successful deployment. The Control Plane Pod hosts the controllers, including Target, BackupPlan, Backup and Restore. The Executor Pod includes the job controllers that the backup and restore controllers create.

$ kubectl get pods -A | grep trilio
openshift-operators k8s-triliovault-admission-webhook-68494db64c-drsb4 1/1 Running 0 3d9h
openshift-operators k8s-triliovault-control-plane-cdd4864c4-p2crc 1/1 Running 0 3d9h
openshift-operators k8s-triliovault-exporter-8598b65b56-rjzlf 1/1 Running 0 3d9h
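
The same check can be scripted. The following is a minimal sketch and not part of TVK: the helper name trilio_pods_not_running is hypothetical, and it assumes the default column order of "kubectl get pods -A" (NAMESPACE, NAME, READY, STATUS, ...). It prints every pod whose name contains "trilio" and whose status is anything other than Running.

```shell
# Hypothetical helper: report TVK pods that are not in the Running state.
# Reads "kubectl get pods -A" style output on stdin, where the columns are
# NAMESPACE NAME READY STATUS RESTARTS AGE.
trilio_pods_not_running() {
  awk '
    # Column 2 is the pod name, column 4 is its status.
    $2 ~ /trilio/ && $4 != "Running" {
      print $1 "/" $2 " is " $4
      bad = 1
    }
    # Exit non-zero if at least one pod was reported.
    END { exit (bad ? 1 : 0) }
  '
}
```

For example, "kubectl get pods -A --no-headers | trilio_pods_not_running" prints nothing and returns 0 when every TVK pod is running. The same pipe works with oc.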

Make sure the other artifacts of the TrilioVault deployment, such as its custom resource definitions (CRDs), are in place.

$ oc get crds | grep trilio
backupplans.triliovault.trilio.io 2020-04-30T20:07:38Z
backups.triliovault.trilio.io 2020-04-30T20:07:38Z
hooks.triliovault.trilio.io 2020-04-30T20:07:38Z
policies.triliovault.trilio.io 2020-04-30T20:07:38Z
restores.triliovault.trilio.io 2020-04-30T20:07:38Z
targets.triliovault.trilio.io 2020-04-30T20:07:38Z
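
To compare a cluster against the listing above, a small script can report which of the six CRDs are absent. This is a sketch under stated assumptions: the helper name missing_trilio_crds is hypothetical, and the list of expected CRDs is taken only from the output shown above, so other TVK versions may install a different set.

```shell
# Hypothetical helper: report which of the CRDs listed above are missing.
# Reads "kubectl get crds" (or "oc get crds") output on stdin; the CRD name
# is expected in the first column.
missing_trilio_crds() {
  input=$(cat)
  rc=0
  for kind in backupplans backups hooks policies restores targets; do
    crd="${kind}.triliovault.trilio.io"
    # awk exits 0 only if a line with exactly this CRD name was seen.
    if ! printf '%s\n' "$input" |
        awk -v crd="$crd" '$1 == crd { found = 1 } END { exit !found }'; then
      echo "missing CRD: $crd"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, "oc get crds | missing_trilio_crds" prints one line per missing CRD and returns non-zero if any are missing.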

Backup and Restore Phases

It helps to understand the phases of backup and restore operations and where to find the logs for each phase.

Backup Phases

Snapshot: In this phase, TVK takes a snapshot of the Persistent Volume (PV) using the CSI driver functionality. If the backup fails at this step, check the logs of the TVK pod listed below and verify manually that CSI snapshots work on the cluster.

Upload: In this phase, TVK uploads the data and metadata to the target. TVK creates multiple pods dynamically, depending on the number of PVs associated with the application.

Retention: This is the last phase of the backup process. TVK validates the retention policy and, if the policy triggers purging of older backups, performs a merge operation on the backup.

If the backup fails, the log of the following pod provides more details:

k8s-triliovault-control-plane-xxxxxxxx
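
Because the pod name ends in a generated suffix, it is convenient to look the name up before fetching the log. The following is a minimal sketch: the helper name tvk_control_plane_logs is hypothetical, and the namespace argument is an assumption about where TVK is installed (openshift-operators in the listing earlier in this guide).

```shell
# Hypothetical helper: print recent log lines from the TVK control plane pod.
# $1: namespace where TVK is installed
# $2: number of trailing log lines to show (default 200)
tvk_control_plane_logs() {
  ns=$1
  lines=${2:-200}
  # "-o name" prints entries such as pod/<name>; keep the first match.
  pod=$(kubectl get pods -n "$ns" -o name |
    awk '/k8s-triliovault-control-plane/ && !seen { print; seen = 1 }')
  if [ -z "$pod" ]; then
    echo "no control plane pod found in namespace $ns" >&2
    return 1
  fi
  kubectl logs -n "$ns" "$pod" --tail="$lines"
}
```

For example, "tvk_control_plane_logs openshift-operators 500" shows the last 500 lines. On OpenShift, replacing kubectl with oc in the function should behave the same way.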

Restore Phases

Validation: In this phase, TVK validates the resources in the namespace where the restore operation is specified. If any resource that would be restored already exists there with the same name, the restore fails.

Data Restore: In this phase, TVK creates the PV and then copies the data from the target into the PV, which will be attached to the pods.

Metadata Restore: In this phase, TVK restores all the resources that were backed up. These can be pods, secrets, services, etc.

If the restore fails at any step, the log of the following pod provides more details:

k8s-triliovault-control-plane-xxxxxxxx

Once the restore operation has completed successfully, you can list all components of the application (pods, PVs, and all other resources) to make sure the application is restored.
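
One way to do that listing in a single step is sketched below. The helper name list_restored_resources is hypothetical, and the sketch assumes the application's resources share a label, such as app=k8s-demo-app in the sample output later in this guide. Adjust the resource types to match what the backup contained.

```shell
# Hypothetical helper: list the main resources of a restored application.
# $1: namespace the application was restored into
# $2: label selector shared by the application's resources
list_restored_resources() {
  ns=$1
  selector=$2
  kubectl get pods,pvc,services,deployments -n "$ns" -l "$selector"
  # PersistentVolumes are cluster-scoped, so they are listed separately.
  kubectl get pv
}
```

For example: "list_restored_resources default app=k8s-demo-app".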

Troubleshooting through Logs

To troubleshoot a backup or restore issue, start by displaying the backups with the following commands.

The BACKUP PHASE column shows which phase the operation is currently in.

master $ kubectl get backup
NAME APPLICATION BACKUP TYPE STATUS START TIME BACKUP PHASE PERCENTAGE COMPLETED
demo-full-backup backup-job-k8s-demo-app Full InProgress 2s Snapshot
master $ kubectl get backup
NAME APPLICATION BACKUP TYPE STATUS START TIME BACKUP PHASE PERCENTAGE COMPLETED
demo-full-backup backup-job-k8s-demo-app Full InProgress 39s Upload 30
master $ kubectl describe backup demo-full-backup
Name: demo-full-backup
Namespace:
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"triliovault.trilio.io/v1alpha1","kind":"Backup","metadata":{"annotations":{},"name":"demo-full-backup"},"spec":{"applicatio...
API Version: triliovault.trilio.io/v1alpha1
Kind: Backup
Metadata:
Creation Timestamp: 2020-03-24T19:15:43Z
Generation: 1
Owner References:
API Version: triliovault.trilio.io/v1alpha1
Kind: Application
Name: backup-job-k8s-demo-app
UID: 6b4f60dd-17cd-4413-85fb-e30952c6cf19
Resource Version: 1417
Self Link: /apis/triliovault.trilio.io/v1alpha1/backups/demo-full-backup
UID: 50264019-e235-4054-8510-8c5b4947c0f6
Spec:
Application:
API Version: triliovault.trilio.io/v1alpha1
Kind: Application
Name: backup-job-k8s-demo-app
Resource Version: 1327
UID: 6b4f60dd-17cd-4413-85fb-e30952c6cf19
Schedule Type: Periodic
Type: Full
Status:
Percentage Completion: 30
Phase: Upload
Phase Status: InProgress
Size: 0
Snapshot Content:
Custom:
Component:
Group Version Kind:
Kind: Secret
Version: v1
Metadata:
{"apiVersion":"v1","data":{"password":"dHJpbGlvcGFzcwo="},"kind":"Secret","metadata":{"labels":{"app":"k8s-demo-app","tier":"frontend"},"name":"mysql-pass","namespace":"default"},"type":"Opaque"}
Group Version Kind:
Kind: Service
Version: v1
Metadata:
{"apiVersion":"v1","kind":"Service","metadata":{"labels":{"app":"k8s-demo-app","tier":"frontend"},"name":"k8s-demo-app-frontend","namespace":"default"},"spec":{"ports":[{"name":"web","port":80,"protocol":"TCP","targetPort":80}],"selector":{"app":"k8s-demo-app","tier":"frontend"},"sessionAffinity":"None","type":"ClusterIP"}}
{"apiVersion":"v1","kind":"Service","metadata":{"labels":{"app":"k8s-demo-app","tier":"mysql"},"name":"k8s-demo-app-mysql","namespace":"default"},"spec":{"ports":[{"port":3306,"protocol":"TCP","targetPort":3306}],"selector":{"app":"k8s-demo-app","tier":"mysql"},"sessionAffinity":"None","type":"ClusterIP"}}
Group Version Kind:
Group: apps
Kind: Deployment
Version: v1
Metadata:
{"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{"deployment.kubernetes.io/revision":"1"},"labels":{"app":"k8s-demo-app","tier":"frontend"},"name":"k8s-demo-app-frontend","namespace":"default"},"spec":{"progressDeadlineSeconds":600,"replicas":3,"revisionHistoryLimit":10,"selector":{"matchLabels":{"app":"k8s-demo-app","tier":"frontend"}},"strategy":{"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"},"type":"RollingUpdate"},"template":{"metadata":{"labels":{"app":"k8s-demo-app","tier":"frontend"}},"spec":{"containers":[{"image":"docker.io/trilio/k8s-demo-app:v1","imagePullPolicy":"IfNotPresent","name":"demoapp-frontend","ports":[{"containerPort":80,"protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}],"dnsPolicy":"ClusterFirst","restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"terminationGracePeriodSeconds":30}}}}
{"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{"deployment.kubernetes.io/revision":"1"},"labels":{"app":"k8s-demo-app","tier":"mysql"},"name":"k8s-demo-app-mysql","namespace":"default"},"spec":{"progressDeadlineSeconds":600,"replicas":1,"revisionHistoryLimit":10,"selector":{"matchLabels":{"app":"k8s-demo-app","tier":"mysql"}},"strategy":{"type":"Recreate"},"template":{"metadata":{"labels":{"app":"k8s-demo-app","tier":"mysql"}},"spec":{"containers":[{"env":[{"name":"MYSQL_ROOT_PASSWORD","valueFrom":{"secretKeyRef":{"key":"password","name":"mysql-pass"}}}],"image":"mysql:5.6","imagePullPolicy":"IfNotPresent","name":"mysql","ports":[{"containerPort":3306,"name":"mysql","protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","volumeMounts":[{"mountPath":"/var/lib/mysql","name":"mysql-persistent-storage"}]}],"dnsPolicy":"ClusterFirst","restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"terminationGracePeriodSeconds":30,"volumes":[{"name":"mysql-persistent-storage","persistentVolumeClaim":{"claimName":"mysql-pv-claim"}}]}}}}
Group Version Kind:
Group: apps
Kind: ReplicaSet
Version: v1
Metadata:
{"apiVersion":"apps/v1","kind":"ReplicaSet","metadata":{"annotations":{"deployment.kubernetes.io/desired-replicas":"3","deployment.kubernetes.io/max-replicas":"4","deployment.kubernetes.io/revision":"1"},"labels":{"app":"k8s-demo-app","pod-template-hash":"6544df7845","tier":"frontend"},"name":"k8s-demo-app-frontend-6544df7845","namespace":"default","ownerReferences":[{"apiVersion":"apps/v1","blockOwnerDeletion":true,"controller":true,"kind":"Deployment","name":"k8s-demo-app-frontend","uid":"63c2334c-9e48-4119-b333-4bd47a3f824e"}]},"spec":{"replicas":3,"selector":{"matchLabels":{"app":"k8s-demo-app","pod-template-hash":"6544df7845","tier":"frontend"}},"template":{"metadata":{"labels":{"app":"k8s-demo-app","pod-template-hash":"6544df7845","tier":"frontend"}},"spec":{"containers":[{"image":"docker.io/trilio/k8s-demo-app:v1","imagePullPolicy":"IfNotPresent","name":"demoapp-frontend","ports":[{"containerPort":80,"protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}],"dnsPolicy":"ClusterFirst","restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"terminationGracePeriodSeconds":30}}}}
{"apiVersion":"apps/v1","kind":"ReplicaSet","metadata":{"annotations":{"deployment.kubernetes.io/desired-replicas":"1","deployment.kubernetes.io/max-replicas":"1","deployment.kubernetes.io/revision":"1"},"labels":{"app":"k8s-demo-app","pod-template-hash":"765495d764","tier":"mysql"},"name":"k8s-demo-app-mysql-765495d764","namespace":"default","ownerReferences":[{"apiVersion":"apps/v1","blockOwnerDeletion":true,"controller":true,"kind":"Deployment","name":"k8s-demo-app-mysql","uid":"8a5e4484-8b5a-4a96-9957-14dd502cb5b6"}]},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"k8s-demo-app","pod-template-hash":"765495d764","tier":"mysql"}},"template":{"metadata":{"labels":{"app":"k8s-demo-app","pod-template-hash":"765495d764","tier":"mysql"}},"spec":{"containers":[{"env":[{"name":"MYSQL_ROOT_PASSWORD","valueFrom":{"secretKeyRef":{"key":"password","name":"mysql-pass"}}}],"image":"mysql:5.6","imagePullPolicy":"IfNotPresent","name":"mysql","ports":[{"containerPort":3306,"name":"mysql","protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","volumeMounts":[{"mountPath":"/var/lib/mysql","name":"mysql-persistent-storage"}]}],"dnsPolicy":"ClusterFirst","restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"terminationGracePeriodSeconds":30,"volumes":[{"name":"mysql-persistent-storage","persistentVolumeClaim":{"claimName":"mysql-pv-claim"}}]}}}}
Data Snapshot:
Persistent Volume Claim Metadata: {"kind":"PersistentVolumeClaim","apiVersion":"v1","metadata":{"name":"mysql-pv-claim","namespace":"default","selfLink":"/api/v1/namespaces/default/persistentvolumeclaims/mysql-pv-claim","uid":"05a835dc-045d-422c-8940-0a4b4917fa44","resourceVersion":"1100","creationTimestamp":"2020-03-24T19:13:38Z","labels":{"app":"k8s-demo-app","tier":"mysql"},"annotations":{"pv.kubernetes.io/bind-completed":"yes","pv.kubernetes.io/bound-by-controller":"yes","volume.beta.kubernetes.io/storage-provisioner":"hostpath.csi.k8s.io"},"finalizers":["kubernetes.io/pvc-protection"]},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"5Gi"}},"volumeName":"pvc-05a835dc-045d-422c-8940-0a4b4917fa44","storageClassName":"csi-hostpath-sc","volumeMode":"Filesystem"},"status":{"phase":"Bound","accessModes":["ReadWriteOnce"],"capacity":{"storage":"5Gi"}}}
Persistent Volume Claim Name: mysql-pv-claim
Pod Containers Map:
Containers:
mysql
Pod Name: k8s-demo-app-mysql-765495d764-56f6b
Size: 0
Snapshot Size: 0
Volume Snapshot:
Retry Count: 1
Status: Completed
Volume Snapshot:
API Version: snapshot.storage.k8s.io/v1alpha1
Kind: VolumeSnapshot
Name: mysql-pv-claim-f727ec86-f3a1-4eff-83b9-99a2e30d331e
Namespace: default
Resource Version: 1352
UID: 10bc3f16-3ad5-4534-86e4-f556e8d932be
Start Timestamp: 2020-03-24T19:15:43Z
Status: InProgress
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackupUpdateFailed 68s backup-controller Updating Backup: demo-full-backup, Failed%!(EXTRA string=)
master $ kubectl get backup
NAME APPLICATION BACKUP TYPE STATUS START TIME BACKUP PHASE PERCENTAGE COMPLETED
demo-full-backup backup-job-k8s-demo-app Full Completed 119s Retention 100

If a backup or restore is in the Validation phase, check the Control Plane Pod logs for the root cause.
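
Instead of re-running "kubectl get backup" by hand, the status can be polled in a loop. The following is a sketch under several assumptions: the helper name wait_for_backup is hypothetical; the JSONPath fields .status.status and .status.phase are inferred from the "kubectl describe backup" output above and may differ between TVK versions; and any status other than InProgress or Completed is treated as a failure.

```shell
# Hypothetical helper: poll a Backup until it leaves the InProgress state.
# $1: backup name
# $2: seconds between polls (default 10)
# $3: maximum number of polls (default 60)
wait_for_backup() {
  name=$1
  interval=${2:-10}
  attempts=${3:-60}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Field names are assumptions inferred from the describe output.
    state=$(kubectl get backup "$name" -o jsonpath='{.status.status}')
    phase=$(kubectl get backup "$name" -o jsonpath='{.status.phase}')
    echo "backup=$name status=${state:-unknown} phase=${phase:-unknown}"
    case "$state" in
      Completed) return 0 ;;
      InProgress|"") ;;   # still running, or status not populated yet
      *) return 1 ;;      # any other value is treated as a failure
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "gave up waiting for backup $name" >&2
  return 1
}
```

For example, "wait_for_backup demo-full-backup 5" prints one line every five seconds, which makes it easy to see in which phase a backup stalls before looking at the Control Plane Pod logs.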

Log Collector

You can use this process to collect logs and send them to the Trilio team for further analysis of the issue. The script creates a zip file named triliovault-<date-time>.zip containing cluster debugging information.

Prerequisite: Python >= 3.6

pip3 install k8s-triliovault-logcollector --extra-index-url https://pypi.fury.io/k8s-triliovault/
log_collector.py

Optional arguments:

Parameter       Default          Description
--clustered     false            Whether the Trilio application is a clustered installation
--namespaces    []               List of namespaces to search for resources
--kube_config   ~/.kube/config   Path to the Kubernetes config file
--no-clean      false            Do not clean the output directory after creating the zip
--log-level     INFO             Log level for debugging