# T4K Integration with Observability Stack

## Introduction

The Observability Stack is a pre-packaged distribution for monitoring, logging, and dashboarding that can be installed into any existing Kubernetes cluster. It bundles several popular open-source observability tools: Prometheus, Grafana, Promtail, and Loki. The stack provides a straightforward, maintainable solution for analyzing server traffic and identifying potential deployment problems.

### T4K Installation with Observability using Trilio Operator

To install the operator with observability enabled, install the latest Helm chart with the following parameter set.

```
helm repo add triliovault-operator https://charts.k8strilio.net/trilio-stable/k8s-triliovault-operator
helm install tvm triliovault-operator/k8s-triliovault-operator --set observability.enabled=true
```

### Observability Stack Configurable Parameters

The following table lists the configuration parameters of the observability stack:

| Parameter | Description | Default |
| --- | --- | --- |
| `observability.enabled` | observability stack is enabled | false |
| `observability.name` | observability name for T4K integration | tvk-integration |
| `observability.logging.loki.enabled` | logging stack, loki is enabled | true |
| `observability.logging.loki.fullnameOverride` | name of the loki service | "loki" |
| `observability.logging.loki.singleBinary.persistence.enabled` | loki persistence storage enabled | true |
| `observability.logging.loki.singleBinary.persistence.accessModes` | loki persistence storage accessModes | ReadWriteOnce |
| `observability.logging.loki.singleBinary.persistence.size` | loki persistence storage size | 10Gi |
| `observability.logging.loki.loki.limits_config.reject_old_samples_max_age` | loki config, maximum accepted sample age before rejecting | 168h |
| `observability.logging.loki.tableManager.retention_period` | loki config, how far back tables will be kept before they are deleted. 0s disables deletion. | 168h |
| `observability.logging.promtail.enabled` | logging stack, promtail is enabled | true |
| `observability.logging.promtail.fullnameOverride` | name of the promtail service | "promtail" |
| `observability.logging.promtail.config.clients.url` | loki url for promtail integration | "http://loki:3100/loki/api/v1/push" |
| `observability.monitoring.prometheus.enabled` | monitoring stack, prometheus is enabled | true |
| `observability.monitoring.prometheus.fullnameOverride` | name of the prometheus service | "prom" |
| `observability.monitoring.prometheus.server.enabled` | prometheus server is enabled | true |
| `observability.monitoring.prometheus.server.fullnameOverride` | name of prometheus server service | "prom-server" |
| `observability.monitoring.prometheus.server.persistentVolume.enabled` | prometheus server with persistent volume is enabled | false |
| `observability.monitoring.prometheus.kube-state-metrics.enabled` | prometheus kube state metrics is enabled | false |
| `observability.monitoring.prometheus.prometheus-node-exporter.enabled` | prometheus node exporter is enabled | false |
| `observability.monitoring.prometheus.prometheus-pushgateway.enabled` | prometheus push gateway is enabled | false |
| `observability.monitoring.prometheus.alertmanager.enabled` | prometheus alert manager is enabled | false |
| `observability.visualization.grafana.enabled` | visualization stack, grafana is enabled | true |
| `observability.visualization.grafana.adminPassword` | grafana password for admin user | "admin123" |
| `observability.visualization.grafana.fullnameOverride` | name of grafana service | "grafana" |
| `observability.visualization.grafana.service.type` | grafana service type | "ClusterIP" |
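
As an alternative to passing each parameter with `--set`, the overrides can be collected in a values file. The following is a minimal sketch, not a recommended configuration: the file name is hypothetical, the sizes and the password are placeholder values, and the key nesting simply mirrors the dotted parameter names in the table above.

```yaml
# observability-overrides-values.yaml (hypothetical file name)
observability:
  enabled: true
  logging:
    loki:
      singleBinary:
        persistence:
          size: 20Gi                 # example value; the default is 10Gi
  monitoring:
    prometheus:
      server:
        persistentVolume:
          enabled: true              # the default is false
  visualization:
    grafana:
      adminPassword: "change-me"     # placeholder; replaces the default "admin123"
      service:
        type: ClusterIP
```

A file like this would be passed to `helm install` or `helm upgrade` with the `-f` flag, in the same way as the values files used in the Alertmanager examples on this page.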

Check the observability stack configuration by running the following command:

```
kubectl get pods -n <namespace>
promtail-2zpcv                                              1/1   Running   0   2m16s
grafana-554cb4f55-q4q59                                     3/3   Running   0   2m15s
prom-server-786b8cf897-nglhh                                2/2   Running   0   2m15s
k8s-triliovault-operator-85dfc877b8-5xqx9                   1/1   Running   0   2m15s
loki-0                                                      1/1   Running   0   2m15s
k8s-triliovault-admission-webhook-96db687bb-wnfh7           1/1   Running   0   62s
k8s-triliovault-control-plane-6b986c8fb9-zjbnj              2/2   Running   0   62s
k8s-triliovault-exporter-7b98cb7678-wxwvx                   1/1   Running   0   62s
k8s-triliovault-ingress-nginx-controller-57b777f45b-dnjkv   1/1   Running   0   62s
k8s-triliovault-web-85c79c9c4f-djqqz                        1/1   Running   0   62s
k8s-triliovault-web-backend-5c8c67c548-pcgvl                1/1   Running   0   62s
```

## Enabling ServiceMonitor for T4K Metrics

The T4K exporter exposes Prometheus metrics on port 8080. You can enable a ServiceMonitor for automatic metrics discovery by Prometheus.

### Enabling ServiceMonitor via Helm

Enable ServiceMonitor during T4K installation or upgrade:

```bash
# New installation with ServiceMonitor enabled
helm install tvm triliovault-operator/k8s-triliovault-operator \
  --set observability.enabled=true \
  --set installTVK.exporter.serviceMonitor.enabled=true
```

Or upgrade an existing installation:

```bash
helm upgrade tvm triliovault-operator/k8s-triliovault-operator \
  --set installTVK.exporter.serviceMonitor.enabled=true \
  --reuse-values
```

### ServiceMonitor Configuration Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `installTVK.exporter.enabled` | Enable/disable the metrics exporter | `true` |
| `installTVK.exporter.serviceMonitor.enabled` | Enable Prometheus ServiceMonitor for metrics collection | `false` |
| `installTVK.exporter.resources.requests.cpu` | CPU request for exporter pod | `50m` |
| `installTVK.exporter.resources.requests.memory` | Memory request for exporter pod | `512Mi` |
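
The same exporter settings can also be kept in a values file rather than on the command line. This is a sketch under the assumption that the key nesting follows the dotted parameter names in the table above; the file name is hypothetical and the resource requests shown are simply the documented defaults.

```yaml
# servicemonitor-values.yaml (hypothetical file name)
observability:
  enabled: true
installTVK:
  exporter:
    enabled: true
    serviceMonitor:
      enabled: true      # the default is false
    resources:
      requests:
        cpu: 50m         # documented default
        memory: 512Mi    # documented default
```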

When ServiceMonitor is enabled, the Helm chart creates:

* A **Service** exposing the exporter metrics on port 8080
* A **ServiceMonitor** resource that configures Prometheus to scrape metrics from the exporter

{% hint style="info" %}
When `exporter.serviceMonitor.enabled` is set to `false` (default), the exporter pod includes Prometheus scrape annotations:

* `prometheus.io/scrape: "true"`
* `prometheus.io/path: /metrics`
* `prometheus.io/port: "8080"`

If your Prometheus is configured to discover targets via pod annotations, metrics will be collected automatically without a ServiceMonitor.
{% endhint %}

### Verifying Metrics Collection

After enabling the ServiceMonitor, verify that Prometheus is scraping T4K metrics:

1. Access Prometheus or Grafana UI
2. Query for T4K metrics:

   ```promql
   trilio_backup_info
   ```
3. You should see metrics with labels like `backup`, `backupplan`, `resource_namespace`, etc.

## Alertmanager Configuration

Alertmanager handles alerts sent by Prometheus server and manages routing, grouping, and notification. The observability stack includes Alertmanager as a sub-chart that can be enabled for T4K monitoring.

### Enabling Alertmanager

To enable Alertmanager with the observability stack, set the following parameter during installation:

```bash
helm install tvm triliovault-operator/k8s-triliovault-operator \
  --set observability.enabled=true \
  --set observability.monitoring.prometheus.alertmanager.enabled=true
```

### Alertmanager Configurable Parameters

The following table lists the Alertmanager-specific configuration parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| `observability.monitoring.prometheus.alertmanager.enabled` | Enable Alertmanager | false |
| `observability.monitoring.prometheus.alertmanager.image.repository` | Alertmanager container image repository | quay.io/prometheus/alertmanager |
| `observability.monitoring.prometheus.alertmanager.configmapReload.image.repository` | Alertmanager configmap reload image repository | quay.io/prometheus-operator/prometheus-config-reloader |
| `observability.monitoring.prometheus.alertmanager.replicaCount` | Number of Alertmanager replicas | 1 |
| `observability.monitoring.prometheus.alertmanager.persistence.enabled` | Enable persistent storage for Alertmanager | true |
| `observability.monitoring.prometheus.alertmanager.persistence.size` | Alertmanager persistent volume size | 50Mi |
| `observability.monitoring.prometheus.alertmanager.persistence.accessModes` | Alertmanager persistent volume access modes | ReadWriteOnce |
| `observability.monitoring.prometheus.alertmanager.service.type` | Alertmanager service type | ClusterIP |
| `observability.monitoring.prometheus.alertmanager.service.port` | Alertmanager service port | 9093 |
| `observability.monitoring.prometheus.alertmanager.ingress.enabled` | Enable ingress for Alertmanager | false |

### Minimal Alertmanager Configuration

The following is a minimal Alertmanager
configuration sample with basic routing:

```yaml
# alertmanager-minimal-values.yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      alertmanager:
        enabled: true

        # Basic configuration
        replicaCount: 1

        persistence:
          enabled: true
          size: 50Mi

        service:
          type: ClusterIP
          port: 9093

        # Alertmanager configuration
        config:
          enabled: true
          global:
            resolve_timeout: 5m

          # Notification templates path
          templates:
            - '/etc/alertmanager/*.tmpl'

          # Define receivers (notification endpoints)
          receivers:
            - name: 'default-receiver'
              # Empty receiver - alerts go here but no notifications sent
            - name: 'null'
              # Explicitly ignore alerts

          # Routing tree
          route:
            # Default receiver
            receiver: 'default-receiver'
            # How long to wait before sending notification for a group
            group_wait: 30s
            # How long to wait before sending updated notification
            group_interval: 5m
            # How long to wait before re-sending notification
            repeat_interval: 4h
            # Group alerts by these labels
            group_by: ['alertname', 'namespace', 'severity']

            # Child routes (optional)
            routes:
              # Silence watchdog alerts
              - match:
                  alertname: Watchdog
                receiver: 'null'

          # Inhibition rules (optional)
          # Mute less severe alerts when critical ones are firing
          inhibit_rules:
            - source_match:
                severity: 'critical'
              target_match:
                severity: 'warning'
              equal: ['alertname', 'namespace']
```

Install with:

```bash
helm install tvm triliovault-operator/k8s-triliovault-operator -f alertmanager-minimal-values.yaml
```

### Example: Alertmanager with Slack Notifications

The following example demonstrates how to configure Alertmanager to send alerts to a Slack channel:

```yaml
# alertmanager-values.yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      enabled: true
      alertmanager:
        enabled: true
        persistence:
          enabled: true
          size: 100Mi
        config:
          enabled: true
          global:
            slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
          receivers:
            - name: 'slack-notifications'
              slack_configs:
                - channel: '#alerts'
                  send_resolved: true
                  title: '{{ template "slack.default.title" . }}'
                  text: '{{ template "slack.default.text" . }}'
            - name: 'default-receiver'
          route:
            group_by: ['alertname', 'namespace']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
            receiver: 'default-receiver'
            routes:
              - match:
                  severity: critical
                receiver: 'slack-notifications'
```

Install with the custom values file:

```bash
helm install tvm triliovault-operator/k8s-triliovault-operator -f alertmanager-values.yaml
```

### Example: Alertmanager with Email Notifications

The following example demonstrates how to configure Alertmanager to send alerts via email:

```yaml
# alertmanager-email-values.yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      enabled: true
      alertmanager:
        enabled: true
        persistence:
          enabled: true
          size: 100Mi
        config:
          enabled: true
          global:
            smtp_smarthost: 'smtp.example.com:587'
            smtp_from: 'alertmanager@example.com'
            smtp_auth_username: 'alertmanager@example.com'
            smtp_auth_password: 'your-smtp-password'
          receivers:
            - name: 'email-notifications'
              email_configs:
                - to: 'team@example.com'
                  send_resolved: true
            - name: 'default-receiver'
          route:
            group_by: ['alertname', 'namespace']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
            receiver: 'default-receiver'
            routes:
              - match:
                  severity: critical
                receiver: 'email-notifications'
```

### Example: Alertmanager with PagerDuty Integration

The following example demonstrates how to configure Alertmanager with PagerDuty for incident management:

```yaml
# alertmanager-pagerduty-values.yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      enabled: true
      alertmanager:
        enabled: true
        persistence:
          enabled: true
          size: 100Mi
        config:
          enabled: true
          receivers:
            - name: 'pagerduty-critical'
              pagerduty_configs:
                - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
                  severity: 'critical'
            - name: 'default-receiver'
          route:
            group_by: ['alertname', 'namespace', 'job']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
            receiver: 'default-receiver'
            routes:
              - match:
                  severity: critical
                receiver: 'pagerduty-critical'
```

### Example: Alertmanager with Custom Templates

Alertmanager templates allow you to customize the format and content of notifications. The following example demonstrates how to create custom templates for T4K alerts:

```yaml
# alertmanager-templates-values.yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      enabled: true
      alertmanager:
        enabled: true
        persistence:
          enabled: true
          size: 100Mi

        # Custom notification templates
        templates:
          t4k-alerts.tmpl: |-
            {{ define "t4k.title" -}}
            [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
            {{- end }}

            {{ define "t4k.text" -}}
            {{ range .Alerts }}
            *Alert:* {{ .Labels.alertname }}
            *Severity:* {{ .Labels.severity }}
            *Status:* {{ .Status }}
            *Namespace:* {{ .Labels.namespace }}
            *Summary:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Started:* {{ .StartsAt.Format "2006-01-02 15:04:05 MST" }}
            {{ if .EndsAt }}*Ended:* {{ .EndsAt.Format "2006-01-02 15:04:05 MST" }}{{ end }}
            ---
            {{ end }}
            {{- end }}

            {{ define "t4k.slack.title" -}}
            {{ if eq .Status "firing" }}[FIRING]{{ else }}[RESOLVED]{{ end }} {{ template "t4k.title" . }}
            {{- end }}

            {{ define "t4k.slack.text" -}}
            {{ if eq .Status "firing" }}
            *FIRING ALERTS:*
            {{ range .Alerts.Firing }}
            - *{{ .Labels.alertname }}* ({{ .Labels.severity }})
              Namespace: `{{ .Labels.namespace }}`
              {{ .Annotations.summary }}
            {{ end }}
            {{ end }}
            {{ if .Alerts.Resolved }}
            *RESOLVED ALERTS:*
            {{ range .Alerts.Resolved }}
            - *{{ .Labels.alertname }}* - {{ .Annotations.summary }}
            {{ end }}
            {{ end }}
            {{- end }}

            {{ define "t4k.email.subject" -}}
            [{{ .Status | toUpper }}] TrilioVault Alert: {{ .CommonLabels.alertname }}
            {{- end }}

            {{ define "t4k.email.html" -}}
            <h2>TrilioVault Alert Notification</h2>
            {{ range .Alerts }}
            <p><strong>Alert:</strong> {{ .Labels.alertname }}</p>
            <p><strong>Severity:</strong> {{ .Labels.severity }}</p>
            <p><strong>Status:</strong> {{ .Status }}</p>
            <p><strong>Namespace:</strong> {{ .Labels.namespace }}</p>
            <p><strong>Summary:</strong> {{ .Annotations.summary }}</p>
            <p><strong>Description:</strong> {{ .Annotations.description }}</p>
            {{ end }}
            {{- end }}

        config:
          enabled: true
          global:
            slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
          templates:
            - '/etc/alertmanager/*.tmpl'
          receivers:
            - name: 'default-receiver'
            - name: 'slack-t4k-alerts'
              slack_configs:
                - channel: '#t4k-alerts'
                  send_resolved: true
                  title: '{{ template "t4k.slack.title" . }}'
                  text: '{{ template "t4k.slack.text" . }}'
                  color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
            - name: 'email-t4k-alerts'
              email_configs:
                - to: 'backup-team@example.com'
                  send_resolved: true
                  headers:
                    Subject: '{{ template "t4k.email.subject" . }}'
                  html: '{{ template "t4k.email.html" . }}'
          route:
            group_by: ['alertname', 'namespace']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
            receiver: 'default-receiver'
            routes:
              - match:
                  severity: critical
                receiver: 'slack-t4k-alerts'
                continue: true
              - match:
                  severity: critical
                receiver: 'email-t4k-alerts'
```

#### Template Functions Reference

The following template functions are commonly used in Alertmanager templates:

| Function | Description | Example |
| --- | --- | --- |
| `toUpper` | Converts string to uppercase | `{{ .Status \| toUpper }}` |
| `toLower` | Converts string to lowercase | `{{ .Labels.severity \| toLower }}` |
| `title` | Converts string to title case | `{{ .Labels.alertname \| title }}` |
| `join` | Joins list elements with separator | `{{ .Labels.Values \| join ", " }}` |
| `safeHtml` | Marks string as safe HTML | `{{ .Annotations.description \| safeHtml }}` |
| `reReplaceAll` | Regex replace | `{{ reReplaceAll "(.*):(.*)" "$1" .Labels.instance }}` |

#### Template Variables

Common variables available in templates:

| Variable | Description |
| --- | --- |
| `.Status` | Alert status ("firing" or "resolved") |
| `.Alerts` | List of all alerts in the group |
| `.Alerts.Firing` | List of currently firing alerts |
| `.Alerts.Resolved` | List of resolved alerts |
| `.CommonLabels` | Labels common to all alerts |
| `.CommonAnnotations` | Annotations common to all alerts |
| `.ExternalURL` | URL to Alertmanager |
| `.GroupLabels` | Labels used for grouping |

### Example: Using Kubernetes Secrets for Credentials

For production environments, store sensitive credentials (such as Slack webhook URLs, SMTP passwords, or PagerDuty keys) in Kubernetes Secrets instead of hardcoding them in Helm values.

#### Step 1: Create Kubernetes Secret

First, create a secret containing your sensitive credentials:

```bash
# Create secret with Slack webhook URL, SMTP password, and PagerDuty key
kubectl create secret generic alertmanager-secrets \
  --namespace=<namespace> \
  --from-literal=slack-webhook-url='https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK' \
  --from-literal=smtp-password='your-smtp-password' \
  --from-literal=pagerduty-key='YOUR_PAGERDUTY_SERVICE_KEY'
```

Or using a YAML manifest:

```yaml
# alertmanager-secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-secrets
  namespace: <namespace>
type: Opaque
stringData:
  slack-webhook-url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
  smtp-password: "your-smtp-password"
  pagerduty-key: "YOUR_PAGERDUTY_SERVICE_KEY"
  smtp-auth-identity: "alertmanager@example.com"
```

Apply the secret:

```bash
kubectl apply -f alertmanager-secrets.yaml
```

#### Step 2: Configure Alertmanager to Use Secret

Configure Alertmanager to mount the secret and reference credentials from environment variables or files:

```yaml
# alertmanager-with-secrets-values.yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      enabled: true
      alertmanager:
        enabled: true
        persistence:
          enabled: true
          size: 100Mi

        # Mount the secret as environment variables
        extraEnv:
          - name: SLACK_WEBHOOK_URL
            valueFrom:
              secretKeyRef:
                name: alertmanager-secrets
                key: slack-webhook-url
          - name: SMTP_PASSWORD
            valueFrom:
              secretKeyRef:
                name: alertmanager-secrets
                key: smtp-password
          - name: PAGERDUTY_KEY
            valueFrom:
              secretKeyRef:
                name: alertmanager-secrets
                key: pagerduty-key

        # Alternatively, mount secret as files
        extraSecretMounts:
          - name: alertmanager-secrets
            mountPath: /etc/alertmanager/secrets
            secretName: alertmanager-secrets
            readOnly: true

        config:
          enabled: true
          global:
            # Reference Slack webhook from the mounted secret file
            slack_api_url_file: '/etc/alertmanager/secrets/slack-webhook-url'
            # SMTP configuration with password from secret
            smtp_smarthost: 'smtp.example.com:587'
            smtp_from: 'alertmanager@example.com'
            smtp_auth_username: 'alertmanager@example.com'
            smtp_auth_password_file: '/etc/alertmanager/secrets/smtp-password'
          receivers:
            - name: 'default-receiver'
            - name: 'slack-notifications'
              slack_configs:
                - channel: '#alerts'
                  send_resolved: true
                  # api_url can also be set per-receiver using file reference
                  api_url_file: '/etc/alertmanager/secrets/slack-webhook-url'
            - name: 'pagerduty-notifications'
              pagerduty_configs:
                - service_key_file: '/etc/alertmanager/secrets/pagerduty-key'
                  severity: 'critical'
            - name: 'email-notifications'
              email_configs:
                - to: 'team@example.com'
                  send_resolved: true
          route:
            group_by: ['alertname', 'namespace']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
            receiver: 'default-receiver'
            routes:
              - match:
                  severity: critical
                receiver: 'slack-notifications'
                continue: true
              - match:
                  severity: critical
                receiver: 'pagerduty-notifications'
              - match:
                  severity: warning
                receiver: 'email-notifications'
```

#### Secret File Reference Options

Alertmanager supports a `_file` suffix for many credential fields, which reads the value from a file:

| Original Field | File Reference Field | Description |
| --- | --- | --- |
| `slack_api_url` | `slack_api_url_file` | Global Slack webhook URL |
| `api_url` | `api_url_file` | Per-receiver Slack webhook URL |
| `smtp_auth_password` | `smtp_auth_password_file` | SMTP password |
| `smtp_auth_identity` | `smtp_auth_identity_file` | SMTP identity |
| `smtp_auth_secret` | `smtp_auth_secret_file` | SMTP secret |
| `service_key` | `service_key_file` | PagerDuty service key |
| `routing_key` | `routing_key_file` | PagerDuty routing key |
| `token` | `token_file` | Opsgenie/VictorOps token |
| `url` | `url_file` | Webhook URL |

#### Example: Complete Setup with External Secrets Operator

For organizations using External Secrets Operator (ESO) to sync secrets from external secret managers (AWS Secrets Manager, HashiCorp Vault, etc.):

```yaml
# external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: alertmanager-secrets
  namespace: <namespace>
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: alertmanager-secrets
    creationPolicy: Owner
  data:
    - secretKey: slack-webhook-url
      remoteRef:
        key: /production/alertmanager/slack-webhook
    - secretKey: smtp-password
      remoteRef:
        key: /production/alertmanager/smtp-password
    - secretKey: pagerduty-key
      remoteRef:
        key: /production/alertmanager/pagerduty-key
```

Apply the ExternalSecret:

```bash
kubectl apply -f external-secret.yaml
```

The External Secrets Operator will automatically create and sync the `alertmanager-secrets` Kubernetes Secret from your external secret manager.

{% hint style="warning" %}
**Security Best Practices:**

* Never commit secrets to version control
* Use `_file` references instead of inline credentials
* Rotate secrets regularly
* Use RBAC to restrict access to secrets
* Consider using sealed-secrets or external-secrets-operator for GitOps workflows
{% endhint %}

### Example: Complete Observability Stack with Alertmanager

The following is a comprehensive example enabling the full observability stack with Alertmanager, custom alerting rules, and persistent storage:

```yaml
# full-observability-values.yaml
observability:
  enabled: true
  name: "t4k-observability"

  logging:
    loki:
      enabled: true
      fullnameOverride: "loki"
      singleBinary:
        replicas: 1
        persistence:
          enabled: true
          size: 20Gi
    promtail:
      enabled: true
      fullnameOverride: "promtail"

  monitoring:
    prometheus:
      enabled: true
      fullnameOverride: "prometheus"
      server:
        enabled: true
        fullnameOverride: "prometheus-server"
        persistentVolume:
          enabled: true
          size: 10Gi
        retention: "30d"

      # Enable kube-state-metrics for Kubernetes metrics
      kube-state-metrics:
        enabled: true

      # Enable node-exporter for node-level metrics
      prometheus-node-exporter:
        enabled: true

      alertmanager:
        enabled: true
        replicaCount: 1
        persistence:
          enabled: true
          size: 100Mi

        # Alertmanager configuration
        config:
          enabled: true
          global:
            slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            resolve_timeout: 5m
          templates:
            - '/etc/alertmanager/*.tmpl'
          receivers:
            - name: 'default-receiver'
            - name: 'slack-critical'
              slack_configs:
                - channel: '#critical-alerts'
                  send_resolved: true
                  title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
                  text: >-
                    {{ range .Alerts }}
                    *Alert:* {{ .Annotations.summary }}
                    *Description:* {{ .Annotations.description }}
                    *Severity:* {{ .Labels.severity }}
                    *Namespace:* {{ .Labels.namespace }}
                    {{ end }}
            - name: 'slack-warning'
              slack_configs:
                - channel: '#warning-alerts'
                  send_resolved: true
          route:
            group_by: ['alertname', 'namespace', 'severity']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
            receiver: 'default-receiver'
            routes:
              - match:
                  severity: critical
                receiver: 'slack-critical'
              - match:
                  severity: warning
                receiver: 'slack-warning'

      # Custom alerting rules for T4K
      serverFiles:
        alerting_rules.yml:
          groups:
            - name: t4k-backup-alerts
              rules:
                # Alert when backup fails (metric value -1 indicates Failed/Error status)
                - alert: T4KBackupFailed
                  expr: trilio_backup_info == -1
                  for: 1m
                  labels:
                    severity: critical
                  annotations:
                    summary: "TrilioVault Backup Failed"
                    description: "Backup {{ $labels.backup }} has failed in namespace {{ $labels.resource_namespace }} on cluster {{ $labels.cluster }}"

                # Alert when backup is stuck in progress for too long.
                # The percentage metric is on the left so that $value reports the progress;
                # the two metrics carry different label sets, so they are matched on
                # backup and resource_namespace only.
                - alert: T4KBackupStuck
                  expr: trilio_backup_status_percentage < 100 and on (backup, resource_namespace) trilio_backup_info{status="InProgress"} == 0
                  for: 60m
                  labels:
                    severity: warning
                  annotations:
                    summary: "TrilioVault Backup Stuck"
                    description: "Backup {{ $labels.backup }} has been in progress for more than 60 minutes in namespace {{ $labels.resource_namespace }}. Current progress: {{ $value }}%"

                # Alert when backup takes unusually long time
                - alert: T4KBackupDurationHigh
                  expr: trilio_backup_completed_duration > 120
                  for: 5m
                  labels:
                    severity: warning
                  annotations:
                    summary: "TrilioVault Backup Duration High"
                    description: "Backup {{ $labels.backup }} took {{ $value }} minutes to complete in namespace {{ $labels.resource_namespace }}"

            - name: t4k-restore-alerts
              rules:
                # Alert when restore fails
                - alert: T4KRestoreFailed
                  expr: trilio_restore_info == -1
                  for: 1m
                  labels:
                    severity: critical
                  annotations:
                    summary: "TrilioVault Restore Failed"
                    description: "Restore {{ $labels.restore }} has failed in namespace {{ $labels.resource_namespace }} on cluster {{ $labels.cluster }}"

                # Alert when restore is stuck (matched on restore and resource_namespace,
                # since the two metrics carry different label sets)
                - alert: T4KRestoreStuck
                  expr: trilio_restore_status_percentage < 100 and on (restore, resource_namespace) trilio_restore_info{status="InProgress"} == 0
                  for: 60m
                  labels:
                    severity: warning
                  annotations:
                    summary: "TrilioVault Restore Stuck"
                    description: "Restore {{ $labels.restore }} has been in progress for more than 60 minutes in namespace {{ $labels.resource_namespace }}"

            - name: t4k-target-alerts
              rules:
                # Alert when target is unavailable (metric value 0 indicates unavailable)
                - alert: T4KTargetUnavailable
                  expr: trilio_target_info == 0
                  for: 5m
                  labels:
                    severity: critical
                  annotations:
                    summary: "TrilioVault Target Unavailable"
                    description: "Target {{ $labels.target }} is not available in namespace {{ $labels.resource_namespace }}. Status: {{ $labels.status }}"

                # Alert when target storage exceeds threshold (example: 500GB)
                - alert: T4KTargetStorageHigh
                  expr: trilio_target_storage > 500000000000
                  for: 10m
                  labels:
                    severity: warning
                  annotations:
                    summary: "TrilioVault Target Storage High"
                    description: "Target {{ $labels.target }} storage usage is {{ $value | humanize1024 }}B in namespace {{ $labels.resource_namespace }}"

            - name: t4k-backupplan-alerts
              rules:
                # Alert when BackupPlan has no successful backups (not protected)
                - alert: T4KBackupPlanNotProtected
                  expr: trilio_backupplan_info{protected="False"} == 1
                  for: 24h
                  labels:
                    severity: warning
                  annotations:
                    summary: "TrilioVault BackupPlan Not Protected"
                    description: "BackupPlan {{ $labels.backupplan }} in namespace {{ $labels.resource_namespace }} has no successful backups for more than 24 hours"

                # Alert when BackupPlan fails
                - alert: T4KBackupPlanFailed
                  expr: trilio_backupplan_info == -1
                  for: 5m
                  labels:
                    severity: critical
                  annotations:
                    summary: "TrilioVault BackupPlan Failed"
                    description: "BackupPlan {{ $labels.backupplan }} has failed in namespace {{ $labels.resource_namespace }}"

            - name: t4k-continuous-restore-alerts
              rules:
                # Alert when ContinuousRestorePlan fails
                - alert: T4KContinuousRestorePlanFailed
                  expr: trilio_continuousrestoreplan_info == -1
                  for: 5m
                  labels:
                    severity: critical
                  annotations:
                    summary: "TrilioVault ContinuousRestorePlan Failed"
                    description: "ContinuousRestorePlan {{ $labels.continuousrestoreplan }} has failed on cluster {{ $labels.cluster }}"

                # Alert when ConsistentSet fails
                - alert: T4KConsistentSetFailed
                  expr: trilio_consistentset_info == -1
                  for: 5m
                  labels:
                    severity: critical
                  annotations:
                    summary: "TrilioVault ConsistentSet Failed"
                    description: "ConsistentSet {{ $labels.consistentset }} has failed for ContinuousRestorePlan {{ $labels.continuousrestoreplan }}"

  visualization:
    grafana:
      enabled: true
      fullnameOverride: "grafana"
      adminPassword: "your-secure-password"
      service:
        type: ClusterIP
```

Install with:

```bash
helm install tvm triliovault-operator/k8s-triliovault-operator -f full-observability-values.yaml
```

### Verifying Alertmanager Installation

After installation, verify that Alertmanager is running:

```bash
kubectl get pods -n <namespace> | grep alertmanager

# Expected output:
# -alertmanager-0   1/1   Running   0   2m
```

Access the Alertmanager UI:

```bash
# Port forward to access locally
kubectl port-forward svc/alertmanager -n <namespace> 9093:9093
```

Then open your browser to `http://localhost:9093` to view the Alertmanager UI.

## T4K Metrics Reference

Trilio for Kubernetes (T4K) exports Prometheus metrics through the `k8s-triliovault-exporter` component. These metrics can be used for monitoring, alerting, and dashboarding.

### Metric Value Conventions

For status-based metrics (`*_info` metrics), the numeric value indicates the status:

| Status | Metric Value | Description |
| --- | --- | --- |
| `Available` / `Completed` | `1` | Resource is healthy/successful |
| `Failed` / `Error` | `-1` | Resource has failed |
| `InProgress` | `0` | Operation is in progress |
| Empty/Unknown | `-2` | Status not yet determined |

### Available Metrics

#### Backup Metrics

| Metric Name | Description | Key Labels |
| --- | --- | --- |
| `trilio_backup_info` | Backup status and metadata | `backup`, `backupplan`, `resource_namespace`, `status`, `target`, `backup_type`, `start_ts`, `completion_ts`, `size`, `cluster`, `kind`, `hook`, `backupscope`, `applicationtype` |
| `trilio_backup_storage` | Backup size in bytes | `backup`, `backupplan`, `resource_namespace`, `status`, `target`, `backup_type`, `cluster`, `kind` |
| `trilio_backup_status_percentage` | Backup progress (0-100) | `backup`, `backupplan`, `resource_namespace`, `status`, `target`, `backup_type`, `cluster`, `kind` |
| `trilio_backup_completed_duration` | Backup duration in minutes (only for completed backups) | `backup`, `backupplan`, `resource_namespace`, `status`, `target`, `backup_type`, `cluster`, `kind` |
| `trilio_backup_metadata_info` | Detailed backup object metadata | `backup`, `backupplan`, `resource_namespace`, `status`, `objecttype`, `objectname`, `backupscope`, `applicationtype`, `apiversion`, `apigroup`, `object_resource` |

#### Restore Metrics

| Metric Name | Description | Key Labels |
| --- | --- | --- |
| `trilio_restore_info` | Restore status and metadata | `restore`, `backup`, `backupplan`, `resource_namespace`, `status`, `target`, `size`, `start_ts`, `completion_ts`, `cluster`, `kind` |
| `trilio_restore_status_percentage` | Restore progress (0-100) | `restore`, `backup`, `resource_namespace`, `status`, `target`, `cluster`, `kind` |
| `trilio_restore_completed_duration` | Restore duration in minutes (only for completed restores) | `restore`, `backup`, `resource_namespace`, `status`, `target`, `cluster`, `kind` |
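
The restore duration metric can be used for alerting in the same way as the backup duration metric. The fragment below is a sketch only: the group name, the alert name, and the 120-minute threshold are illustrative assumptions, and the rule is assumed to live under the Prometheus chart's `serverFiles` key like the other custom rules on this page.

```yaml
# Hypothetical values fragment; tune the threshold to your own restore times.
observability:
  monitoring:
    prometheus:
      serverFiles:
        alerting_rules.yml:
          groups:
            - name: t4k-restore-duration-alerts   # illustrative group name
              rules:
                # Fires when a completed restore took longer than 120 minutes
                - alert: T4KRestoreDurationHigh    # illustrative alert name
                  expr: trilio_restore_completed_duration > 120
                  for: 5m
                  labels:
                    severity: warning
                  annotations:
                    summary: "TrilioVault Restore Duration High"
                    description: "Restore {{ $labels.restore }} took {{ $value }} minutes to complete in namespace {{ $labels.resource_namespace }}"
```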

#### Target Metrics

| Metric Name | Description | Key Labels |
| --- | --- | --- |
| `trilio_target_info` | Target availability status (1=available, 0=unavailable) | `target`, `resource_namespace`, `status`, `vendor`, `vendorType`, `browsing`, `eventTarget`, `size`, `threshold_capacity`, `creation_ts`, `cluster` |
| `trilio_target_storage` | Storage used by target in bytes | `target`, `resource_namespace`, `status`, `vendor`, `vendorType`, `threshold_capacity`, `creation_ts`, `cluster` |

#### BackupPlan Metrics

| Metric Name | Description | Key Labels |
| --- | --- | --- |
| `trilio_backupplan_info` | BackupPlan status and summary | `backupplan`, `resource_namespace`, `status`, `target`, `protected`, `backup_count`, `lastprotected`, `backupscope`, `applicationtype`, `creation_ts`, `cluster`, `kind` |
| `trilio_backupplan_crstatus` | BackupPlan continuous restore status | `backupplan`, `continuousrestoreinstance`, `continuousrestore_enabled`, `continuousrestoreplan`, `consistentset_count`, `cr_status`, `cluster`, `kind` |

#### Continuous Restore Metrics

Metric Name	Description	Key Labels
`trilio_continuousrestoreplan_info`	ContinuousRestorePlan status	`continuousrestoreplan`, `continuousrestorepolicy`, `target`, `consistentsetcount`, `sourcebackupplan`, `sourceinstanceinfo`, `status`, `creation_ts`, `cluster`, `kind`
`trilio_consistentset_info`	ConsistentSet status and details	`consistentset`, `consistentsetscope`, `continuousrestoreplan`, `sourcebackupplan`, `sourceinstanceinfo`, `backupName`, `backupNamespace`, `backupStatus`, `backupSize`, `status`, `size`, `cluster`, `kind`
`trilio_consistentset_status_percentage`	ConsistentSet progress (0-100)	`consistentset`, `consistentsetscope`, `continuousrestoreplan`, `sourcebackupplan`, `sourceinstanceinfo`, `backupName`, `status`, `cluster`, `kind`
`trilio_consistentset_completed_duration`	ConsistentSet duration in minutes	`consistentset`, `consistentsetscope`, `continuousrestoreplan`, `sourcebackupplan`, `sourceinstanceinfo`, `backupName`, `status`, `cluster`, `kind`
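The metrics in the tables above can also be read outside Grafana through the standard Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, not part of the T4K tooling: the `PROM_URL` address and the `prom_query` helper name are assumptions for illustration, and the address must be replaced with wherever Prometheus is reachable in your cluster (for example through a port-forward).

```shell
#!/usr/bin/env bash
# Assumption: Prometheus is reachable at PROM_URL. The default below is a
# hypothetical local address, not a value defined by T4K.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Hypothetical helper: send an instant query to the Prometheus HTTP API
# and print the raw JSON response.
prom_query() {
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Example call (uncomment once PROM_URL points at your Prometheus):
# prom_query 'trilio_continuousrestoreplan_info'
```

Passing the expression through `--data-urlencode` avoids having to escape braces, quotes, and spaces in label selectors by hand.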

### Example PromQL Queries

```promql
# Count of failed backups by namespace
count(trilio_backup_info == -1) by (resource_namespace)

# List all successful backups
trilio_backup_info{status="Available"}

# Total backup storage per target
sum(trilio_backup_storage) by (target)

# Average backup duration by backupplan
avg(trilio_backup_completed_duration) by (backupplan)

# Unavailable targets
trilio_target_info == 0

# BackupPlans without successful backups
trilio_backupplan_info{protected="False"}

# Failed restores in last 24 hours
trilio_restore_info == -1

# ContinuousRestorePlan replication lag (ConsistentSets in progress)
trilio_consistentset_info{status="InProgress"}
```

### Viewing Alert Rules in Grafana

Once alert rules are configured, you can view and manage them directly from the Grafana UI. Navigate to **Alerting > Alert rules** to see all configured rules, their current state, and firing alerts.

The Alert rules page shows:

* **Data source-managed rules**: Alert rules defined in Prometheus configuration (e.g., `/etc/config/alerting_rules.yml`)
* **State**: Current state of each alert (Firing, Normal, Pending, Recovering)
* **Health**: Health status of the alert rule
* **Summary**: Brief description of what the alert monitors

You can filter alerts by data source, dashboard, state, rule type, health status, and contact point.

### View Logs From T4K UI

* Log in to the T4K UI with your preferred authentication method.
* Select "Launch Event Viewer" on any required service or application.

* Clicking "Launch Event Viewer" redirects you to the log visibility page.

### Accessing Grafana Dashboards

```
Grafana Endpoint : http:///grafana
Login with default grafana credentials.
username: admin
password: admin123
```

{% hint style="info" %}
If a custom path is configured, then: Grafana Endpoint : http\://\/\/grafana
{% endhint %}

## Additional Monitoring Components

### Kube-State-Metrics

Kube-state-metrics generates metrics about the state of Kubernetes objects. Enable it to get comprehensive cluster metrics:

```yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      kube-state-metrics:
        enabled: true
```

### Node Exporter

Node Exporter exposes hardware and OS metrics from the host machines:

```yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      prometheus-node-exporter:
        enabled: true
```

### Pushgateway

Pushgateway allows ephemeral and batch jobs to expose metrics to Prometheus:

```yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      prometheus-pushgateway:
        enabled: true
```
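
The three components above can likely be enabled together from a single values file. The sketch below assumes the key nesting shown in the snippets above, the `tvm` release name and chart reference from the installation command, and a hypothetical file name (`observability-values.yaml`). It prints the `helm upgrade` command instead of running it, so the values can be reviewed first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file name for the combined values; change as needed.
VALUES_FILE="${VALUES_FILE:-observability-values.yaml}"

# Write one values file that enables all three optional components.
# Assumption: the key nesting mirrors the per-component snippets.
cat > "$VALUES_FILE" <<'EOF'
observability:
  enabled: true
  monitoring:
    prometheus:
      kube-state-metrics:
        enabled: true
      prometheus-node-exporter:
        enabled: true
      prometheus-pushgateway:
        enabled: true
EOF

# Print the upgrade command for review rather than executing it.
# Assumption: the release is named "tvm", as in the install example.
echo "helm upgrade tvm triliovault-operator/k8s-triliovault-operator --reuse-values -f $VALUES_FILE"
```

Keeping the settings in a file rather than in repeated `--set` flags makes the enabled components easy to track in version control.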