T4K Integration with Observability Stack

Introduction

The Observability Stack is a pre-packaged distribution for monitoring, logging, and dashboarding and can be installed into any existing Kubernetes cluster. It includes many of the most popular open-source observability tools with Prometheus, Grafana, Promtail**,** and Loki. The observability stack provides a straightforward, maintainable solution for analyzing server traffic and identifying potential deployment problems.

T4K Installation with Observability using Trilio Operator

To install the operator with observability enabled, run the latest helm chart with the following parameter set.

helm repo add triliovault-operator https://charts.k8strilio.net/trilio-stable/k8s-triliovault-operator
helm install tvm triliovault-operator/k8s-triliovault-operator --set observability.enabled=true

Observability Stack Configurable Parameters

The following table lists the configuration parameters of the observability stack

Parameter

Description

Default

observability.enabled

observability stack is enabled

false

observability.name

observability name for T4K integration

tvk-integration

observability.logging.loki.enabled

logging stack, loki is enabled

true

observability.logging.loki.fullnameOverride

name of the loki service

"loki"

observability.logging.loki.singleBinary.persistence.enabled

loki persistence storage enabled

true

observability.logging.loki.singleBinary.persistence.accessModes

loki persistence storage accessModes

ReadWriteOnce

observability.logging.loki.singleBinary.persistence.size

loki persistence storage size

10Gi

observability.logging.loki.loki.limits_config.reject_old_samples_max_age

loki config, maximum accepted sample age before rejecting

168h

observability.logging.loki.tableManager.retention_period

loki config, how far back tables will be kept before they are deleted. 0s disables deletion.

168h

observability.logging.promtail.enabled

logging stack, promtail is enabled

true

observability.logging.promtail.fullnameOverride

name of the promtail service

"promtail"

observability.logging.promtail.config.clients.url

loki url for promtail integration

"http://loki:3100/loki/api/v1/push"

observability.monitoring.prometheus.enabled

monitoring stack, prometheus is enabled

true

observability.monitoring.prometheus.fullnameOverride

name of the prometheus service

"prom"

observability.monitoring.prometheus.server.enabled

prometheus server is enabled

true

observability.monitoring.prometheus.server.fullnameOverride

name of prometheus server service

"prom-server"

observability.monitoring.prometheus.server.persistentVolume.enabled

prometheus server with persistent volume is enabled

false

observability.monitoring.prometheus.kube-state-metrics.enabled

prometheus kube state metrics is enabled

false

observability.monitoring.prometheus.prometheus-node-exporter.enabled

prometheus node exporter is enabled

false

observability.monitoring.prometheus.prometheus-pushgateway.enabled

prometheus push gateway is enabled

false

observability.monitoring.prometheus.alertmanager.enabled

prometheus alert manager is enabled

false

observability.visualization.grafana.enabled

visualization stack, grafana is enabled

true

observability.visualization.grafana.adminPassword

grafana password for admin user

"admin123"

observability.visualization.grafana.fullnameOverride

name of grafana service

"grafana"

observability.visualization.grafana.service.type

grafana service type

"ClusterIP"

Check the observability stack configuration by running the following command:

kubectl get pods -n <install_ns>

promtail-2zpcv                                              1/1     Running            0          2m16s
grafana-554cb4f55-q4q59                                     3/3     Running            0          2m15s
prom-server-786b8cf897-nglhh                                2/2     Running            0          2m15s
k8s-triliovault-operator-85dfc877b8-5xqx9                   1/1     Running            0          2m15s
loki-0                                                      1/1     Running            0          2m15s
k8s-triliovault-admission-webhook-96db687bb-wnfh7           1/1     Running            0          62s
k8s-triliovault-control-plane-6b986c8fb9-zjbnj              2/2     Running            0          62s
k8s-triliovault-exporter-7b98cb7678-wxwvx                   1/1     Running            0          62s
k8s-triliovault-ingress-nginx-controller-57b777f45b-dnjkv   1/1     Running            0          62s
k8s-triliovault-web-85c79c9c4f-djqqz                        1/1     Running            0          62s
k8s-triliovault-web-backend-5c8c67c548-pcgvl                1/1     Running            0          62s

Enabling ServiceMonitor for T4K Metrics

The T4K exporter exposes Prometheus metrics on port 8080. You can enable a ServiceMonitor for automatic metrics discovery by Prometheus.

Enabling ServiceMonitor via Helm

Enable ServiceMonitor during T4K installation or upgrade:

# New installation with ServiceMonitor enabled
helm install tvm triliovault-operator/k8s-triliovault-operator \
  --set observability.enabled=true \
  --set installTVK.exporter.serviceMonitor.enabled=true

Or upgrade an existing installation:

helm upgrade tvm triliovault-operator/k8s-triliovault-operator \
  --set installTVK.exporter.serviceMonitor.enabled=true \
  --reuse-values

ServiceMonitor Configuration Parameters

Parameter

Description

Default

installTVK.exporter.enabled

Enable/disable the metrics exporter

true

installTVK.exporter.serviceMonitor.enabled

Enable Prometheus ServiceMonitor for metrics collection

false

installTVK.exporter.resources.requests.cpu

CPU request for exporter pod

50m

installTVK.exporter.resources.requests.memory

Memory request for exporter pod

512Mi

When ServiceMonitor is enabled, the Helm chart creates:

A Service exposing the exporter metrics on port 8080
A ServiceMonitor resource that configures Prometheus to scrape metrics from the exporter

When exporter.serviceMonitor.enabled is set to false (default), the exporter pod includes Prometheus scrape annotations:

prometheus.io/scrape: "true"
prometheus.io/path: /metrics
prometheus.io/port: "8080"

If your Prometheus is configured to discover targets via pod annotations, metrics will be collected automatically without a ServiceMonitor.

Verifying Metrics Collection

After enabling the ServiceMonitor, verify that Prometheus is scraping T4K metrics:

Access Prometheus or Grafana UI
Query for T4K metrics:
```
trilio_backup_info
```
You should see metrics with labels like backup, backupplan, resource_namespace, etc.

Alertmanager Configuration

Alertmanager handles alerts sent by Prometheus server and manages routing, grouping, and notification. The observability stack includes Alertmanager as a sub-chart that can be enabled for T4K monitoring.

Enabling Alertmanager

To enable Alertmanager with the observability stack, set the following parameter during installation:

helm install tvm triliovault-operator/k8s-triliovault-operator \
  --set observability.enabled=true \
  --set observability.monitoring.prometheus.alertmanager.enabled=true

Alertmanager Configurable Parameters

The following table lists the Alertmanager-specific configuration parameters:

Parameter

Description

Default

observability.monitoring.prometheus.alertmanager.enabled

Enable Alertmanager

false

observability.monitoring.prometheus.alertmanager.image.repository

Alertmanager container image repository

quay.io/prometheus/alertmanager

observability.monitoring.prometheus.alertmanager.configmapReload.image.repository

Alertmanager configmap reload image repository

quay.io/prometheus-operator/prometheus-config-reloader

observability.monitoring.prometheus.alertmanager.replicaCount

Number of Alertmanager replicas

observability.monitoring.prometheus.alertmanager.persistence.enabled

Enable persistent storage for Alertmanager

true

observability.monitoring.prometheus.alertmanager.persistence.size

Alertmanager persistent volume size

50Mi

observability.monitoring.prometheus.alertmanager.persistence.accessModes

Alertmanager persistent volume access modes

ReadWriteOnce

observability.monitoring.prometheus.alertmanager.service.type

Alertmanager service type

ClusterIP

observability.monitoring.prometheus.alertmanager.service.port

Alertmanager service port

9093

observability.monitoring.prometheus.alertmanager.ingress.enabled

Enable ingress for Alertmanager

false

Minimal Alertmanager Configuration

The following is a minimal Alertmanager configuration sample with basic routing:

# alertmanager-minimal-values.yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      alertmanager:
        enabled: true
        
        # Basic configuration
        replicaCount: 1
        
        persistence:
          enabled: true
          size: 50Mi
        
        service:
          type: ClusterIP
          port: 9093
        
        # Alertmanager configuration
        config:
          enabled: true
          
          global:
            resolve_timeout: 5m
          
          # Notification templates path
          templates:
            - '/etc/alertmanager/*.tmpl'
          
          # Define receivers (notification endpoints)
          receivers:
            - name: 'default-receiver'
              # Empty receiver - alerts go here but no notifications sent
            
            - name: 'null'
              # Explicitly ignore alerts
          
          # Routing tree
          route:
            # Default receiver
            receiver: 'default-receiver'
            
            # How long to wait before sending notification for a group
            group_wait: 30s
            
            # How long to wait before sending updated notification
            group_interval: 5m
            
            # How long to wait before re-sending notification
            repeat_interval: 4h
            
            # Group alerts by these labels
            group_by: ['alertname', 'namespace', 'severity']
            
            # Child routes (optional)
            routes:
              # Silence watchdog alerts
              - match:
                  alertname: Watchdog
                receiver: 'null'
          
          # Inhibition rules (optional)
          # Mute less severe alerts when critical ones are firing
          inhibit_rules:
            - source_match:
                severity: 'critical'
              target_match:
                severity: 'warning'
              equal: ['alertname', 'namespace']

Install with:

helm install tvm triliovault-operator/k8s-triliovault-operator -f alertmanager-minimal-values.yaml

Example: Alertmanager with Slack Notifications

The following example demonstrates how to configure Alertmanager to send alerts to a Slack channel:

# alertmanager-values.yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      enabled: true
      alertmanager:
        enabled: true
        persistence:
          enabled: true
          size: 100Mi
        config:
          enabled: true
          global:
            slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
          receivers:
            - name: 'slack-notifications'
              slack_configs:
                - channel: '#alerts'
                  send_resolved: true
                  title: '{{ template "slack.default.title" . }}'
                  text: '{{ template "slack.default.text" . }}'
            - name: 'default-receiver'
          route:
            group_by: ['alertname', 'namespace']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
            receiver: 'default-receiver'
            routes:
              - match:
                  severity: critical
                receiver: 'slack-notifications'

Install with the custom values file:

helm install tvm triliovault-operator/k8s-triliovault-operator -f alertmanager-values.yaml

Example: Alertmanager with Email Notifications

The following example demonstrates how to configure Alertmanager to send alerts via email:

# alertmanager-email-values.yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      enabled: true
      alertmanager:
        enabled: true
        persistence:
          enabled: true
          size: 100Mi
        config:
          enabled: true
          global:
            smtp_smarthost: 'smtp.example.com:587'
            smtp_from: '[email protected]'
            smtp_auth_username: '[email protected]'
            smtp_auth_password: 'your-smtp-password'
          receivers:
            - name: 'email-notifications'
              email_configs:
                - to: '[email protected]'
                  send_resolved: true
            - name: 'default-receiver'
          route:
            group_by: ['alertname', 'namespace']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
            receiver: 'default-receiver'
            routes:
              - match:
                  severity: critical
                receiver: 'email-notifications'

Example: Alertmanager with PagerDuty Integration

The following example demonstrates how to configure Alertmanager with PagerDuty for incident management:

# alertmanager-pagerduty-values.yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      enabled: true
      alertmanager:
        enabled: true
        persistence:
          enabled: true
          size: 100Mi
        config:
          enabled: true
          receivers:
            - name: 'pagerduty-critical'
              pagerduty_configs:
                - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
                  severity: 'critical'
            - name: 'default-receiver'
          route:
            group_by: ['alertname', 'namespace', 'job']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
            receiver: 'default-receiver'
            routes:
              - match:
                  severity: critical
                receiver: 'pagerduty-critical'

Example: Alertmanager with Custom Templates

Alertmanager templates allow you to customize the format and content of notifications. The following example demonstrates how to create custom templates for T4K alerts:

# alertmanager-templates-values.yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      enabled: true
      alertmanager:
        enabled: true
        persistence:
          enabled: true
          size: 100Mi
        
        # Custom notification templates
        templates:
          t4k-alerts.tmpl: |-
            {{ define "t4k.title" -}}
            [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
            {{- end }}

            {{ define "t4k.text" -}}
            {{ range .Alerts }}
            *Alert:* {{ .Labels.alertname }}
            *Severity:* {{ .Labels.severity }}
            *Status:* {{ .Status }}
            *Namespace:* {{ .Labels.namespace }}
            *Summary:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Started:* {{ .StartsAt.Format "2006-01-02 15:04:05 MST" }}
            {{ if .EndsAt }}*Ended:* {{ .EndsAt.Format "2006-01-02 15:04:05 MST" }}{{ end }}
            ---
            {{ end }}
            {{- end }}

            {{ define "t4k.slack.title" -}}
            {{ if eq .Status "firing" }}[FIRING]{{ else }}[RESOLVED]{{ end }} {{ template "t4k.title" . }}
            {{- end }}

            {{ define "t4k.slack.text" -}}
            {{ if eq .Status "firing" }}
            *FIRING ALERTS:*
            {{ range .Alerts.Firing }}
            - *{{ .Labels.alertname }}* ({{ .Labels.severity }})
              Namespace: `{{ .Labels.namespace }}`
              {{ .Annotations.summary }}
            {{ end }}
            {{ end }}
            {{ if .Alerts.Resolved }}
            *RESOLVED ALERTS:*
            {{ range .Alerts.Resolved }}
            - *{{ .Labels.alertname }}* - {{ .Annotations.summary }}
            {{ end }}
            {{ end }}
            {{- end }}

            {{ define "t4k.email.subject" -}}
            [{{ .Status | toUpper }}] TrilioVault Alert: {{ .CommonLabels.alertname }}
            {{- end }}

            {{ define "t4k.email.html" -}}
            <!DOCTYPE html>
            <html>
            <head>
              <style>
                body { font-family: Arial, sans-serif; }
                .alert { padding: 10px; margin: 10px 0; border-radius: 5px; }
                .firing { background-color: #ffebee; border-left: 4px solid #f44336; }
                .resolved { background-color: #e8f5e9; border-left: 4px solid #4caf50; }
                .label { font-weight: bold; color: #333; }
              </style>
            </head>
            <body>
              <h2>TrilioVault Alert Notification</h2>
              {{ range .Alerts }}
              <div class="alert {{ .Status }}">
                <p><span class="label">Alert:</span> {{ .Labels.alertname }}</p>
                <p><span class="label">Severity:</span> {{ .Labels.severity }}</p>
                <p><span class="label">Status:</span> {{ .Status }}</p>
                <p><span class="label">Namespace:</span> {{ .Labels.namespace }}</p>
                <p><span class="label">Summary:</span> {{ .Annotations.summary }}</p>
                <p><span class="label">Description:</span> {{ .Annotations.description }}</p>
              </div>
              {{ end }}
            </body>
            </html>
            {{- end }}
        
        config:
          enabled: true
          global:
            slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
          
          templates:
            - '/etc/alertmanager/*.tmpl'
          
          receivers:
            - name: 'default-receiver'
            - name: 'slack-t4k-alerts'
              slack_configs:
                - channel: '#t4k-alerts'
                  send_resolved: true
                  title: '{{ template "t4k.slack.title" . }}'
                  text: '{{ template "t4k.slack.text" . }}'
                  color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
            - name: 'email-t4k-alerts'
              email_configs:
                - to: '[email protected]'
                  send_resolved: true
                  headers:
                    Subject: '{{ template "t4k.email.subject" . }}'
                  html: '{{ template "t4k.email.html" . }}'
          
          route:
            group_by: ['alertname', 'namespace']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
            receiver: 'default-receiver'
            routes:
              - match:
                  severity: critical
                receiver: 'slack-t4k-alerts'
                continue: true
              - match:
                  severity: critical
                receiver: 'email-t4k-alerts'

Template Functions Reference

The following template functions are commonly used in Alertmanager templates:

Function

Description

Example

toUpper

Converts string to uppercase

{{ .Status | toUpper }}

toLower

Converts string to lowercase

{{ .Labels.severity | toLower }}

title

Converts string to title case

{{ .Labels.alertname | title }}

join

Joins list elements with separator

{{ .Labels.Values | join ", " }}

safeHtml

Marks string as safe HTML

{{ .Annotations.description | safeHtml }}

reReplaceAll

Regex replace

{{ reReplaceAll "(.*):(.*)" "$1" .Labels.instance }}

Template Variables

Common variables available in templates:

Variable

Description

.Status

Alert status ("firing" or "resolved")

.Alerts

List of all alerts in the group

.Alerts.Firing

List of currently firing alerts

.Alerts.Resolved

List of resolved alerts

.CommonLabels

Labels common to all alerts

.CommonAnnotations

Annotations common to all alerts

.ExternalURL

URL to Alertmanager

.GroupLabels

Labels used for grouping

Example: Using Kubernetes Secrets for Credentials

For production environments, it's recommended to store sensitive credentials (like Slack webhook URLs, SMTP passwords, or PagerDuty keys) in Kubernetes Secrets instead of hardcoding them in helm values.

Step 1: Create Kubernetes Secret

First, create a secret containing your sensitive credentials:

# Create secret with Slack webhook URL
kubectl create secret generic alertmanager-secrets \
  --namespace=<install_ns> \
  --from-literal=slack-webhook-url='https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK' \
  --from-literal=smtp-password='your-smtp-password' \
  --from-literal=pagerduty-key='YOUR_PAGERDUTY_SERVICE_KEY'

Or using a YAML manifest:

# alertmanager-secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-secrets
  namespace: <install_ns>
type: Opaque
stringData:
  slack-webhook-url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
  smtp-password: "your-smtp-password"
  pagerduty-key: "YOUR_PAGERDUTY_SERVICE_KEY"
  smtp-auth-identity: "[email protected]"

Apply the secret:

kubectl apply -f alertmanager-secrets.yaml

Step 2: Configure Alertmanager to Use Secret

Configure Alertmanager to mount the secret and reference credentials from environment variables or files:

# alertmanager-with-secrets-values.yaml
observability:
  enabled: true
  monitoring:
    prometheus:
      enabled: true
      alertmanager:
        enabled: true
        persistence:
          enabled: true
          size: 100Mi
        
        # Mount the secret as environment variables
        extraEnv:
          - name: SLACK_WEBHOOK_URL
            valueFrom:
              secretKeyRef:
                name: alertmanager-secrets
                key: slack-webhook-url
          - name: SMTP_PASSWORD
            valueFrom:
              secretKeyRef:
                name: alertmanager-secrets
                key: smtp-password
          - name: PAGERDUTY_KEY
            valueFrom:
              secretKeyRef:
                name: alertmanager-secrets
                key: pagerduty-key
        
        # Alternatively, mount secret as files
        extraSecretMounts:
          - name: alertmanager-secrets
            mountPath: /etc/alertmanager/secrets
            secretName: alertmanager-secrets
            readOnly: true
        
        config:
          enabled: true
          global:
            # Reference Slack webhook from environment variable
            slack_api_url_file: '/etc/alertmanager/secrets/slack-webhook-url'
            
            # SMTP configuration with password from secret
            smtp_smarthost: 'smtp.example.com:587'
            smtp_from: '[email protected]'
            smtp_auth_username: '[email protected]'
            smtp_auth_password_file: '/etc/alertmanager/secrets/smtp-password'
          
          receivers:
            - name: 'default-receiver'
            - name: 'slack-notifications'
              slack_configs:
                - channel: '#alerts'
                  send_resolved: true
                  # api_url can also be set per-receiver using file reference
                  api_url_file: '/etc/alertmanager/secrets/slack-webhook-url'
            - name: 'pagerduty-notifications'
              pagerduty_configs:
                - service_key_file: '/etc/alertmanager/secrets/pagerduty-key'
                  severity: 'critical'
            - name: 'email-notifications'
              email_configs:
                - to: '[email protected]'
                  send_resolved: true
          
          route:
            group_by: ['alertname', 'namespace']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
            receiver: 'default-receiver'
            routes:
              - match:
                  severity: critical
                receiver: 'slack-notifications'
                continue: true
              - match:
                  severity: critical
                receiver: 'pagerduty-notifications'
              - match:
                  severity: warning
                receiver: 'email-notifications'

Secret File Reference Options

Alertmanager supports _file suffix for many credential fields, which reads the value from a file:

Original Field

File Reference Field

Description

slack_api_url

slack_api_url_file

Global Slack webhook URL

api_url

api_url_file

Per-receiver Slack webhook URL

smtp_auth_password

smtp_auth_password_file

SMTP password

smtp_auth_identity

smtp_auth_identity_file

SMTP identity

smtp_auth_secret

smtp_auth_secret_file

SMTP secret

service_key

service_key_file

PagerDuty service key

routing_key

routing_key_file

PagerDuty routing key

token

token_file

Opsgenie/VictorOps token

url

url_file

Webhook URL

Example: Complete Setup with External Secrets Operator

For organizations using External Secrets Operator (ESO) to sync secrets from external secret managers (AWS Secrets Manager, HashiCorp Vault, etc.):

# external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: alertmanager-secrets
  namespace: <install_ns>
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: alertmanager-secrets
    creationPolicy: Owner
  data:
    - secretKey: slack-webhook-url
      remoteRef:
        key: /production/alertmanager/slack-webhook
    - secretKey: smtp-password
      remoteRef:
        key: /production/alertmanager/smtp-password
    - secretKey: pagerduty-key
      remoteRef:
        key: /production/alertmanager/pagerduty-key

Apply the ExternalSecret:

kubectl apply -f external-secret.yaml

The External Secrets Operator will automatically create and sync the alertmanager-secrets Kubernetes Secret from your external secret manager.

Security Best Practices:

Never commit secrets to version control
Use _file references instead of inline credentials
Rotate secrets regularly
Use RBAC to restrict access to secrets
Consider using sealed-secrets or external-secrets-operator for GitOps workflows

Example: Complete Observability Stack with Alertmanager

The following is a comprehensive example enabling the full observability stack with Alertmanager, custom alerting rules, and persistent storage:

# full-observability-values.yaml
observability:
  enabled: true
  name: "t4k-observability"
  
  logging:
    loki:
      enabled: true
      fullnameOverride: "loki"
      singleBinary:
        replicas: 1
        persistence:
          enabled: true
          size: 20Gi
    promtail:
      enabled: true
      fullnameOverride: "promtail"
  
  monitoring:
    prometheus:
      enabled: true
      fullnameOverride: "prometheus"
      server:
        enabled: true
        fullnameOverride: "prometheus-server"
        persistentVolume:
          enabled: true
          size: 10Gi
        retention: "30d"
      
      # Enable kube-state-metrics for Kubernetes metrics
      kube-state-metrics:
        enabled: true
      
      # Enable node-exporter for node-level metrics
      prometheus-node-exporter:
        enabled: true
      
      alertmanager:
        enabled: true
        replicaCount: 1
        persistence:
          enabled: true
          size: 100Mi
        
        # Alertmanager configuration
        config:
          enabled: true
          global:
            slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            resolve_timeout: 5m
          
          templates:
            - '/etc/alertmanager/*.tmpl'
          
          receivers:
            - name: 'default-receiver'
            - name: 'slack-critical'
              slack_configs:
                - channel: '#critical-alerts'
                  send_resolved: true
                  title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
                  text: >-
                    {{ range .Alerts }}
                    *Alert:* {{ .Annotations.summary }}
                    *Description:* {{ .Annotations.description }}
                    *Severity:* {{ .Labels.severity }}
                    *Namespace:* {{ .Labels.namespace }}
                    {{ end }}
            - name: 'slack-warning'
              slack_configs:
                - channel: '#warning-alerts'
                  send_resolved: true
          
          route:
            group_by: ['alertname', 'namespace', 'severity']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
            receiver: 'default-receiver'
            routes:
              - match:
                  severity: critical
                receiver: 'slack-critical'
              - match:
                  severity: warning
                receiver: 'slack-warning'
      
      # Custom alerting rules for T4K
      serverFiles:
        alerting_rules.yml:
          groups:
            - name: t4k-backup-alerts
              rules:
                # Alert when backup fails (metric value -1 indicates Failed/Error status)
                - alert: T4KBackupFailed
                  expr: trilio_backup_info == -1
                  for: 1m
                  labels:
                    severity: critical
                  annotations:
                    summary: "TrilioVault Backup Failed"
                    description: "Backup {{ $labels.backup }} has failed in namespace {{ $labels.resource_namespace }} on cluster {{ $labels.cluster }}"
                
                # Alert when backup is stuck in progress for too long
                - alert: T4KBackupStuck
                  expr: trilio_backup_info{status="InProgress"} == 0 and trilio_backup_status_percentage < 100
                  for: 60m
                  labels:
                    severity: warning
                  annotations:
                    summary: "TrilioVault Backup Stuck"
                    description: "Backup {{ $labels.backup }} has been in progress for more than 60 minutes in namespace {{ $labels.resource_namespace }}. Current progress: {{ $value }}%"
                
                # Alert when backup takes unusually long time
                - alert: T4KBackupDurationHigh
                  expr: trilio_backup_completed_duration > 120
                  for: 5m
                  labels:
                    severity: warning
                  annotations:
                    summary: "TrilioVault Backup Duration High"
                    description: "Backup {{ $labels.backup }} took {{ $value }} minutes to complete in namespace {{ $labels.resource_namespace }}"
            
            - name: t4k-restore-alerts
              rules:
                # Alert when restore fails
                - alert: T4KRestoreFailed
                  expr: trilio_restore_info == -1
                  for: 1m
                  labels:
                    severity: critical
                  annotations:
                    summary: "TrilioVault Restore Failed"
                    description: "Restore {{ $labels.restore }} has failed in namespace {{ $labels.resource_namespace }} on cluster {{ $labels.cluster }}"
                
                # Alert when restore is stuck
                - alert: T4KRestoreStuck
                  expr: trilio_restore_info{status="InProgress"} == 0 and trilio_restore_status_percentage < 100
                  for: 60m
                  labels:
                    severity: warning
                  annotations:
                    summary: "TrilioVault Restore Stuck"
                    description: "Restore {{ $labels.restore }} has been in progress for more than 60 minutes in namespace {{ $labels.resource_namespace }}"
            
            - name: t4k-target-alerts
              rules:
                # Alert when target is unavailable (metric value 0 indicates unavailable)
                - alert: T4KTargetUnavailable
                  expr: trilio_target_info == 0
                  for: 5m
                  labels:
                    severity: critical
                  annotations:
                    summary: "TrilioVault Target Unavailable"
                    description: "Target {{ $labels.target }} is not available in namespace {{ $labels.resource_namespace }}. Status: {{ $labels.status }}"
                
                # Alert when target storage exceeds threshold (example: 500GB)
                - alert: T4KTargetStorageHigh
                  expr: trilio_target_storage > 500000000000
                  for: 10m
                  labels:
                    severity: warning
                  annotations:
                    summary: "TrilioVault Target Storage High"
                    description: "Target {{ $labels.target }} storage usage is {{ $value | humanize1024 }}B in namespace {{ $labels.resource_namespace }}"
            
            - name: t4k-backupplan-alerts
              rules:
                # Alert when BackupPlan has no successful backups (not protected)
                - alert: T4KBackupPlanNotProtected
                  expr: trilio_backupplan_info{protected="False"} == 1
                  for: 24h
                  labels:
                    severity: warning
                  annotations:
                    summary: "TrilioVault BackupPlan Not Protected"
                    description: "BackupPlan {{ $labels.backupplan }} in namespace {{ $labels.resource_namespace }} has no successful backups for more than 24 hours"
                
                # Alert when BackupPlan fails
                - alert: T4KBackupPlanFailed
                  expr: trilio_backupplan_info == -1
                  for: 5m
                  labels:
                    severity: critical
                  annotations:
                    summary: "TrilioVault BackupPlan Failed"
                    description: "BackupPlan {{ $labels.backupplan }} has failed in namespace {{ $labels.resource_namespace }}"
            
            - name: t4k-continuous-restore-alerts
              rules:
                # Alert when ContinuousRestorePlan fails
                - alert: T4KContinuousRestorePlanFailed
                  expr: trilio_continuousrestoreplan_info == -1
                  for: 5m
                  labels:
                    severity: critical
                  annotations:
                    summary: "TrilioVault ContinuousRestorePlan Failed"
                    description: "ContinuousRestorePlan {{ $labels.continuousrestoreplan }} has failed on cluster {{ $labels.cluster }}"
                
                # Alert when ConsistentSet fails
                - alert: T4KConsistentSetFailed
                  expr: trilio_consistentset_info == -1
                  for: 5m
                  labels:
                    severity: critical
                  annotations:
                    summary: "TrilioVault ConsistentSet Failed"
                    description: "ConsistentSet {{ $labels.consistentset }} has failed for ContinuousRestorePlan {{ $labels.continuousrestoreplan }}"
  
  visualization:
    grafana:
      enabled: true
      fullnameOverride: "grafana"
      adminPassword: "your-secure-password"
      service:
        type: ClusterIP

Install with:

helm install tvm triliovault-operator/k8s-triliovault-operator -f full-observability-values.yaml

Verifying Alertmanager Installation

After installation, verify that Alertmanager is running:

kubectl get pods -n <install_ns> | grep alertmanager

# Expected output:
# <release name>-alertmanager-0                                             1/1     Running            0          2m

Access the Alertmanager UI:

# Port forward to access locally
kubectl port-forward svc/alertmanager -n <install_ns> 9093:9093

Then open your browser to http://localhost:9093 to view the Alertmanager UI.

T4K Metrics Reference

TrilioVault for Kubernetes (T4K) exports Prometheus metrics through the k8s-triliovault-exporter component. These metrics can be used for monitoring, alerting, and dashboarding.

Metric Value Conventions

For status-based metrics (*_info metrics), the numeric value indicates the status:

Status

Metric Value

Description

Available / Completed

1

Resource is healthy/successful

Failed / Error

-1

Resource has failed

InProgress

0

Operation is in progress

Empty/Unknown

-2

Status not yet determined

Available Metrics

Backup Metrics

Metric Name

Description

Key Labels

trilio_backup_info

Backup status and metadata

backup, backupplan, resource_namespace, status, target, backup_type, start_ts, completion_ts, size, cluster, kind, hook, backupscope, applicationtype

trilio_backup_storage

Backup size in bytes

backup, backupplan, resource_namespace, status, target, backup_type, cluster, kind

trilio_backup_status_percentage

Backup progress (0-100)

backup, backupplan, resource_namespace, status, target, backup_type, cluster, kind

trilio_backup_completed_duration

Backup duration in minutes (only for completed backups)

backup, backupplan, resource_namespace, status, target, backup_type, cluster, kind

trilio_backup_metadata_info

Detailed backup object metadata

backup, backupplan, resource_namespace, status, objecttype, objectname, backupscope, applicationtype, apiversion, apigroup, object_resource

Restore Metrics

Metric Name

Description

Key Labels

trilio_restore_info

Restore status and metadata

restore, backup, backupplan, resource_namespace, status, target, size, start_ts, completion_ts, cluster, kind

trilio_restore_status_percentage

Restore progress (0-100)

restore, backup, resource_namespace, status, target, cluster, kind

trilio_restore_completed_duration

Restore duration in minutes (only for completed restores)

restore, backup, resource_namespace, status, target, cluster, kind

Target Metrics

Metric Name

Description

Key Labels

trilio_target_info

Target availability status (1=available, 0=unavailable)

target, resource_namespace, status, vendor, vendorType, browsing, eventTarget, size, threshold_capacity, creation_ts, cluster

trilio_target_storage

Storage used by target in bytes

target, resource_namespace, status, vendor, vendorType, threshold_capacity, creation_ts, cluster

BackupPlan Metrics

Metric Name

Description

Key Labels

trilio_backupplan_info

BackupPlan status and summary

backupplan, resource_namespace, status, target, protected, backup_count, lastprotected, backupscope, applicationtype, creation_ts, cluster, kind

trilio_backupplan_crstatus

BackupPlan continuous restore status

backupplan, continuousrestoreinstance, continuousrestore_enabled, continuousrestoreplan, consistentset_count, cr_status, cluster, kind

Continuous Restore Metrics

Metric Name

Description

Key Labels

trilio_continuousrestoreplan_info

ContinuousRestorePlan status

continuousrestoreplan, continuousrestorepolicy, target, consistentsetcount, sourcebackupplan, sourceinstanceinfo, status, creation_ts, cluster, kind

trilio_consistentset_info

ConsistentSet status and details

consistentset, consistentsetscope, continuousrestoreplan, sourcebackupplan, sourceinstanceinfo, backupName, backupNamespace, backupStatus, backupSize, status, size, cluster, kind

trilio_consistentset_status_percentage

ConsistentSet progress (0-100)

consistentset, consistentsetscope, continuousrestoreplan, sourcebackupplan, sourceinstanceinfo, backupName, status, cluster, kind

trilio_consistentset_completed_duration

ConsistentSet duration in minutes

consistentset, consistentsetscope, continuousrestoreplan, sourcebackupplan, sourceinstanceinfo, backupName, status, cluster, kind

Example PromQL Queries

# Count of failed backups by namespace
count(trilio_backup_info == -1) by (resource_namespace)

# List all successful backups
trilio_backup_info{status="Available"}

# Total backup storage per target
sum(trilio_backup_storage) by (target)

# Average backup duration by backupplan
avg(trilio_backup_completed_duration) by (backupplan)

# Unavailable targets
trilio_target_info == 0

# BackupPlans without successful backups
trilio_backupplan_info{protected="False"}

# Failed restores in last 24 hours
trilio_restore_info == -1

# ContinuousRestorePlan replication lag (ConsistentSets in progress)
trilio_consistentset_info{status="InProgress"}

Viewing Alert Rules in Grafana

Once alert rules are configured, you can view and manage them directly from the Grafana UI. Navigate to Alerting > Alert rules to see all configured rules, their current state, and firing alerts.

The Alert rules page shows:

Data source-managed rules: Alert rules defined in Prometheus configuration (e.g., /etc/config/alerting_rules.yml)
State: Current state of each alert (Firing, Normal, Pending, Recovering)
Health: Health status of the alert rule
Summary: Brief description of what the alert monitors

You can filter alerts by data source, dashboard, state, rule type, health status, and contact point.

View Logs From T4K UI

Login to T4K UI with preferred authentication
Select "Launch Event Viewer" on any required service or application

On click on "Launch Event Viewer" option, user will be redirected to Logs visibility page.

Accessing Grafana Dashboards

Grafana Endpoint : http://<T4K_IP>/grafana 

Login with default grafana credentials.
username: admin
password: admin123

if a custom path is configured then:

Grafana Endpoint : http://<T4K_IP>/<custom-path>/grafana

Additional Monitoring Components

Kube-State-Metrics

Kube-state-metrics generates metrics about the state of Kubernetes objects. Enable it to get comprehensive cluster metrics:

observability:
  enabled: true
  monitoring:
    prometheus:
      kube-state-metrics:
        enabled: true

Node Exporter

Node Exporter exposes hardware and OS metrics from the host machines:

observability:
  enabled: true
  monitoring:
    prometheus:
      prometheus-node-exporter:
        enabled: true

Pushgateway

Pushgateway allows ephemeral and batch jobs to expose metrics to Prometheus:

observability:
  enabled: true
  monitoring:
    prometheus:
      prometheus-pushgateway:
        enabled: true

PreviousObservability of Trilio with Openshift Monitoring NextModifying Default T4K Configuration

Last updated 1 month ago

hashtagIntroduction

hashtagT4K Installation with Observability using Trilio Operator

hashtagObservability Stack Configurable Parameters

hashtagEnabling ServiceMonitor for T4K Metrics

hashtagEnabling ServiceMonitor via Helm

hashtagServiceMonitor Configuration Parameters

hashtagVerifying Metrics Collection

hashtagAlertmanager Configuration

hashtagEnabling Alertmanager

hashtagAlertmanager Configurable Parameters

hashtagMinimal Alertmanager Configuration

hashtagExample: Alertmanager with Slack Notifications

hashtagExample: Alertmanager with Email Notifications

hashtagExample: Alertmanager with PagerDuty Integration

hashtagExample: Alertmanager with Custom Templates

hashtagTemplate Functions Reference

hashtagTemplate Variables

hashtagExample: Using Kubernetes Secrets for Credentials

hashtagStep 1: Create Kubernetes Secret

hashtagStep 2: Configure Alertmanager to Use Secret

hashtagSecret File Reference Options

hashtagExample: Complete Setup with External Secrets Operator

hashtagExample: Complete Observability Stack with Alertmanager

hashtagVerifying Alertmanager Installation

hashtagT4K Metrics Reference

hashtagMetric Value Conventions

hashtagAvailable Metrics

hashtagBackup Metrics

hashtagRestore Metrics

hashtagTarget Metrics

hashtagBackupPlan Metrics

hashtagContinuous Restore Metrics

hashtagExample PromQL Queries

hashtagViewing Alert Rules in Grafana

hashtagView Logs From T4K UI

hashtagAccessing Grafana Dashboards

hashtagAdditional Monitoring Components

hashtagKube-State-Metrics

hashtagNode Exporter

hashtagPushgateway

Introduction

T4K Installation with Observability using Trilio Operator

Observability Stack Configurable Parameters

Enabling ServiceMonitor for T4K Metrics

Enabling ServiceMonitor via Helm

ServiceMonitor Configuration Parameters

Verifying Metrics Collection

Alertmanager Configuration

Enabling Alertmanager

Alertmanager Configurable Parameters

Minimal Alertmanager Configuration

Example: Alertmanager with Slack Notifications

Example: Alertmanager with Email Notifications

Example: Alertmanager with PagerDuty Integration

Example: Alertmanager with Custom Templates

Template Functions Reference

Template Variables

Example: Using Kubernetes Secrets for Credentials

Step 1: Create Kubernetes Secret

Step 2: Configure Alertmanager to Use Secret

Secret File Reference Options

Example: Complete Setup with External Secrets Operator

Example: Complete Observability Stack with Alertmanager

Verifying Alertmanager Installation

T4K Metrics Reference

Metric Value Conventions

Available Metrics

Backup Metrics

Restore Metrics

Target Metrics

BackupPlan Metrics

Continuous Restore Metrics

Example PromQL Queries

Viewing Alert Rules in Grafana

View Logs From T4K UI

Accessing Grafana Dashboards

Additional Monitoring Components

Kube-State-Metrics

Node Exporter

Pushgateway