# Observability of Trilio with OpenShift Monitoring

## Introduction

OpenShift Container Platform includes a built-in monitoring stack based on Prometheus that can be extended to monitor user workloads. This guide explains how to configure Trilio for Kubernetes (T4K) observability using OpenShift's native monitoring capabilities.

For more information about OpenShift monitoring, see the [OpenShift Container Platform Monitoring Documentation](https://docs.redhat.com/en/documentation/monitoring_stack_for_red_hat_openshift/4.19/html/configuring_user_workload_monitoring).

## Prerequisites

* OpenShift Container Platform 4.x cluster
* Trilio for Kubernetes installed
* Cluster administrator access (for initial setup)
* User with `monitoring-edit` or `monitoring-rules-edit` role (for configuring alerts)

## Enabling User Workload Monitoring

Before you can monitor T4K metrics, you must enable user workload monitoring on your OpenShift cluster.
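Applying a fresh `cluster-monitoring-config` ConfigMap replaces whatever `config.yaml` the cluster already has, so it can help to inspect the existing configuration first. The following is a minimal sketch, not an official procedure: the function names (`get_cluster_monitoring_config`, `uwm_enabled`, `check_uwm_config`) are hypothetical helpers, and the text match on `enableUserWorkload: true` is a simple heuristic rather than a full YAML parse.

```bash
#!/usr/bin/env bash

# Print config.yaml from the cluster-monitoring-config ConfigMap.
# Prints nothing if the ConfigMap (or the key) does not exist.
get_cluster_monitoring_config() {
  oc -n openshift-monitoring get configmap cluster-monitoring-config \
    -o jsonpath='{.data.config\.yaml}' 2>/dev/null
}

# Succeed when the given config text already sets enableUserWorkload: true.
# Assumption: a plain line match is good enough for this check.
uwm_enabled() {
  grep -Eq '^[[:space:]]*enableUserWorkload:[[:space:]]*true[[:space:]]*$' <<<"$1"
}

# Hypothetical helper: summarize what to do before applying the ConfigMap.
check_uwm_config() {
  local cfg
  cfg="$(get_cluster_monitoring_config)"
  if [ -z "$cfg" ]; then
    echo "no existing config: creating the ConfigMap is safe"
  elif uwm_enabled "$cfg"; then
    echo "user workload monitoring is already enabled"
  else
    echo "config exists: merge enableUserWorkload: true into it"
  fi
}
```

After sourcing the file, run `check_uwm_config` while logged in as a cluster administrator. If a configuration already exists, editing it with `oc -n openshift-monitoring edit configmap cluster-monitoring-config` preserves the other settings.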
{% stepper %}
{% step %}
#### Enable User Workload Monitoring: Create ConfigMap

Create or edit the `cluster-monitoring-config` ConfigMap in the `openshift-monitoring` namespace:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```

Apply the configuration:

```bash
oc apply -f cluster-monitoring-config.yaml
```
{% endstep %}

{% step %}
#### Verify User Workload Monitoring is Running

Check that the user workload monitoring components are running:

```bash
oc get pods -n openshift-user-workload-monitoring

# Expected output:
# NAME                                   READY   STATUS    RESTARTS   AGE
# prometheus-operator-xxxxxxxxxx-xxxxx   2/2     Running   0          5m
# prometheus-user-workload-0             6/6     Running   0          5m
# prometheus-user-workload-1             6/6     Running   0          5m
# thanos-ruler-user-workload-0           4/4     Running   0          5m
# thanos-ruler-user-workload-1           4/4     Running   0          5m
```
{% endstep %}

{% step %}
#### Enable Alertmanager and AlertmanagerConfig

To use alerting with T4K, enable Alertmanager and allow users to create `AlertmanagerConfig` resources. There are two ConfigMaps to configure.
**Configure Cluster Monitoring (openshift-monitoring)**

Update the `cluster-monitoring-config` ConfigMap to enable user workload monitoring and allow `AlertmanagerConfig` resources to route to the platform Alertmanager:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    alertmanagerMain:
      enableUserAlertmanagerConfig: true
```

Apply the configuration:

```bash
oc apply -f cluster-monitoring-config.yaml
```

**Configure User Workload Monitoring (openshift-user-workload-monitoring)**

To enable a dedicated Alertmanager for user workload monitoring (separate from the platform Alertmanager), create the `user-workload-monitoring-config` ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
```

Apply the configuration:

```bash
oc apply -f user-workload-monitoring-config.yaml
```

{% hint style="info" %}
**Choosing between Platform and User Workload Alertmanager:**
{% endhint %}

| Configuration | Alertmanager Location | Use Case |
| ------------- | --------------------- | -------- |
| `cluster-monitoring-config` with `alertmanagerMain.enableUserAlertmanagerConfig: true` | Platform (`openshift-monitoring`) | Route user alerts to the shared platform Alertmanager |
| `user-workload-monitoring-config` with `alertmanager.enabled: true` | User Workload (`openshift-user-workload-monitoring`) | Dedicated Alertmanager for user workloads, separate from platform alerts |


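To see which of the two configurations is currently in effect, the `config.yaml` contents of both ConfigMaps can be compared. The sketch below is illustrative only: `has_true`, `alert_routing_target`, and `current_alert_routing_target` are hypothetical helper names, and the line-based match does not verify that `enabled: true` sits under the `alertmanager:` key.

```bash
#!/usr/bin/env bash

# Succeed when the config text ($1) contains a line "<key>: true" for key $2.
# Assumption: a line match stands in for real YAML parsing.
has_true() {
  grep -Eq "^[[:space:]]*$2:[[:space:]]*true[[:space:]]*\$" <<<"$1"
}

# Report where AlertmanagerConfig resources in user namespaces would route.
# $1: config.yaml text from cluster-monitoring-config
# $2: config.yaml text from user-workload-monitoring-config
alert_routing_target() {
  local platform_cfg="$1" uwm_cfg="$2"
  if has_true "$uwm_cfg" enabled && has_true "$uwm_cfg" enableAlertmanagerConfig; then
    echo "user-workload"
  elif has_true "$platform_cfg" enableUserAlertmanagerConfig; then
    echo "platform"
  else
    echo "none"
  fi
}

# Hypothetical wrapper that reads both ConfigMaps from the cluster.
current_alert_routing_target() {
  local platform_cfg uwm_cfg
  platform_cfg="$(oc -n openshift-monitoring get configmap \
    cluster-monitoring-config -o jsonpath='{.data.config\.yaml}' 2>/dev/null)"
  uwm_cfg="$(oc -n openshift-user-workload-monitoring get configmap \
    user-workload-monitoring-config -o jsonpath='{.data.config\.yaml}' 2>/dev/null)"
  alert_routing_target "$platform_cfg" "$uwm_cfg"
}
```

The user workload check comes first because a dedicated user workload Alertmanager takes over routing for user namespaces when both configurations are enabled.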
{% hint style="info" %}
You can use either or both configurations depending on your requirements.
{% endhint %}

{% hint style="warning" %}
**Note**: If you enable the user workload Alertmanager (`user-workload-monitoring-config`), `AlertmanagerConfig` resources in user namespaces will route to the user workload Alertmanager, not the platform Alertmanager.
{% endhint %}
{% endstep %}

{% step %}
#### Verify Alertmanager is Running

**Platform Alertmanager (openshift-monitoring)**

Check that the platform Alertmanager pods are running:

```bash
oc get pods -n openshift-monitoring -l app.kubernetes.io/name=alertmanager

# Expected output:
# NAME                  READY   STATUS    RESTARTS   AGE
# alertmanager-main-0   6/6     Running   0          5m
# alertmanager-main-1   6/6     Running   0          5m
```

Access the platform Alertmanager UI:

```bash
# Port-forward to access the UI
oc port-forward -n openshift-monitoring svc/alertmanager-main 9093:9093

# Open http://localhost:9093 in your browser
```

**User Workload Alertmanager (openshift-user-workload-monitoring)**

If you enabled the user workload Alertmanager, verify it's running:

```bash
oc get pods -n openshift-user-workload-monitoring -l app.kubernetes.io/name=alertmanager

# Expected output:
# NAME                           READY   STATUS    RESTARTS   AGE
# alertmanager-user-workload-0   6/6     Running   0          5m
# alertmanager-user-workload-1   6/6     Running   0          5m
```

Access the user workload Alertmanager UI:

```bash
# Port-forward to access the UI
oc port-forward -n openshift-user-workload-monitoring svc/alertmanager-user-workload 9093:9093

# Open http://localhost:9093 in your browser
```

{% hint style="info" %}
**Tip**: You can check which Alertmanager your alerts are routing to by viewing the alert in the OpenShift Console under **Observe > Alerting > Alerts** and checking the source.
{% endhint %}
{% endstep %}

{% step %}
#### Grant User Permissions for Alert Routing (Optional)

To allow non-admin users to create `AlertmanagerConfig` resources, grant them the `alert-routing-edit` role in the target namespace:

```bash
# Grant alert-routing-edit to a specific user
oc adm policy add-role-to-user alert-routing-edit <user> -n <namespace>

# Or grant to a group
oc adm policy add-role-to-group alert-routing-edit <group> -n <namespace>
```

Available roles for monitoring:

| Role                    | Description                                                            |
| ----------------------- | ---------------------------------------------------------------------- |
| `monitoring-rules-view` | View PrometheusRule and AlertmanagerConfig resources                   |
| `monitoring-rules-edit` | Create/modify PrometheusRule resources                                 |
| `monitoring-edit`       | Create/modify ServiceMonitor, PodMonitor, and PrometheusRule resources |
| `alert-routing-edit`    | Create/modify AlertmanagerConfig resources                             |
{% endstep %}
{% endstepper %}

## Configuring T4K Metrics Collection

### Option 1: Enable ServiceMonitor via TVM Custom Resource (Recommended)

On OpenShift, enable Prometheus scraping for T4K by setting `exporter.serviceMonitor.enabled: true` in the `TrilioVaultManager` (TVM) Custom Resource.
Example TVM spec (edit your existing TVM and apply):

```yaml
# You can also update the TVM CR from the OpenShift OperatorHub console
apiVersion: triliovault.trilio.io/v1
kind: TrilioVaultManager
metadata:
  name: tvm
  namespace: trilio-system
spec:
  tvkInstanceName: tvk
  applicationScope: Cluster
  componentConfiguration:
    exporter:
      serviceMonitor:
        enabled: true
```

This creates:

* A **Service** exposing the exporter metrics on port 8080 (created by the exporter)
* A **ServiceMonitor** that configures Prometheus to scrape the metrics

### Option 2: Create ServiceMonitor Manually

If you prefer to create the ServiceMonitor manually or need custom configuration, apply the following, replacing `<namespace>` with the namespace where T4K is installed:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: k8s-triliovault-exporter-service
  namespace: <namespace>
  labels:
    app: k8s-triliovault-exporter
spec:
  ports:
    - name: web
      protocol: TCP
      port: 8080
      targetPort: 8080
  selector:
    app: k8s-triliovault-exporter
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: k8s-triliovault-exporter
  namespace: <namespace>
  labels:
    app: k8s-triliovault-exporter
spec:
  selector:
    matchLabels:
      app: k8s-triliovault-exporter
  endpoints:
    - port: web
      interval: 30s
      path: /metrics
      scheme: http
```

Apply the ServiceMonitor:

```bash
oc apply -f t4k-servicemonitor.yaml
```

{% hint style="info" %}
When `exporter.serviceMonitor.enabled` is set to `false` (default), the exporter pod includes Prometheus scrape annotations (`prometheus.io/scrape: "true"`). If your Prometheus is configured to discover targets via annotations, metrics will be collected automatically without a ServiceMonitor.
{% endhint %}

### Verifying Metrics Collection

After applying the ServiceMonitor, verify that metrics are being collected:

```bash
# Port-forward to the Thanos Querier
oc port-forward -n openshift-monitoring svc/thanos-querier 9090:9090

# In another terminal, query for T4K metrics
curl -s 'http://localhost:9090/api/v1/query?query=trilio_backup_info' | jq .
```

You can also verify from the OpenShift web console by navigating to **Observe > Metrics** and querying for `trilio_backup_info`.
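On some clusters the Thanos Querier may only accept authenticated HTTPS requests, in which case an unauthenticated port-forward query can fail. As an alternative sketch, the query can go through the `thanos-querier` route with a bearer token. This assumes the default route name in `openshift-monitoring` and a logged-in user who is allowed to query metrics (for example through the `cluster-monitoring-view` cluster role). `t4k_query` is a hypothetical helper name, and `-k` skips TLS verification for brevity only.

```bash
#!/usr/bin/env bash

# Run a PromQL instant query against the Thanos Querier route.
# $1: PromQL expression, for example trilio_backup_info
t4k_query() {
  local query="$1" token host
  # Token of the currently logged-in user
  token="$(oc whoami -t)"
  # External hostname of the thanos-querier route (assumed default name)
  host="$(oc -n openshift-monitoring get route thanos-querier \
    -o jsonpath='{.spec.host}')"
  # -G sends the URL-encoded query as a GET parameter
  curl -sk -G -H "Authorization: Bearer ${token}" \
    --data-urlencode "query=${query}" \
    "https://${host}/api/v1/query"
}
```

For example, `t4k_query 'trilio_backup_info == -1' | jq .` would list failed backups, provided the exporter is being scraped.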

## Configuring T4K Alerting Rules

OpenShift uses `PrometheusRule` resources to define alerting rules. Create alerting rules for T4K in the namespace where T4K is installed (`trilio-system` by default).

### T4K Alerting Rules

Create a file named `t4k-prometheus-rules.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: t4k-alerting-rules
  namespace: trilio-system
  labels:
    app: k8s-triliovault
spec:
  groups:
    - name: t4k-backup-alerts
      rules:
        # Alert when backup fails (metric value -1 indicates Failed/Error status)
        - alert: T4KBackupFailed
          expr: trilio_backup_info == -1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "TrilioVault Backup Failed"
            description: "Backup {{ $labels.backup }} has failed in namespace {{ $labels.resource_namespace }} on cluster {{ $labels.cluster }}"

        # Alert when backup is stuck in progress for too long.
        # The percentage metric is on the left of "and" so that {{ $value }} reports the progress.
        - alert: T4KBackupStuck
          expr: trilio_backup_status_percentage < 100 and trilio_backup_info{status="InProgress"} == 0
          for: 60m
          labels:
            severity: warning
          annotations:
            summary: "TrilioVault Backup Stuck"
            description: "Backup {{ $labels.backup }} has been in progress for more than 60 minutes in namespace {{ $labels.resource_namespace }}. Current progress: {{ $value }}%"

        # Alert when backup takes unusually long time
        - alert: T4KBackupDurationHigh
          expr: trilio_backup_completed_duration > 120
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "TrilioVault Backup Duration High"
            description: "Backup {{ $labels.backup }} took {{ $value }} minutes to complete in namespace {{ $labels.resource_namespace }}"

    - name: t4k-restore-alerts
      rules:
        # Alert when restore fails
        - alert: T4KRestoreFailed
          expr: trilio_restore_info == -1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "TrilioVault Restore Failed"
            description: "Restore {{ $labels.restore }} has failed in namespace {{ $labels.resource_namespace }} on cluster {{ $labels.cluster }}"

        # Alert when restore is stuck
        - alert: T4KRestoreStuck
          expr: trilio_restore_info{status="InProgress"} == 0 and trilio_restore_status_percentage < 100
          for: 60m
          labels:
            severity: warning
          annotations:
            summary: "TrilioVault Restore Stuck"
            description: "Restore {{ $labels.restore }} has been in progress for more than 60 minutes in namespace {{ $labels.resource_namespace }}"

    - name: t4k-target-alerts
      rules:
        # Alert when target is unavailable (metric value 0 indicates unavailable)
        - alert: T4KTargetUnavailable
          expr: trilio_target_info == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "TrilioVault Target Unavailable"
            description: "Target {{ $labels.target }} is not available in namespace {{ $labels.resource_namespace }}. Status: {{ $labels.status }}"

        # Alert when target storage exceeds threshold (example: 500GB)
        - alert: T4KTargetStorageHigh
          expr: trilio_target_storage > 500000000000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "TrilioVault Target Storage High"
            description: "Target {{ $labels.target }} storage usage is {{ $value | humanize1024 }}B in namespace {{ $labels.resource_namespace }}"

    - name: t4k-backupplan-alerts
      rules:
        # Alert when BackupPlan has no successful backups (not protected)
        - alert: T4KBackupPlanNotProtected
          expr: trilio_backupplan_info{protected="False"} == 1
          for: 24h
          labels:
            severity: warning
          annotations:
            summary: "TrilioVault BackupPlan Not Protected"
            description: "BackupPlan {{ $labels.backupplan }} in namespace {{ $labels.resource_namespace }} has no successful backups for more than 24 hours"

        # Alert when BackupPlan fails
        - alert: T4KBackupPlanFailed
          expr: trilio_backupplan_info == -1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "TrilioVault BackupPlan Failed"
            description: "BackupPlan {{ $labels.backupplan }} has failed in namespace {{ $labels.resource_namespace }}"

    - name: t4k-continuous-restore-alerts
      rules:
        # Alert when ContinuousRestorePlan fails
        - alert: T4KContinuousRestorePlanFailed
          expr: trilio_continuousrestoreplan_info == -1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "TrilioVault ContinuousRestorePlan Failed"
            description: "ContinuousRestorePlan {{ $labels.continuousrestoreplan }} has failed on cluster {{ $labels.cluster }}"

        # Alert when ConsistentSet fails
        - alert: T4KConsistentSetFailed
          expr: trilio_consistentset_info == -1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "TrilioVault ConsistentSet Failed"
            description: "ConsistentSet {{ $labels.consistentset }} has failed for ContinuousRestorePlan {{ $labels.continuousrestoreplan }}"
```

Apply the alerting rules:

```bash
oc apply -f t4k-prometheus-rules.yaml
```

### Verifying Alerting Rules

Check that the alerting rules are loaded:

```bash
oc get prometheusrules -n <namespace>

# View the rule details
oc describe prometheusrule t4k-alerting-rules -n <namespace>
```

You can also view the alerts in the OpenShift web console by navigating to **Observe > Alerting > Alerting rules**.

## Configuring Alert Routing

OpenShift supports `AlertmanagerConfig` resources for configuring alert routing in user workloads. This allows you to define custom receivers and routing rules for T4K alerts.

### Prerequisites for Alert Routing

Ensure your user has the `alert-routing-edit` role in the namespace where you create the `AlertmanagerConfig`:

```bash
oc adm policy add-role-to-user alert-routing-edit <user> -n <namespace>
```

### Example: AlertmanagerConfig with Slack Notifications

Create a file named `t4k-alertmanager-config.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-alert-routing
  namespace: <namespace>
  labels:
    app: k8s-triliovault
spec:
  route:
    receiver: default
    groupBy:
      - alertname
      - namespace
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    routes:
      - receiver: slack-critical
        matchers:
          - name: severity
            value: critical
            matchType: "="
      - receiver: slack-warning
        matchers:
          - name: severity
            value: warning
            matchType: "="
  receivers:
    - name: default
    - name: slack-critical
      slackConfigs:
        - apiURL:
            name: slack-webhook-secret
            key: webhook-url
          channel: '#t4k-critical-alerts'
          sendResolved: true
          title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
          text: |-
            {{ range .Alerts }}
            *Alert:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Severity:* {{ .Labels.severity }}
            *Namespace:* {{ .Labels.resource_namespace }}
            {{ end }}
    - name: slack-warning
      slackConfigs:
        - apiURL:
            name: slack-webhook-secret
            key: webhook-url
          channel: '#t4k-warning-alerts'
          sendResolved: true
```

### Creating the Slack Webhook Secret

Store your Slack webhook URL in a Kubernetes Secret:

```bash
oc create secret generic slack-webhook-secret \
  --from-literal=webhook-url='https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK' \
  -n <namespace>
```

Apply the AlertmanagerConfig:

```bash
oc apply -f t4k-alertmanager-config.yaml
```

### Example: AlertmanagerConfig with Email Notifications

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-email-alerts
  namespace: <namespace>
  labels:
    app: k8s-triliovault
spec:
  route:
    receiver: email-receiver
    groupBy:
      - alertname
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
  receivers:
    - name: email-receiver
      emailConfigs:
        - to: 'backup-team@example.com'
          from: 'alertmanager@example.com'
          smarthost: 'smtp.example.com:587'
          authUsername: 'alertmanager@example.com'
          authPassword:
            name: smtp-secret
            key: password
          sendResolved: true
```

Create the SMTP password secret:

```bash
oc create secret generic smtp-secret \
  --from-literal=password='your-smtp-password' \
  -n <namespace>
```

### Example: AlertmanagerConfig with PagerDuty

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-pagerduty-alerts
  namespace: <namespace>
  labels:
    app: k8s-triliovault
spec:
  route:
    receiver: pagerduty-critical
    groupBy:
      - alertname
      - namespace
    routes:
      - receiver: pagerduty-critical
        matchers:
          - name: severity
            value: critical
            matchType: "="
  receivers:
    - name: pagerduty-critical
      pagerdutyConfigs:
        - serviceKey:
            name: pagerduty-secret
            key: service-key
          severity: critical
          description: '{{ .CommonAnnotations.summary }}'
          details:
            - key: namespace
              value: '{{ .CommonLabels.resource_namespace }}'
            - key: alertname
              value: '{{ .CommonLabels.alertname }}'
```

Create the PagerDuty secret:

```bash
oc create secret generic pagerduty-secret \
  --from-literal=service-key='YOUR_PAGERDUTY_SERVICE_KEY' \
  -n <namespace>
```

## Custom Notification Templates

OpenShift Alertmanager supports custom notification templates for email and Slack messages. This section explains how to configure T4K-specific templates with rich formatting.
Use inline templates directly in your `AlertmanagerConfig` for full control over notification formatting:

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-custom-alerts
  namespace: <namespace>
  labels:
    app: k8s-triliovault
spec:
  route:
    receiver: t4k-alerts
    groupBy:
      - alertname
      - backup
      - namespace
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    matchers:
      - name: alertname
        matchType: "=~"
        value: "T4K.*"
  receivers:
    - name: t4k-alerts
      emailConfigs:
        - to: 'backup-team@example.com'
          from: 't4k-alerts@example.com'
          smarthost: 'smtp.example.com:587'
          authUsername: 'user'
          authPassword:
            name: t4k-smtp-secret
            key: password
          sendResolved: false
          # Custom email subject
          headers:
            - key: Subject
              value: '[{{ .Status | toUpper }}] TrilioVault: {{ .CommonLabels.alertname }}'
          # Custom HTML email body
          html: |
            {{ range .Alerts }}
            <table>
              <tr><th>Property</th><th>Value</th></tr>
              <tr><td>Backup</td><td>{{ .Labels.backup }}</td></tr>
              <tr><td>Kind</td><td>{{ .Labels.kind }}</td></tr>
              <tr><td>BackupPlan</td><td>{{ .Labels.backupplan }}</td></tr>
              <tr><td>Namespace</td><td>{{ .Labels.namespace }}</td></tr>
              <tr><td>Target</td><td>{{ .Labels.target }}</td></tr>
              <tr><td>Status</td><td>{{ .Labels.status }}</td></tr>
              <tr><td>Cluster</td><td>{{ .Labels.cluster }}</td></tr>
            </table>
            {{ end }}
      slackConfigs:
        - apiURL:
            name: t4k-slack-secret
            key: webhook-url
          channel: '#t4k-alerts'
          sendResolved: false
          # Custom Slack title with emoji indicators
          title: '{{ if eq .CommonLabels.alertname "T4KBackupFailed" }}:x:{{ else }}:white_check_mark:{{ end }} [{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
          # Custom Slack message body
          text: |
            {{ if eq .CommonLabels.alertname "T4KBackupFailed" }}:rotating_light: *Backup Failed*{{ else }}:tada: *Backup Successful*{{ end }}
            {{ range .Alerts }}
            > *{{ .Annotations.summary }}*
            > {{ .Annotations.description }}

            *Backup Details*
            • Backup: `{{ .Labels.backup }}`
            • Kind: `{{ .Labels.kind }}`
            • BackupPlan: `{{ .Labels.backupplan }}`
            • Namespace: `{{ .Labels.namespace }}`
            • Target: `{{ .Labels.target }}`
            • Status: `{{ .Labels.status }}`
            • Cluster: `{{ .Labels.cluster }}`
            • Severity: `{{ .Labels.severity }}`
            {{ end }}
          # Dynamic color based on alert type
          color: '{{ if eq .CommonLabels.alertname "T4KBackupFailed" }}danger{{ else }}good{{ end }}'
---
apiVersion: v1
kind: Secret
metadata:
  name: t4k-smtp-secret
  namespace: <namespace>
type: Opaque
stringData:
  password: "your-smtp-password"
---
apiVersion: v1
kind: Secret
metadata:
  name: t4k-slack-secret
  namespace: <namespace>
type: Opaque
stringData:
  webhook-url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```

## Viewing Metrics and Alerts in OpenShift Console

{% stepper %}
{% step %}
#### Viewing Metrics

* Navigate to **Observe > Metrics** in the OpenShift web console
* Select your project namespace from the dropdown
* Enter a PromQL query, for example:
  * `trilio_backup_info` - View all backup statuses
  * `trilio_backup_info == -1` - View failed backups
  * `trilio_target_info` - View target statuses
{% endstep %}

{% step %}
#### Viewing Alerts

* Navigate to **Observe > Alerting** in the OpenShift web console
* Click on **Alerting rules** to see all configured rules including T4K rules
* Click on **Alerts** to see currently firing alerts
* Use filters to narrow down to T4K alerts by searching
  for "T4K"
{% endstep %}

{% step %}
#### Viewing Alert Silences

* Navigate to **Observe > Alerting > Silences**
* Create silences for maintenance windows or known issues
{% endstep %}
{% endstepper %}

## T4K Metrics Reference

Trilio for Kubernetes exports the following Prometheus metrics:

### Metric Value Conventions

For status-based metrics (`*_info` metrics), the numeric value indicates the status:

| Status                    | Metric Value | Description                    |
| ------------------------- | ------------ | ------------------------------ |
| `Available` / `Completed` | `1`          | Resource is healthy/successful |
| `Failed` / `Error`        | `-1`         | Resource has failed            |
| `InProgress`              | `0`          | Operation is in progress       |
| Empty/Unknown             | `-2`         | Status not yet determined      |

### Available Metrics

#### Backup Metrics

| Metric Name                        | Description                                             |
| ---------------------------------- | ------------------------------------------------------- |
| `trilio_backup_info`               | Backup status and metadata                              |
| `trilio_backup_storage`            | Backup size in bytes                                    |
| `trilio_backup_status_percentage`  | Backup progress (0-100)                                 |
| `trilio_backup_completed_duration` | Backup duration in minutes (only for completed backups) |
| `trilio_backup_metadata_info`      | Detailed backup object metadata                         |

#### Restore Metrics

| Metric Name                         | Description                                               |
| ----------------------------------- | --------------------------------------------------------- |
| `trilio_restore_info`               | Restore status and metadata                               |
| `trilio_restore_status_percentage`  | Restore progress (0-100)                                  |
| `trilio_restore_completed_duration` | Restore duration in minutes (only for completed restores) |

#### Target Metrics

| Metric Name             | Description                                             |
| ----------------------- | ------------------------------------------------------- |
| `trilio_target_info`    | Target availability status (1=available, 0=unavailable) |
| `trilio_target_storage` | Storage used by target in bytes                         |

#### BackupPlan Metrics

| Metric Name                  | Description                          |
| ---------------------------- | ------------------------------------ |
| `trilio_backupplan_info`     | BackupPlan status and summary        |
| `trilio_backupplan_crstatus` | BackupPlan continuous restore status |

#### Continuous Restore Metrics

| Metric Name                               | Description                       |
| ----------------------------------------- | --------------------------------- |
| `trilio_continuousrestoreplan_info`       | ContinuousRestorePlan status      |
| `trilio_consistentset_info`               | ConsistentSet status and details  |
| `trilio_consistentset_status_percentage`  | ConsistentSet progress (0-100)    |
| `trilio_consistentset_completed_duration` | ConsistentSet duration in minutes |

### Example PromQL Queries

```promql
# Count of failed backups by namespace
count(trilio_backup_info == -1) by (resource_namespace)

# List all successful backups
trilio_backup_info{status="Available"}

# Total backup storage per target
sum(trilio_backup_storage) by (target)

# Average backup duration by backupplan
avg(trilio_backup_completed_duration) by (backupplan)

# Unavailable targets
trilio_target_info == 0

# BackupPlans without successful backups
trilio_backupplan_info{protected="False"}

# Failed restores
trilio_restore_info == -1

# ContinuousRestorePlan replication status
trilio_consistentset_info{status="InProgress"}
```

## Separating Platform and User Alerts

OpenShift adds the label `openshift_io_alert_source="platform"` to all platform alerts. You can use this to configure different routing for T4K alerts:

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-user-alerts-routing
  namespace: <namespace>
spec:
  route:
    receiver: t4k-alerts-receiver
    matchers:
      # Match only user-defined alerts (not platform alerts)
      - name: openshift_io_alert_source
        value: platform
        matchType: "!="
    groupBy:
      - alertname
      - namespace
  receivers:
    - name: t4k-alerts-receiver
      slackConfigs:
        - apiURL:
            name: slack-webhook-secret
            key: webhook-url
          channel: '#t4k-alerts'
          sendResolved: true
```

## Troubleshooting

In the commands below, replace `<namespace>` with the namespace where T4K and its monitoring resources are installed.

### Metrics Not Appearing

1. Verify the T4K exporter pod is running:

   ```bash
   oc get pods -n <namespace> -l app.kubernetes.io/name=k8s-triliovault-exporter
   ```

2. Check if the ServiceMonitor is correctly configured:

   ```bash
   oc get servicemonitor -n <namespace>
   ```

3. Verify Prometheus is scraping the target:

   ```bash
   oc port-forward -n openshift-user-workload-monitoring svc/prometheus-user-workload 9090:9090
   # Then visit http://localhost:9090/targets
   ```

### Alerts Not Firing

1. Verify the PrometheusRule is loaded:

   ```bash
   oc get prometheusrules -n <namespace>
   ```

2. Check for errors in the Prometheus logs:

   ```bash
   oc logs -n openshift-user-workload-monitoring -l app.kubernetes.io/name=prometheus -c prometheus
   ```

3. Verify the alert expression returns results:

   ```bash
   # Use the Metrics console to test your expression
   # Navigate to Observe > Metrics and run: trilio_backup_info == -1
   ```

### Alert Notifications Not Received

1. Verify the AlertmanagerConfig is applied:

   ```bash
   oc get alertmanagerconfig -n <namespace>
   ```

2. Check the Alertmanager logs:

   ```bash
   oc logs -n openshift-user-workload-monitoring -l app.kubernetes.io/name=alertmanager
   ```

3. Verify secrets are correctly created:

   ```bash
   oc get secrets -n <namespace> | grep -E 'slack|smtp|pagerduty'
   ```

## Additional Resources

* [OpenShift Container Platform Monitoring Documentation](https://docs.redhat.com/en/documentation/monitoring_stack_for_red_hat_openshift/4.19/html/configuring_user_workload_monitoring)
* [Prometheus Alerting Rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)
* [Alertmanager Configuration](https://prometheus.io/docs/alerting/latest/configuration/)