Observability of Trilio with OpenShift Monitoring

Introduction

OpenShift Container Platform includes a built-in monitoring stack based on Prometheus that can be extended to monitor user workloads. This guide explains how to configure Trilio for Kubernetes (T4K) observability using OpenShift's native monitoring capabilities.

For more information about OpenShift monitoring, see the OpenShift Container Platform Monitoring Documentation.

Prerequisites

  • OpenShift Container Platform 4.x cluster

  • Trilio for Kubernetes installed

  • Cluster administrator access (for initial setup)

  • User with monitoring-edit or monitoring-rules-edit role (for configuring alerts)

Enabling User Workload Monitoring

Before you can monitor T4K metrics, you must enable user workload monitoring on your OpenShift cluster.

Step 1: Enable User Workload Monitoring — Create ConfigMap

Create or edit the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true

Apply the configuration:

oc apply -f cluster-monitoring-config.yaml

Step 2: Verify User Workload Monitoring is Running

Check that the user workload monitoring components are running:

oc get pods -n openshift-user-workload-monitoring

# Expected output:
# NAME                                   READY   STATUS    RESTARTS   AGE
# prometheus-operator-xxxxxxxxxx-xxxxx   2/2     Running   0          5m
# prometheus-user-workload-0             6/6     Running   0          5m
# prometheus-user-workload-1             6/6     Running   0          5m
# thanos-ruler-user-workload-0           4/4     Running   0          5m
# thanos-ruler-user-workload-1           4/4     Running   0          5m

Step 3: Enable Alertmanager and AlertmanagerConfig

To use alerting with T4K, enable Alertmanager and allow users to create AlertmanagerConfig resources. There are two ConfigMaps to configure.

Configure Cluster Monitoring (openshift-monitoring)

Update the cluster-monitoring-config ConfigMap to enable user workload monitoring and allow AlertmanagerConfig resources to route to the platform Alertmanager:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    alertmanagerMain:
      enableUserAlertmanagerConfig: true

Apply the configuration:

oc apply -f cluster-monitoring-config.yaml

Configure User Workload Monitoring (openshift-user-workload-monitoring)

To enable a dedicated Alertmanager for user workload monitoring (separate from the platform Alertmanager), create the user-workload-monitoring-config ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true

Apply the configuration:

oc apply -f user-workload-monitoring-config.yaml

Choosing between Platform and User Workload Alertmanager:

| Configuration | Alertmanager Location | Use Case |
| --- | --- | --- |
| cluster-monitoring-config with alertmanagerMain.enableUserAlertmanagerConfig: true | Platform (openshift-monitoring) | Route user alerts to the shared platform Alertmanager |
| user-workload-monitoring-config with alertmanager.enabled: true | User Workload (openshift-user-workload-monitoring) | Dedicated Alertmanager for user workloads, separate from platform alerts |


You can use either or both configurations depending on your requirements.

Step 4: Verify Alertmanager is Running

Platform Alertmanager (openshift-monitoring)

Check that the platform Alertmanager pods are running:

oc get pods -n openshift-monitoring -l app.kubernetes.io/name=alertmanager

# Expected output:
# NAME                   READY   STATUS    RESTARTS   AGE
# alertmanager-main-0    6/6     Running   0          5m
# alertmanager-main-1    6/6     Running   0          5m

Access the platform Alertmanager UI:

# Port-forward to access the UI
oc port-forward -n openshift-monitoring svc/alertmanager-main 9093:9093

# Open http://localhost:9093 in your browser

User Workload Alertmanager (openshift-user-workload-monitoring)

If you enabled the user workload Alertmanager, verify it's running:

oc get pods -n openshift-user-workload-monitoring -l app.kubernetes.io/name=alertmanager

# Expected output:
# NAME                            READY   STATUS    RESTARTS   AGE
# alertmanager-user-workload-0    6/6     Running   0          5m
# alertmanager-user-workload-1    6/6     Running   0          5m

Access the user workload Alertmanager UI:

# Port-forward to access the UI
oc port-forward -n openshift-user-workload-monitoring svc/alertmanager-user-workload 9093:9093

# Open http://localhost:9093 in your browser

Tip: You can check which Alertmanager your alerts are routing to by viewing the alert in the OpenShift Console under Observe > Alerting > Alerts and checking the source.

Step 5: Grant User Permissions for Alert Routing (Optional)

To allow non-admin users to create AlertmanagerConfig resources, grant them the alert-routing-edit role:

# Grant alert-routing-edit to a specific user
oc adm policy add-role-to-user alert-routing-edit <username> -n <namespace>

# Or grant to a group
oc adm policy add-role-to-group alert-routing-edit <groupname> -n <namespace>

Available roles for monitoring:

| Role | Description |
| --- | --- |
| monitoring-rules-view | View PrometheusRule and AlertmanagerConfig resources |
| monitoring-rules-edit | Create/modify PrometheusRule resources |
| monitoring-edit | Create/modify ServiceMonitor, PodMonitor, and PrometheusRule resources |
| alert-routing-edit | Create/modify AlertmanagerConfig resources |

Configuring T4K Metrics Collection

Option 1: Enable the ServiceMonitor via the TVM Custom Resource

On OpenShift, enable Prometheus scraping for T4K by setting exporter.serviceMonitor.enabled: true in the TrilioVaultManager (TVM) Custom Resource.

Example TVM spec (edit your existing TVM and apply):
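A minimal sketch of the relevant TVM fields (the apiVersion, resource name, and namespace below are illustrative; keep the rest of your existing spec unchanged):

```yaml
apiVersion: triliovault.trilio.io/v1
kind: TrilioVaultManager
metadata:
  name: triliovault-manager        # your existing TVM name
  namespace: trilio-system         # your T4K namespace
spec:
  # ...existing fields unchanged...
  exporter:
    serviceMonitor:
      enabled: true                # enables creation of the metrics Service and ServiceMonitor
```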

This creates:

  • A Service exposing the exporter metrics on port 8080 (created by the exporter)

  • A ServiceMonitor that configures Prometheus to scrape the metrics

Option 2: Create ServiceMonitor Manually

If you prefer to create the ServiceMonitor manually or need custom configuration, apply the following:
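A sketch of a manual ServiceMonitor (the selector label is an assumption; match it to the labels on the exporter Service in your cluster):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: t4k-exporter
  namespace: trilio-system
spec:
  selector:
    matchLabels:
      app: k8s-triliovault-exporter   # assumed label; verify with: oc get svc --show-labels
  endpoints:
    - targetPort: 8080                # exporter metrics port
      path: /metrics
      interval: 30s
```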

Apply the ServiceMonitor:
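Assuming the manifest was saved as t4k-servicemonitor.yaml (file name is illustrative):

```shell
oc apply -f t4k-servicemonitor.yaml
```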

circle-info

When exporter.serviceMonitor.enabled is set to false (default), the exporter pod includes Prometheus scrape annotations (prometheus.io/scrape: "true"). If your Prometheus is configured to discover targets via annotations, metrics will be collected automatically without a ServiceMonitor.

Verifying Metrics Collection

After applying the ServiceMonitor, verify that metrics are being collected:
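One way to check from the command line, assuming the user workload Prometheus is reachable via port-forward (the prometheus-operated Service is created by the Prometheus Operator):

```shell
# Forward the user workload Prometheus API locally
oc port-forward -n openshift-user-workload-monitoring svc/prometheus-operated 9090:9090

# In a second terminal, query for a T4K metric
curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=trilio_backup_info'
```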

You can also verify from the OpenShift web console by navigating to Observe > Metrics and querying for trilio_backup_info.

Configuring T4K Alerting Rules

OpenShift uses PrometheusRule resources to define alerting rules. Create alerting rules for T4K in the namespace where T4K is installed (trilio-system by default).

T4K Alerting Rules

Create a file named t4k-prometheus-rules.yaml:
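A minimal sketch of such a rules file (the alert names, severities, and durations are illustrative; the expressions follow the metric value conventions documented in the metrics reference below):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: t4k-prometheus-rules
  namespace: trilio-system
spec:
  groups:
    - name: t4k.rules
      rules:
        - alert: T4KBackupFailed
          expr: trilio_backup_info == -1    # -1 indicates a failed backup
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "A T4K backup has failed"
        - alert: T4KTargetUnavailable
          expr: trilio_target_info == 0     # 0 indicates an unavailable target
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "A T4K backup target is unavailable"
```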

Apply the alerting rules:
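Using the file name from the step above:

```shell
oc apply -f t4k-prometheus-rules.yaml
```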

Verifying Alerting Rules

Check that the alerting rules are loaded:
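For example, with the oc CLI (trilio-system is the default T4K namespace):

```shell
oc get prometheusrule -n trilio-system
```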

You can also view the alerts in the OpenShift web console by navigating to Observe > Alerting > Alerting rules.

Configuring Alert Routing

OpenShift supports AlertmanagerConfig resources for configuring alert routing in user workloads. This allows you to define custom receivers and routing rules for T4K alerts.

Prerequisites for Alert Routing

Ensure your user has the alert-routing-edit cluster role:
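One way to check, using oc auth can-i against the Prometheus Operator CRD:

```shell
# Prints "yes" if the current user may create AlertmanagerConfig resources
oc auth can-i create alertmanagerconfigs.monitoring.coreos.com -n trilio-system
```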

Example: AlertmanagerConfig with Slack Notifications

Create a file named t4k-alertmanager-config.yaml:
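A sketch of a Slack configuration (the receiver name, channel, and t4k-slack-webhook secret name are illustrative; the referenced secret is created in the next step):

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-alertmanager-config
  namespace: trilio-system
spec:
  route:
    receiver: t4k-slack
    groupBy: ['alertname']
  receivers:
    - name: t4k-slack
      slackConfigs:
        - channel: '#t4k-alerts'
          apiURL:
            name: t4k-slack-webhook   # Secret holding the webhook URL
            key: url
```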

Creating the Slack Webhook Secret

Store your Slack webhook URL in a Kubernetes Secret:
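For example (the secret and key names are illustrative and must match what your AlertmanagerConfig references):

```shell
oc create secret generic t4k-slack-webhook \
  --from-literal=url='https://hooks.slack.com/services/REPLACE/WITH/YOURS' \
  -n trilio-system
```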

Apply the AlertmanagerConfig:
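Using the file name from the step above:

```shell
oc apply -f t4k-alertmanager-config.yaml
```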

Example: AlertmanagerConfig with Email Notifications
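A sketch of an email configuration (the addresses, SMTP host, and t4k-smtp-password secret name are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-email-config
  namespace: trilio-system
spec:
  route:
    receiver: t4k-email
  receivers:
    - name: t4k-email
      emailConfigs:
        - to: ops-team@example.com
          from: t4k-alerts@example.com
          smarthost: smtp.example.com:587
          authUsername: t4k-alerts@example.com
          authPassword:
            name: t4k-smtp-password   # Secret holding the SMTP password
            key: password
```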

Create the SMTP password secret:
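For example (secret and key names are illustrative and must match what your email AlertmanagerConfig references):

```shell
oc create secret generic t4k-smtp-password \
  --from-literal=password='REPLACE_WITH_SMTP_PASSWORD' \
  -n trilio-system
```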

Example: AlertmanagerConfig with PagerDuty
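A sketch of a PagerDuty configuration (the resource and t4k-pagerduty-key secret names are illustrative; the routing key is a PagerDuty Events API v2 integration key):

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-pagerduty-config
  namespace: trilio-system
spec:
  route:
    receiver: t4k-pagerduty
  receivers:
    - name: t4k-pagerduty
      pagerdutyConfigs:
        - routingKey:
            name: t4k-pagerduty-key   # Secret holding the Events API v2 routing key
            key: routing-key
```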

Create the PagerDuty secret:
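For example (secret and key names are illustrative and must match what your PagerDuty AlertmanagerConfig references):

```shell
oc create secret generic t4k-pagerduty-key \
  --from-literal=routing-key='REPLACE_WITH_ROUTING_KEY' \
  -n trilio-system
```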

Custom Notification Templates

OpenShift Alertmanager supports custom notification templates for email and Slack messages. This section explains how to configure T4K-specific templates with rich formatting.

Use inline templates directly in your AlertmanagerConfig for full control over notification formatting:
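A fragment illustrating inline Go templating in a Slack receiver (the channel and wording are illustrative; the title and text fields accept standard Alertmanager template expressions):

```yaml
receivers:
  - name: t4k-slack
    slackConfigs:
      - channel: '#t4k-alerts'
        title: '[{{ .Status | toUpper }}] T4K: {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Summary:* {{ .Annotations.summary }}
          {{ end }}
```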

Viewing Metrics and Alerts in OpenShift Console

Step 1: Viewing Metrics

  • Navigate to Observe > Metrics in the OpenShift web console

  • Select your project namespace from the dropdown

  • Enter a PromQL query, for example:

    • trilio_backup_info - View all backup statuses

    • trilio_backup_info == -1 - View failed backups

    • trilio_target_info - View target statuses

Step 2: Viewing Alerts

  • Navigate to Observe > Alerting in the OpenShift web console

  • Click on Alerting rules to see all configured rules including T4K rules

  • Click on Alerts to see currently firing alerts

  • Use filters to narrow down to T4K alerts by searching for "T4K"

Step 3: Viewing Alert Silences

  • Navigate to Observe > Alerting > Silences

  • Create silences for maintenance windows or known issues

T4K Metrics Reference

Trilio for Kubernetes exports the following Prometheus metrics:

Metric Value Conventions

For status-based metrics (*_info metrics), the numeric value indicates the status:

| Status | Metric Value | Description |
| --- | --- | --- |
| Available / Completed | 1 | Resource is healthy/successful |
| Failed / Error | -1 | Resource has failed |
| InProgress | 0 | Operation is in progress |
| Empty/Unknown | -2 | Status not yet determined |

Available Metrics

Backup Metrics

| Metric Name | Description |
| --- | --- |
| trilio_backup_info | Backup status and metadata |
| trilio_backup_storage | Backup size in bytes |
| trilio_backup_status_percentage | Backup progress (0-100) |
| trilio_backup_completed_duration | Backup duration in minutes (only for completed backups) |
| trilio_backup_metadata_info | Detailed backup object metadata |

Restore Metrics

| Metric Name | Description |
| --- | --- |
| trilio_restore_info | Restore status and metadata |
| trilio_restore_status_percentage | Restore progress (0-100) |
| trilio_restore_completed_duration | Restore duration in minutes (only for completed restores) |

Target Metrics

| Metric Name | Description |
| --- | --- |
| trilio_target_info | Target availability status (1=available, 0=unavailable) |
| trilio_target_storage | Storage used by target in bytes |

BackupPlan Metrics

| Metric Name | Description |
| --- | --- |
| trilio_backupplan_info | BackupPlan status and summary |
| trilio_backupplan_crstatus | BackupPlan continuous restore status |

Continuous Restore Metrics

| Metric Name | Description |
| --- | --- |
| trilio_continuousrestoreplan_info | ContinuousRestorePlan status |
| trilio_consistentset_info | ConsistentSet status and details |
| trilio_consistentset_status_percentage | ConsistentSet progress (0-100) |
| trilio_consistentset_completed_duration | ConsistentSet duration in minutes |

Example PromQL Queries
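Some starting-point queries, based on the metric value conventions above (the 60-minute threshold is illustrative):

```promql
# All failed backups
trilio_backup_info == -1

# Backups currently in progress
trilio_backup_info == 0

# Unavailable backup targets
trilio_target_info == 0

# Total storage consumed across all targets, in bytes
sum(trilio_target_storage)

# Backups that took longer than 60 minutes to complete
trilio_backup_completed_duration > 60
```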

Separating Platform and User Alerts

OpenShift adds the label openshift_io_alert_source="platform" to all platform alerts. You can use this to configure different routing for T4K alerts:
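A sketch of a route in the platform Alertmanager configuration (the alertmanager-main secret in openshift-monitoring) that sends only non-platform alerts, such as T4K's, to a dedicated receiver (the receiver name is illustrative):

```yaml
route:
  routes:
    - matchers:
        - openshift_io_alert_source != "platform"
      receiver: t4k-notifications
```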

Troubleshooting

Metrics Not Appearing

  1. Verify the T4K exporter pod is running:

  2. Check if the ServiceMonitor is correctly configured:

  3. Verify Prometheus is scraping the target:
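The three checks above can be sketched as follows (resource names vary by installation):

```shell
# 1. Exporter pod running?
oc get pods -n trilio-system | grep -i exporter

# 2. ServiceMonitor present, with a selector matching the exporter Service labels?
oc get servicemonitor -n trilio-system -o yaml

# 3. Prometheus scraping the target? Inspect targets via port-forward
oc port-forward -n openshift-user-workload-monitoring svc/prometheus-operated 9090:9090
# then open http://localhost:9090/targets in a browser
```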

Alerts Not Firing

  1. Verify the PrometheusRule is loaded:

  2. Check for errors in the Prometheus logs:

  3. Verify the alert expression returns results:
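The three checks above can be sketched as follows:

```shell
# 1. PrometheusRule loaded?
oc get prometheusrule -n trilio-system

# 2. Errors in the user workload Prometheus logs?
oc logs -n openshift-user-workload-monitoring prometheus-user-workload-0 -c prometheus | grep -i error

# 3. Alert expression returns results? Query via port-forward
oc port-forward -n openshift-user-workload-monitoring svc/prometheus-operated 9090:9090
curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=trilio_backup_info == -1'
```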

Alert Notifications Not Received

  1. Verify the AlertmanagerConfig is applied:

  2. Check the Alertmanager logs:

  3. Verify secrets are correctly created:
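The three checks above can be sketched as follows (substitute the secret and resource names your configuration actually uses):

```shell
# 1. AlertmanagerConfig applied?
oc get alertmanagerconfig -n trilio-system

# 2. Errors in the Alertmanager logs? (user workload instance shown)
oc logs -n openshift-user-workload-monitoring alertmanager-user-workload-0 -c alertmanager

# 3. Referenced secrets exist with the expected keys?
oc get secret -n trilio-system
```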

Additional Resources
