# Observability of Trilio with OpenShift Monitoring

## Introduction

OpenShift Container Platform includes a built-in monitoring stack based on Prometheus that can be extended to monitor user workloads. This guide explains how to configure Trilio for Kubernetes (T4K) observability using OpenShift's native monitoring capabilities.

For more information about OpenShift monitoring, see the [OpenShift Container Platform Monitoring Documentation](https://docs.redhat.com/en/documentation/monitoring_stack_for_red_hat_openshift/4.19/html/configuring_user_workload_monitoring).

## Prerequisites

* OpenShift Container Platform 4.x cluster
* Trilio for Kubernetes installed
* Cluster administrator access (for initial setup)
* User with `monitoring-edit` or `monitoring-rules-edit` role (for configuring alerts)

## Enabling User Workload Monitoring

Before you can monitor T4K metrics, you must enable user workload monitoring on your OpenShift cluster.

{% stepper %}
{% step %}

#### Enable User Workload Monitoring — Create ConfigMap

Create or edit the `cluster-monitoring-config` ConfigMap in the `openshift-monitoring` namespace:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```

Apply the configuration:

```bash
oc apply -f cluster-monitoring-config.yaml
```

{% endstep %}

{% step %}

#### Verify User Workload Monitoring is Running

Check that the user workload monitoring components are running:

```bash
oc get pods -n openshift-user-workload-monitoring

# Expected output:
# NAME                                   READY   STATUS    RESTARTS   AGE
# prometheus-operator-xxxxxxxxxx-xxxxx   2/2     Running   0          5m
# prometheus-user-workload-0             6/6     Running   0          5m
# prometheus-user-workload-1             6/6     Running   0          5m
# thanos-ruler-user-workload-0           4/4     Running   0          5m
# thanos-ruler-user-workload-1           4/4     Running   0          5m
```
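To spot pods that are not fully ready without reading the table by eye, the output can be filtered. The following is a minimal sketch, not part of OpenShift: `not_ready_pods` is a hypothetical helper name, and it assumes the default column layout of `oc get pods --no-headers` (NAME, READY, STATUS, and so on).

```bash
# Hypothetical helper: reads `oc get pods --no-headers` output on stdin and
# prints every pod whose READY count is incomplete or whose STATUS is not Running.
not_ready_pods() {
  awk '{
    split($2, ready, "/")    # READY column looks like "6/6"
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}
```

For example, `oc get pods -n openshift-user-workload-monitoring --no-headers | not_ready_pods` prints nothing when every pod is fully ready and running.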

{% endstep %}

{% step %}

#### Enable Alertmanager and AlertmanagerConfig

To use alerting with T4K, enable Alertmanager and allow users to create `AlertmanagerConfig` resources. Depending on which Alertmanager you want to use, this involves one or both of the following ConfigMaps.

**Configure Cluster Monitoring (openshift-monitoring)**

Update the `cluster-monitoring-config` ConfigMap to enable user workload monitoring and allow `AlertmanagerConfig` resources to route to the platform Alertmanager:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    alertmanagerMain:
      enableUserAlertmanagerConfig: true
```

Apply the configuration:

```bash
oc apply -f cluster-monitoring-config.yaml
```

**Configure User Workload Monitoring (openshift-user-workload-monitoring)**

To enable a dedicated Alertmanager for user workload monitoring (separate from the platform Alertmanager), create the `user-workload-monitoring-config` ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
```

Apply the configuration:

```bash
oc apply -f user-workload-monitoring-config.yaml
```

{% hint style="info" %}
**Choosing between Platform and User Workload Alertmanager:**
{% endhint %}

<table><thead><tr><th width="307">Configuration</th><th width="314">Alertmanager Location</th><th>Use Case</th></tr></thead><tbody><tr><td><p><code>cluster-monitoring-config</code></p><p>with <code>alertmanagerMain.</code></p><p><code>enableUserAlertmanagerConfig: true</code></p></td><td>Platform (<code>openshift-monitoring</code>)</td><td>Route user alerts to the shared platform Alertmanager</td></tr><tr><td><code>user-workload-monitoring-config</code> with <code>alertmanager.enabled: true</code></td><td>User Workload (<code>openshift-user-workload-monitoring</code>)</td><td>Dedicated Alertmanager for user workloads, separate from platform alerts</td></tr></tbody></table>

{% hint style="info" %}
You can use either or both configurations depending on your requirements.
{% endhint %}

{% hint style="warning" %}
**Note**: If you enable the user workload Alertmanager (`user-workload-monitoring-config`), `AlertmanagerConfig` resources in user namespaces will route to the user workload Alertmanager, not the platform Alertmanager.
{% endhint %}
{% endstep %}

{% step %}

#### Verify Alertmanager is Running

**Platform Alertmanager (openshift-monitoring)**

Check that the platform Alertmanager pods are running:

```bash
oc get pods -n openshift-monitoring -l app.kubernetes.io/name=alertmanager

# Expected output:
# NAME                   READY   STATUS    RESTARTS   AGE
# alertmanager-main-0    6/6     Running   0          5m
# alertmanager-main-1    6/6     Running   0          5m
```

Access the platform Alertmanager UI:

```bash
# Port-forward to access the UI
oc port-forward -n openshift-monitoring svc/alertmanager-main 9093:9093

# Open http://localhost:9093 in your browser
```

**User Workload Alertmanager (openshift-user-workload-monitoring)**

If you enabled the user workload Alertmanager, verify it's running:

```bash
oc get pods -n openshift-user-workload-monitoring -l app.kubernetes.io/name=alertmanager

# Expected output:
# NAME                            READY   STATUS    RESTARTS   AGE
# alertmanager-user-workload-0    6/6     Running   0          5m
# alertmanager-user-workload-1    6/6     Running   0          5m
```

Access the user workload Alertmanager UI:

```bash
# Port-forward to access the UI
oc port-forward -n openshift-user-workload-monitoring svc/alertmanager-user-workload 9093:9093

# Open http://localhost:9093 in your browser
```

{% hint style="info" %}
**Tip**: You can check which Alertmanager your alerts are routing to by viewing the alert in the OpenShift Console under **Observe > Alerting > Alerts** and checking the source.
{% endhint %}
{% endstep %}

{% step %}

#### Grant User Permissions for Alert Routing (Optional)

To allow non-admin users to create `AlertmanagerConfig` resources, grant them the `alert-routing-edit` role:

```bash
# Grant alert-routing-edit to a specific user
oc adm policy add-role-to-user alert-routing-edit <username> -n <namespace>

# Or grant to a group
oc adm policy add-role-to-group alert-routing-edit <groupname> -n <namespace>
```

Available roles for monitoring:

| Role                    | Description                                                            |
| ----------------------- | ---------------------------------------------------------------------- |
| `monitoring-rules-view` | View PrometheusRule and AlertmanagerConfig resources                   |
| `monitoring-rules-edit` | Create/modify PrometheusRule resources                                 |
| `monitoring-edit`       | Create/modify ServiceMonitor, PodMonitor, and PrometheusRule resources |
| `alert-routing-edit`    | Create/modify AlertmanagerConfig resources                             |

{% endstep %}
{% endstepper %}

## Configuring T4K Metrics Collection

### Option 1: Enable ServiceMonitor via TVM Custom Resource (Recommended)

On OpenShift, enable Prometheus scraping for T4K by setting `exporter.serviceMonitor.enabled: true` in the `TrilioVaultManager` (TVM) Custom Resource.

Example TVM spec (edit your existing TVM and apply):

```yaml
# You can also edit the TVM CR from the OperatorHub page in the OpenShift web console
apiVersion: triliovault.trilio.io/v1
kind: TrilioVaultManager
metadata:
  name: tvm
  namespace: trilio-system
spec:
  tvkInstanceName: tvk
  applicationScope: Cluster
  componentConfiguration:
    exporter:
      serviceMonitor:
        enabled: true
```

This creates:

* A **Service** exposing the exporter metrics on port 8080 (created by the exporter)
* A **ServiceMonitor** that configures Prometheus to scrape the metrics

### Option 2: Create ServiceMonitor Manually

If you prefer to create the ServiceMonitor manually or need custom configuration, apply the following:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: k8s-triliovault-exporter-service
  namespace: <t4k-install-namespace>
  labels:
    app: k8s-triliovault-exporter
spec:
  ports:
    - name: web
      protocol: TCP
      port: 8080
      targetPort: 8080
  selector:
    app: k8s-triliovault-exporter
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: k8s-triliovault-exporter
  namespace: <t4k-install-namespace>
  labels:
    app: k8s-triliovault-exporter
spec:
  selector:
    matchLabels:
      app: k8s-triliovault-exporter
  endpoints:
    - port: web
      interval: 30s
      path: /metrics
      scheme: http
```

Save the manifest as `t4k-servicemonitor.yaml` and apply it (this creates both the Service and the ServiceMonitor):

```bash
oc apply -f t4k-servicemonitor.yaml
```

{% hint style="info" %}
When `exporter.serviceMonitor.enabled` is set to `false` (default), the exporter pod includes Prometheus scrape annotations (`prometheus.io/scrape: "true"`). If your Prometheus is configured to discover targets via annotations, metrics will be collected automatically without a ServiceMonitor.
{% endhint %}

### Verifying Metrics Collection

After applying the ServiceMonitor, verify that metrics are being collected:

```bash
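# Port-forward to the Thanos Querier.
# The thanos-querier Service serves HTTPS on port 9091 and requires a bearer token;
# your user needs permission to query metrics (for example, cluster-monitoring-view).
oc port-forward -n openshift-monitoring svc/thanos-querier 9091:9091

# In another terminal, query for T4K metrics
curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
  'https://localhost:9091/api/v1/query?query=trilio_backup_info' | jq .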
```

You can also verify from the OpenShift web console by navigating to **Observe > Metrics** and querying for `trilio_backup_info`.

<figure><img src="/files/VPH2WnZkMqy0FnTl7l8L" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/m6l06GGE7MynbWJvcjg2" alt=""><figcaption></figcaption></figure>

## Configuring T4K Alerting Rules

OpenShift uses `PrometheusRule` resources to define alerting rules. Create the T4K alerting rules in the namespace where T4K is installed (`trilio-system` by default).

### T4K Alerting Rules

Create a file named `t4k-prometheus-rules.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: t4k-alerting-rules
  namespace: trilio-system
  labels:
    app: k8s-triliovault
spec:
  groups:
    - name: t4k-backup-alerts
      rules:
        # Alert when backup fails (metric value -1 indicates Failed/Error status)
        - alert: T4KBackupFailed
          expr: trilio_backup_info == -1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "TrilioVault Backup Failed"
            description: "Backup {{ $labels.backup }} has failed in namespace {{ $labels.resource_namespace }} on cluster {{ $labels.cluster }}"

        # Alert when backup is stuck in progress for too long.
        # The percentage metric is the left operand so that $value carries the progress.
        - alert: T4KBackupStuck
          expr: trilio_backup_status_percentage < 100 and trilio_backup_info{status="InProgress"} == 0
          for: 60m
          labels:
            severity: warning
          annotations:
            summary: "TrilioVault Backup Stuck"
            description: "Backup {{ $labels.backup }} has been in progress for more than 60 minutes in namespace {{ $labels.resource_namespace }}. Current progress: {{ $value }}%"

        # Alert when backup takes unusually long time
        - alert: T4KBackupDurationHigh
          expr: trilio_backup_completed_duration > 120
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "TrilioVault Backup Duration High"
            description: "Backup {{ $labels.backup }} took {{ $value }} minutes to complete in namespace {{ $labels.resource_namespace }}"

    - name: t4k-restore-alerts
      rules:
        # Alert when restore fails
        - alert: T4KRestoreFailed
          expr: trilio_restore_info == -1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "TrilioVault Restore Failed"
            description: "Restore {{ $labels.restore }} has failed in namespace {{ $labels.resource_namespace }} on cluster {{ $labels.cluster }}"

        # Alert when restore is stuck
        - alert: T4KRestoreStuck
          expr: trilio_restore_info{status="InProgress"} == 0 and trilio_restore_status_percentage < 100
          for: 60m
          labels:
            severity: warning
          annotations:
            summary: "TrilioVault Restore Stuck"
            description: "Restore {{ $labels.restore }} has been in progress for more than 60 minutes in namespace {{ $labels.resource_namespace }}"

    - name: t4k-target-alerts
      rules:
        # Alert when target is unavailable (metric value 0 indicates unavailable)
        - alert: T4KTargetUnavailable
          expr: trilio_target_info == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "TrilioVault Target Unavailable"
            description: "Target {{ $labels.target }} is not available in namespace {{ $labels.resource_namespace }}. Status: {{ $labels.status }}"

        # Alert when target storage exceeds threshold (example: 500GB)
        - alert: T4KTargetStorageHigh
          expr: trilio_target_storage > 500000000000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "TrilioVault Target Storage High"
            description: "Target {{ $labels.target }} storage usage is {{ $value | humanize1024 }}B in namespace {{ $labels.resource_namespace }}"

    - name: t4k-backupplan-alerts
      rules:
        # Alert when BackupPlan has no successful backups (not protected)
        - alert: T4KBackupPlanNotProtected
          expr: trilio_backupplan_info{protected="False"} == 1
          for: 24h
          labels:
            severity: warning
          annotations:
            summary: "TrilioVault BackupPlan Not Protected"
            description: "BackupPlan {{ $labels.backupplan }} in namespace {{ $labels.resource_namespace }} has no successful backups for more than 24 hours"

        # Alert when BackupPlan fails
        - alert: T4KBackupPlanFailed
          expr: trilio_backupplan_info == -1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "TrilioVault BackupPlan Failed"
            description: "BackupPlan {{ $labels.backupplan }} has failed in namespace {{ $labels.resource_namespace }}"

    - name: t4k-continuous-restore-alerts
      rules:
        # Alert when ContinuousRestorePlan fails
        - alert: T4KContinuousRestorePlanFailed
          expr: trilio_continuousrestoreplan_info == -1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "TrilioVault ContinuousRestorePlan Failed"
            description: "ContinuousRestorePlan {{ $labels.continuousrestoreplan }} has failed on cluster {{ $labels.cluster }}"

        # Alert when ConsistentSet fails
        - alert: T4KConsistentSetFailed
          expr: trilio_consistentset_info == -1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "TrilioVault ConsistentSet Failed"
            description: "ConsistentSet {{ $labels.consistentset }} has failed for ContinuousRestorePlan {{ $labels.continuousrestoreplan }}"
```

Apply the alerting rules:

```bash
oc apply -f t4k-prometheus-rules.yaml
```
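The `T4KTargetStorageHigh` threshold in the rules file is written in raw bytes, and the alert description renders the value with `humanize1024`, which uses binary (1024-based) units. A minimal sketch for deriving such thresholds follows; `gb_to_bytes` and `gib_to_bytes` are hypothetical helper names, not part of T4K or OpenShift.

```bash
# Hypothetical helpers for computing byte thresholds used in alert expressions.

# Decimal gigabytes: 1 GB = 1000^3 bytes.
gb_to_bytes() {
  echo $(( $1 * 1000 * 1000 * 1000 ))
}

# Binary gibibytes: 1 GiB = 1024^3 bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gb_to_bytes 500    # prints 500000000000, the value used in the rule
gib_to_bytes 500   # prints 536870912000
```

Because `humanize1024` is 1024-based, an alert that fires right at the 500 GB (decimal) threshold reports roughly 465.7GiB in its description.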

### Verifying Alerting Rules

Check that the alerting rules are loaded:

```bash
oc get prometheusrules -n <t4k-install-namespace>

# View the rule details
oc describe prometheusrule t4k-alerting-rules -n <t4k-install-namespace>
```
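To cross-check what the cluster loaded against your local file, you can list the alert names defined in `t4k-prometheus-rules.yaml`. This is a minimal sketch that assumes each rule is declared on its own `- alert:` line, as in the rules file on this page; `list_alert_names` is a hypothetical helper name.

```bash
# Hypothetical helper: print the name of every alert declared in a rules file.
# Comment lines are ignored because they do not start with "- alert:".
list_alert_names() {
  sed -n 's/^[[:space:]]*-[[:space:]]*alert:[[:space:]]*//p' "$1"
}
```

Running `list_alert_names t4k-prometheus-rules.yaml | wc -l` should report 11 for the rules file shown earlier.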

You can also view the alerts in the OpenShift web console by navigating to **Observe > Alerting > Alerting rules**.

## Configuring Alert Routing

OpenShift supports `AlertmanagerConfig` resources for configuring alert routing in user workloads. This allows you to define custom receivers and routing rules for T4K alerts.

### Prerequisites for Alert Routing

Ensure your user has the `alert-routing-edit` role in the namespace where T4K is installed:

```bash
oc adm policy add-role-to-user alert-routing-edit <username> -n <t4k-install-namespace>
```

### Example: AlertmanagerConfig with Slack Notifications

Create a file named `t4k-alertmanager-config.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-alert-routing
  namespace: <t4k-install-namespace>
  labels:
    app: k8s-triliovault
spec:
  route:
    receiver: default
    groupBy:
      - alertname
      - namespace
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    routes:
      - receiver: slack-critical
        matchers:
          - name: severity
            value: critical
            matchType: "="
      - receiver: slack-warning
        matchers:
          - name: severity
            value: warning
            matchType: "="

  receivers:
    - name: default

    - name: slack-critical
      slackConfigs:
        - apiURL:
            name: slack-webhook-secret
            key: webhook-url
          channel: '#t4k-critical-alerts'
          sendResolved: true
          title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
          text: |-
            {{ range .Alerts }}
            *Alert:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Severity:* {{ .Labels.severity }}
            *Namespace:* {{ .Labels.resource_namespace }}
            {{ end }}

    - name: slack-warning
      slackConfigs:
        - apiURL:
            name: slack-webhook-secret
            key: webhook-url
          channel: '#t4k-warning-alerts'
          sendResolved: true
```

### Creating the Slack Webhook Secret

Store your Slack webhook URL in a Kubernetes Secret:

```bash
oc create secret generic slack-webhook-secret \
  --from-literal=webhook-url='https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK' \
  -n <t4k-install-namespace>
```

Apply the AlertmanagerConfig:

```bash
oc apply -f t4k-alertmanager-config.yaml
```

### Example: AlertmanagerConfig with Email Notifications

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-email-alerts
  namespace: <t4k-install-namespace>
  labels:
    app: k8s-triliovault
spec:
  route:
    receiver: email-receiver
    groupBy:
      - alertname
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h

  receivers:
    - name: email-receiver
      emailConfigs:
        - to: 'backup-team@example.com'
          from: 'alertmanager@example.com'
          smarthost: 'smtp.example.com:587'
          authUsername: 'alertmanager@example.com'
          authPassword:
            name: smtp-secret
            key: password
          sendResolved: true
```

Create the SMTP password secret:

```bash
oc create secret generic smtp-secret \
  --from-literal=password='your-smtp-password' \
  -n <t4k-install-namespace>
```

### Example: AlertmanagerConfig with PagerDuty

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-pagerduty-alerts
  namespace: <t4k-install-namespace>
  labels:
    app: k8s-triliovault
spec:
  route:
    receiver: pagerduty-critical
    groupBy:
      - alertname
      - namespace
    routes:
      - receiver: pagerduty-critical
        matchers:
          - name: severity
            value: critical
            matchType: "="

  receivers:
    - name: pagerduty-critical
      pagerdutyConfigs:
        - serviceKey:
            name: pagerduty-secret
            key: service-key
          severity: critical
          description: '{{ .CommonAnnotations.summary }}'
          details:
            - key: namespace
              value: '{{ .CommonLabels.resource_namespace }}'
            - key: alertname
              value: '{{ .CommonLabels.alertname }}'
```

Create the PagerDuty secret:

```bash
oc create secret generic pagerduty-secret \
  --from-literal=service-key='YOUR_PAGERDUTY_SERVICE_KEY' \
  -n <t4k-install-namespace>
```

## Custom Notification Templates

OpenShift Alertmanager supports custom notification templates for email and Slack messages. This section explains how to configure T4K-specific templates with rich formatting.

Use inline templates directly in your `AlertmanagerConfig` for full control over notification formatting:

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-custom-alerts
  namespace: <t4k-install-namespace>
  labels:
    app: k8s-triliovault
spec:
  route:
    receiver: t4k-alerts
    groupBy:
      - alertname
      - backup
      - namespace
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    matchers:
      - name: alertname
        value: "T4K.*"
        matchType: "=~"

  receivers:
    - name: t4k-alerts
      emailConfigs:
        - to: 'backup-team@example.com'
          from: 't4k-alerts@example.com'
          smarthost: 'smtp.example.com:587'
          authUsername: 'user'
          authPassword:
            name: t4k-smtp-secret
            key: password
          sendResolved: false
          # Custom email subject
          headers:
            - key: Subject
              value: '[{{ .Status | toUpper }}] TrilioVault: {{ .CommonLabels.alertname }}'
          # Custom HTML email body
          html: |
            <!DOCTYPE html>
            <html>
            <head><style>
              body { font-family: Arial, sans-serif; padding: 20px; max-width: 800px; margin: 0 auto; }
              .header { padding: 15px; background: {{ if eq .Status "firing" }}#f44336{{ else }}#4CAF50{{ end }}; color: white; border-radius: 4px; }
              .content { padding: 20px; background: #f5f5f5; margin-top: 10px; border-radius: 4px; }
              table { width: 100%; border-collapse: collapse; margin: 15px 0; }
              th { background: #34495e; color: white; padding: 10px; text-align: left; }
              td { padding: 10px; border-bottom: 1px solid #ddd; background: white; }
              .footer { margin-top: 20px; color: #666; font-size: 12px; }
            </style></head>
            <body>
              <div class="header">
                <h1>TrilioVault: {{ .CommonLabels.alertname }}</h1>
              </div>
              {{ range .Alerts }}
              <div class="content">
                <h3>{{ .Annotations.summary }}</h3>
                <p>{{ .Annotations.description }}</p>
                <table>
                  <tr><th>Property</th><th>Value</th></tr>
                  <tr><td>Backup</td><td>{{ .Labels.backup }}</td></tr>
                  <tr><td>Kind</td><td>{{ .Labels.kind }}</td></tr>
                  <tr><td>BackupPlan</td><td>{{ .Labels.backupplan }}</td></tr>
                  <tr><td>Namespace</td><td>{{ .Labels.namespace }}</td></tr>
                  <tr><td>Target</td><td>{{ .Labels.target }}</td></tr>
                  <tr><td>Status</td><td>{{ .Labels.status }}</td></tr>
                  <tr><td>Cluster</td><td>{{ .Labels.cluster }}</td></tr>
                </table>
              </div>
              {{ end }}
              <div class="footer">
                <p>Severity: {{ .CommonLabels.severity }}</p>
              </div>
            </body>
            </html>

      slackConfigs:
        - apiURL:
            name: t4k-slack-secret
            key: webhook-url
          channel: '#t4k-alerts'
          sendResolved: false
          # Custom Slack title with emoji indicators (this route matches every T4K alert,
          # so the indicator is driven by the notification status, not by one alert name)
          title: '{{ if eq .Status "firing" }}:x:{{ else }}:white_check_mark:{{ end }} [{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
          # Custom Slack message body
          text: |
            {{ if eq .Status "firing" }}:rotating_light: *Alert Firing*{{ else }}:tada: *Alert Resolved*{{ end }}
            {{ range .Alerts }}
            > *{{ .Annotations.summary }}*
            > {{ .Annotations.description }}

            *Backup Details*
            • Backup: `{{ .Labels.backup }}`
            • Kind: `{{ .Labels.kind }}`
            • BackupPlan: `{{ .Labels.backupplan }}`
            • Namespace: `{{ .Labels.namespace }}`
            • Target: `{{ .Labels.target }}`
            • Status: `{{ .Labels.status }}`
            • Cluster: `{{ .Labels.cluster }}`
            • Severity: `{{ .Labels.severity }}`
            {{ end }}
          # Dynamic color based on notification status
          color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
---
apiVersion: v1
kind: Secret
metadata:
  name: t4k-smtp-secret
  namespace: <t4k-install-namespace>
type: Opaque
stringData:
  password: "your-smtp-password"
---
apiVersion: v1
kind: Secret
metadata:
  name: t4k-slack-secret
  namespace: <t4k-install-namespace>
type: Opaque
stringData:
  webhook-url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```

## Viewing Metrics and Alerts in OpenShift Console

{% stepper %}
{% step %}

#### Viewing Metrics

* Navigate to **Observe > Metrics** in the OpenShift web console
* Select your project namespace from the dropdown
* Enter a PromQL query, for example:
  * `trilio_backup_info` - View all backup statuses
  * `trilio_backup_info == -1` - View failed backups
  * `trilio_target_info` - View target statuses

{% endstep %}

{% step %}

#### Viewing Alerts

* Navigate to **Observe > Alerting** in the OpenShift web console
* Click on **Alerting rules** to see all configured rules including T4K rules
* Click on **Alerts** to see currently firing alerts
* Use filters to narrow down to T4K alerts by searching for "T4K"

{% endstep %}

{% step %}

#### Viewing Alert Silences

* Navigate to **Observe > Alerting > Silences**
* Create silences for maintenance windows or known issues

{% endstep %}
{% endstepper %}

## T4K Metrics Reference

Trilio for Kubernetes exports the following Prometheus metrics:

### Metric Value Conventions

For status-based metrics (`*_info` metrics), the numeric value indicates the status:

| Status                    | Metric Value | Description                    |
| ------------------------- | ------------ | ------------------------------ |
| `Available` / `Completed` | `1`          | Resource is healthy/successful |
| `Failed` / `Error`        | `-1`         | Resource has failed            |
| `InProgress`              | `0`          | Operation is in progress       |
| Empty/Unknown             | `-2`         | Status not yet determined      |
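
When reading raw query results in scripts, it can help to translate the numeric value back to a status name. The mapping below simply restates the table above; `t4k_status_name` is a hypothetical helper name, not a T4K command.

```bash
# Hypothetical helper: map a T4K *_info metric value to its status name,
# following the conventions table above.
t4k_status_name() {
  case "$1" in
    1)  echo "Available/Completed" ;;
    0)  echo "InProgress" ;;
    -1) echo "Failed/Error" ;;
    -2) echo "Empty/Unknown" ;;
    *)  echo "Unrecognized value: $1" ;;
  esac
}

t4k_status_name -1   # prints Failed/Error
```

Treat this as a reading aid only: `trilio_target_info` is documented separately on this page as 1=available, 0=unavailable.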

### Available Metrics

#### Backup Metrics

| Metric Name                        | Description                                             |
| ---------------------------------- | ------------------------------------------------------- |
| `trilio_backup_info`               | Backup status and metadata                              |
| `trilio_backup_storage`            | Backup size in bytes                                    |
| `trilio_backup_status_percentage`  | Backup progress (0-100)                                 |
| `trilio_backup_completed_duration` | Backup duration in minutes (only for completed backups) |
| `trilio_backup_metadata_info`      | Detailed backup object metadata                         |

#### Restore Metrics

| Metric Name                         | Description                                               |
| ----------------------------------- | --------------------------------------------------------- |
| `trilio_restore_info`               | Restore status and metadata                               |
| `trilio_restore_status_percentage`  | Restore progress (0-100)                                  |
| `trilio_restore_completed_duration` | Restore duration in minutes (only for completed restores) |

#### Target Metrics

| Metric Name             | Description                                             |
| ----------------------- | ------------------------------------------------------- |
| `trilio_target_info`    | Target availability status (1=available, 0=unavailable) |
| `trilio_target_storage` | Storage used by target in bytes                         |

#### BackupPlan Metrics

| Metric Name                  | Description                          |
| ---------------------------- | ------------------------------------ |
| `trilio_backupplan_info`     | BackupPlan status and summary        |
| `trilio_backupplan_crstatus` | BackupPlan continuous restore status |

#### Continuous Restore Metrics

| Metric Name                               | Description                       |
| ----------------------------------------- | --------------------------------- |
| `trilio_continuousrestoreplan_info`       | ContinuousRestorePlan status      |
| `trilio_consistentset_info`               | ConsistentSet status and details  |
| `trilio_consistentset_status_percentage`  | ConsistentSet progress (0-100)    |
| `trilio_consistentset_completed_duration` | ConsistentSet duration in minutes |

### Example PromQL Queries

```promql
# Count of failed backups by namespace
count(trilio_backup_info == -1) by (resource_namespace)

# List all successful backups
trilio_backup_info{status="Available"}

# Total backup storage per target
sum(trilio_backup_storage) by (target)

# Average backup duration by backupplan
avg(trilio_backup_completed_duration) by (backupplan)

# Unavailable targets
trilio_target_info == 0

# BackupPlans without successful backups
trilio_backupplan_info{protected="False"}

# Failed restores
trilio_restore_info == -1

# ConsistentSets currently in progress (ongoing ContinuousRestorePlan replication)
trilio_consistentset_info{status="InProgress"}
```

## Separating Platform and User Alerts

OpenShift adds the label `openshift_io_alert_source="platform"` to all platform alerts. You can use this to configure different routing for T4K alerts:

```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: t4k-user-alerts-routing
  namespace: <t4k-install-namespace>
spec:
  route:
    receiver: t4k-alerts-receiver
    matchers:
      # Match only user-defined alerts (not platform alerts)
      - name: openshift_io_alert_source
        value: platform
        matchType: "!="
    groupBy:
      - alertname
      - namespace

  receivers:
    - name: t4k-alerts-receiver
      slackConfigs:
        - apiURL:
            name: slack-webhook-secret
            key: webhook-url
          channel: '#t4k-alerts'
          sendResolved: true
```

## Troubleshooting

### Metrics Not Appearing

1. Verify the T4K exporter pod is running:

   ```bash
   oc get pods -n <t4k-install-namespace> -l app.kubernetes.io/name=k8s-triliovault-exporter
   ```
2. Check if the ServiceMonitor is correctly configured:

   ```bash
   oc get servicemonitor -n <t4k-install-namespace>
   ```
3. Verify Prometheus is scraping the target:

   ```bash
   oc port-forward -n openshift-user-workload-monitoring svc/prometheus-user-workload 9090:9090
   # Then visit http://localhost:9090/targets
   ```

### Alerts Not Firing

1. Verify the PrometheusRule is loaded:

   ```bash
   oc get prometheusrules -n <t4k-install-namespace>
   ```
2. Check for errors in the Prometheus logs:

   ```bash
   oc logs -n openshift-user-workload-monitoring -l app.kubernetes.io/name=prometheus -c prometheus
   ```
3. Verify the alert expression returns results:

   ```bash
   # Use the Metrics console to test your expression
   # Navigate to Observe > Metrics and run: trilio_backup_info == -1
   ```

### Alert Notifications Not Received

1. Verify the AlertmanagerConfig is applied:

   ```bash
   oc get alertmanagerconfig -n <t4k-install-namespace>
   ```
2. Check the Alertmanager logs (use `-n openshift-monitoring` instead if your alerts route through the platform Alertmanager):

   ```bash
   oc logs -n openshift-user-workload-monitoring -l app.kubernetes.io/name=alertmanager
   ```
3. Verify secrets are correctly created:

   ```bash
   oc get secrets -n <t4k-install-namespace> | grep -E 'slack|smtp|pagerduty'
   ```
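
The secret check in step 3 can be made stricter by comparing against the exact names your `AlertmanagerConfig` resources reference. The following is a minimal sketch that assumes the secret names used in the examples on this page; `missing_secrets` is a hypothetical helper name.

```bash
# Hypothetical helper: reads `oc get secrets -o name` output on stdin
# (lines such as "secret/slack-webhook-secret") and prints each expected
# secret name, passed as an argument, that is not present.
missing_secrets() {
  present=$(cat)
  for name in "$@"; do
    printf '%s\n' "$present" | grep -qxF "secret/$name" || echo "$name"
  done
}
```

For example, `oc get secrets -n <t4k-install-namespace> -o name | missing_secrets slack-webhook-secret smtp-secret pagerduty-secret` prints only the names that still need to be created.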

## Additional Resources

* [OpenShift Container Platform Monitoring Documentation](https://docs.redhat.com/en/documentation/monitoring_stack_for_red_hat_openshift/4.19/html/configuring_user_workload_monitoring)
* [Prometheus Alerting Rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)
* [Alertmanager Configuration](https://prometheus.io/docs/alerting/latest/configuration/)


