T4K Integration with Observability Stack

Introduction

The Observability Stack is a pre-packaged distribution for monitoring, logging, and dashboarding that can be installed into any existing Kubernetes cluster. It bundles several of the most popular open-source observability tools, including Prometheus, Grafana, Promtail, and Loki. The stack provides a straightforward, maintainable solution for analyzing server traffic and identifying potential deployment problems.

T4K Installation with Observability using Trilio Operator

To install the operator with observability enabled, install the latest Helm chart with the following parameter set:

```bash
helm repo add triliovault-operator https://charts.k8strilio.net/trilio-stable/k8s-triliovault-operator
helm install tvm triliovault-operator/k8s-triliovault-operator --set observability.enabled=true
```

Observability Stack Configurable Parameters

The following table lists the configuration parameters of the observability stack:

| Parameter | Description | Default |
|---|---|---|
| observability.enabled | Enable the observability stack | false |
| observability.name | Observability name for T4K integration | tvk-integration |
| observability.logging.loki.enabled | Enable Loki (logging stack) | true |
| observability.logging.loki.fullnameOverride | Name of the Loki service | "loki" |
| observability.logging.loki.singleBinary.persistence.enabled | Enable Loki persistent storage | true |
| observability.logging.loki.singleBinary.persistence.accessModes | Loki persistent storage access modes | ReadWriteOnce |
| observability.logging.loki.singleBinary.persistence.size | Loki persistent storage size | 10Gi |
| observability.logging.loki.loki.limits_config.reject_old_samples_max_age | Loki config: maximum accepted sample age before rejection | 168h |
| observability.logging.loki.tableManager.retention_period | Loki config: how far back tables are kept before deletion; 0s disables deletion | 168h |
| observability.logging.promtail.enabled | Enable Promtail (logging stack) | true |
| observability.logging.promtail.fullnameOverride | Name of the Promtail service | "promtail" |
| observability.logging.promtail.config.clients.url | Loki URL for Promtail integration | |
| observability.monitoring.prometheus.enabled | Enable Prometheus (monitoring stack) | true |
| observability.monitoring.prometheus.fullnameOverride | Name of the Prometheus service | "prom" |
| observability.monitoring.prometheus.server.enabled | Enable the Prometheus server | true |
| observability.monitoring.prometheus.server.fullnameOverride | Name of the Prometheus server service | "prom-server" |
| observability.monitoring.prometheus.server.persistentVolume.enabled | Enable persistent volume for the Prometheus server | false |
| observability.monitoring.prometheus.kube-state-metrics.enabled | Enable kube-state-metrics | false |
| observability.monitoring.prometheus.prometheus-node-exporter.enabled | Enable the Prometheus node exporter | false |
| observability.monitoring.prometheus.prometheus-pushgateway.enabled | Enable the Prometheus Pushgateway | false |
| observability.monitoring.prometheus.alertmanager.enabled | Enable the Prometheus Alertmanager | false |
| observability.visualization.grafana.enabled | Enable Grafana (visualization stack) | true |
| observability.visualization.grafana.adminPassword | Grafana admin user password | "admin123" |
| observability.visualization.grafana.fullnameOverride | Name of the Grafana service | "grafana" |
| observability.visualization.grafana.service.type | Grafana service type | "ClusterIP" |

Check the observability stack configuration by running the following command:
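The original command is not shown here; a reasonable check, assuming the release name tvm from the installation example above, is:

```bash
# Show the user-supplied values for the release, including the observability section
helm get values tvm

# Confirm that the observability pods (Prometheus, Grafana, Loki, Promtail) are running
kubectl get pods
```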

Enabling ServiceMonitor for T4K Metrics

The T4K exporter exposes Prometheus metrics on port 8080. You can enable a ServiceMonitor for automatic metrics discovery by Prometheus.

Enabling ServiceMonitor via Helm

Enable ServiceMonitor during T4K installation or upgrade:
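A sketch of the install command, reusing the release and chart names from the installation example above and the parameters from the table below:

```bash
helm install tvm triliovault-operator/k8s-triliovault-operator \
  --set installTVK.exporter.enabled=true \
  --set installTVK.exporter.serviceMonitor.enabled=true
```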

Or upgrade an existing installation:
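For example, assuming the release name tvm:

```bash
helm upgrade tvm triliovault-operator/k8s-triliovault-operator \
  --reuse-values \
  --set installTVK.exporter.serviceMonitor.enabled=true
```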

ServiceMonitor Configuration Parameters

| Parameter | Description | Default |
|---|---|---|
| installTVK.exporter.enabled | Enable/disable the metrics exporter | true |
| installTVK.exporter.serviceMonitor.enabled | Enable Prometheus ServiceMonitor for metrics collection | false |
| installTVK.exporter.resources.requests.cpu | CPU request for the exporter pod | 50m |
| installTVK.exporter.resources.requests.memory | Memory request for the exporter pod | 512Mi |

When ServiceMonitor is enabled, the Helm chart creates:

  • A Service exposing the exporter metrics on port 8080

  • A ServiceMonitor resource that configures Prometheus to scrape metrics from the exporter

Note: When exporter.serviceMonitor.enabled is set to false (the default), the exporter pod instead includes Prometheus scrape annotations:

  • prometheus.io/scrape: "true"

  • prometheus.io/path: /metrics

  • prometheus.io/port: "8080"

If your Prometheus is configured to discover targets via pod annotations, metrics will be collected automatically without a ServiceMonitor.

Verifying Metrics Collection

After enabling the ServiceMonitor, verify that Prometheus is scraping T4K metrics:

  1. Access Prometheus or Grafana UI

  2. Query for T4K metrics:

  3. You should see metrics with labels like backup, backupplan, resource_namespace, etc.
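For example, the following queries use metric names documented in the T4K Metrics Reference section:

```promql
# All backup status metrics
trilio_backup_info

# Failed backups only (status value -1, per the metric value conventions)
trilio_backup_info == -1
```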

Alertmanager Configuration

Alertmanager handles alerts sent by Prometheus server and manages routing, grouping, and notification. The observability stack includes Alertmanager as a sub-chart that can be enabled for T4K monitoring.

Enabling Alertmanager

To enable Alertmanager with the observability stack, set the following parameter during installation:
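For example, combining it with the base observability parameter:

```bash
helm install tvm triliovault-operator/k8s-triliovault-operator \
  --set observability.enabled=true \
  --set observability.monitoring.prometheus.alertmanager.enabled=true
```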

Alertmanager Configurable Parameters

The following table lists the Alertmanager-specific configuration parameters:

| Parameter | Description | Default |
|---|---|---|
| observability.monitoring.prometheus.alertmanager.enabled | Enable Alertmanager | false |
| observability.monitoring.prometheus.alertmanager.image.repository | Alertmanager container image repository | quay.io/prometheus/alertmanager |
| observability.monitoring.prometheus.alertmanager.configmapReload.image.repository | Alertmanager configmap reload image repository | quay.io/prometheus-operator/prometheus-config-reloader |
| observability.monitoring.prometheus.alertmanager.replicaCount | Number of Alertmanager replicas | 1 |
| observability.monitoring.prometheus.alertmanager.persistence.enabled | Enable persistent storage for Alertmanager | true |
| observability.monitoring.prometheus.alertmanager.persistence.size | Alertmanager persistent volume size | 50Mi |
| observability.monitoring.prometheus.alertmanager.persistence.accessModes | Alertmanager persistent volume access modes | ReadWriteOnce |
| observability.monitoring.prometheus.alertmanager.service.type | Alertmanager service type | ClusterIP |
| observability.monitoring.prometheus.alertmanager.service.port | Alertmanager service port | 9093 |
| observability.monitoring.prometheus.alertmanager.ingress.enabled | Enable ingress for Alertmanager | false |

Minimal Alertmanager Configuration

The following is a minimal Alertmanager configuration sample with basic routing:
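The original sample is not reproduced here; the following sketch shows the general shape, assuming the Alertmanager sub-chart accepts its configuration under a config key as in the upstream prometheus-community chart (verify the exact key layout against the chart's values.yaml):

```yaml
# values-alertmanager.yaml (hypothetical filename)
observability:
  monitoring:
    prometheus:
      alertmanager:
        enabled: true
        config:
          route:
            receiver: default-receiver
            group_by: ['alertname']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 4h
          receivers:
            - name: default-receiver
```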

Install with:
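Assuming the minimal configuration was saved as values-alertmanager.yaml (a hypothetical filename):

```bash
helm install tvm triliovault-operator/k8s-triliovault-operator \
  --set observability.enabled=true \
  -f values-alertmanager.yaml
```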

Example: Alertmanager with Slack Notifications

The following example demonstrates how to configure Alertmanager to send alerts to a Slack channel:
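A sketch of such a configuration; the webhook URL and channel are placeholders, and the config key layout should be verified against the chart's values.yaml:

```yaml
# values-slack.yaml (hypothetical filename)
observability:
  monitoring:
    prometheus:
      alertmanager:
        enabled: true
        config:
          global:
            slack_api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # placeholder
          route:
            receiver: slack-notifications
          receivers:
            - name: slack-notifications
              slack_configs:
                - channel: '#t4k-alerts'          # placeholder channel
                  send_resolved: true
                  title: '{{ .CommonLabels.alertname }}'
                  text: '{{ .CommonAnnotations.description }}'
```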

Install with the custom values file:
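For example, assuming the values were saved as values-slack.yaml (a hypothetical filename):

```bash
helm install tvm triliovault-operator/k8s-triliovault-operator \
  --set observability.enabled=true \
  -f values-slack.yaml
```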

Example: Alertmanager with Email Notifications

The following example demonstrates how to configure Alertmanager to send alerts via email:
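An illustrative sketch; all addresses and the SMTP host are placeholders, and for production the password should come from a Secret (see the Kubernetes Secrets example later in this section):

```yaml
observability:
  monitoring:
    prometheus:
      alertmanager:
        enabled: true
        config:
          global:
            smtp_smarthost: smtp.example.com:587
            smtp_from: alertmanager@example.com
            smtp_auth_username: alertmanager@example.com
            smtp_auth_password: <smtp-password>   # placeholder
          route:
            receiver: email-notifications
          receivers:
            - name: email-notifications
              email_configs:
                - to: oncall@example.com
                  send_resolved: true
```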

Example: Alertmanager with PagerDuty Integration

The following example demonstrates how to configure Alertmanager with PagerDuty for incident management:
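An illustrative sketch; the routing key is a placeholder for your PagerDuty Events API v2 integration key:

```yaml
observability:
  monitoring:
    prometheus:
      alertmanager:
        enabled: true
        config:
          route:
            receiver: pagerduty
          receivers:
            - name: pagerduty
              pagerduty_configs:
                - routing_key: <pagerduty-integration-key>   # placeholder
                  severity: '{{ .CommonLabels.severity }}'
```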

Example: Alertmanager with Custom Templates

Alertmanager templates allow you to customize the format and content of notifications. The following example demonstrates how to create custom templates for T4K alerts:
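A sketch of the general approach; the key used to ship template files (templates below) varies between chart versions, so verify it against the chart's values.yaml:

```yaml
observability:
  monitoring:
    prometheus:
      alertmanager:
        enabled: true
        templates:                                  # key name varies by chart version
          t4k.tmpl: |
            {{ define "t4k.title" }}[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}{{ end }}
            {{ define "t4k.text" }}{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}{{ end }}
        config:
          templates:
            - '/etc/alertmanager/*.tmpl'
          route:
            receiver: slack-notifications
          receivers:
            - name: slack-notifications
              slack_configs:
                - channel: '#t4k-alerts'            # placeholder
                  title: '{{ template "t4k.title" . }}'
                  text: '{{ template "t4k.text" . }}'
```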

Template Functions Reference

The following template functions are commonly used in Alertmanager templates:

| Function | Description | Example |
|---|---|---|
| toUpper | Converts a string to uppercase | `{{ .Status \| toUpper }}` |
| toLower | Converts a string to lowercase | `{{ .Labels.severity \| toLower }}` |
| title | Converts a string to title case | `{{ .Labels.alertname \| title }}` |
| join | Joins list elements with a separator | `{{ .Labels.Values \| join ", " }}` |
| safeHtml | Marks a string as safe HTML | `{{ .Annotations.description \| safeHtml }}` |
| reReplaceAll | Regex replace | `{{ reReplaceAll "(.*):(.*)" "$1" .Labels.instance }}` |

Template Variables

Common variables available in templates:

| Variable | Description |
|---|---|
| .Status | Alert status ("firing" or "resolved") |
| .Alerts | List of all alerts in the group |
| .Alerts.Firing | List of currently firing alerts |
| .Alerts.Resolved | List of resolved alerts |
| .CommonLabels | Labels common to all alerts |
| .CommonAnnotations | Annotations common to all alerts |
| .ExternalURL | URL to Alertmanager |
| .GroupLabels | Labels used for grouping |

Example: Using Kubernetes Secrets for Credentials

For production environments, it is recommended to store sensitive credentials (such as Slack webhook URLs, SMTP passwords, or PagerDuty keys) in Kubernetes Secrets instead of hardcoding them in Helm values.

Step 1: Create Kubernetes Secret

First, create a secret containing your sensitive credentials:
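For example (the key names below are illustrative; the secret name alertmanager-secrets matches the one referenced later in this section):

```bash
kubectl create secret generic alertmanager-secrets \
  --namespace <t4k-namespace> \
  --from-literal=slack-webhook-url='https://hooks.slack.com/services/XXX/YYY/ZZZ' \
  --from-literal=smtp-password='<smtp-password>'
```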

Or using a YAML manifest:
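An equivalent manifest sketch (stringData avoids manual base64 encoding):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-secrets
  namespace: <t4k-namespace>
type: Opaque
stringData:
  slack-webhook-url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder
  smtp-password: <smtp-password>                                    # placeholder
```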

Apply the secret:
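Assuming the manifest was saved as alertmanager-secret.yaml (a hypothetical filename):

```bash
kubectl apply -f alertmanager-secret.yaml
```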

Step 2: Configure Alertmanager to Use Secret

Configure Alertmanager to mount the secret and reference credentials from environment variables or files:
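A sketch of this wiring using the _file fields documented below; the extraSecretMounts key follows the upstream prometheus-community Alertmanager chart and should be verified against the chart's values.yaml:

```yaml
observability:
  monitoring:
    prometheus:
      alertmanager:
        enabled: true
        extraSecretMounts:                  # key name varies by chart version
          - name: alertmanager-secrets
            secretName: alertmanager-secrets
            mountPath: /etc/secrets
            readOnly: true
        config:
          global:
            slack_api_url_file: /etc/secrets/slack-webhook-url
            smtp_auth_password_file: /etc/secrets/smtp-password
```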

Secret File Reference Options

Alertmanager supports a _file suffix for many credential fields, which reads the value from a file instead of inline configuration:

| Original Field | File Reference Field | Description |
|---|---|---|
| slack_api_url | slack_api_url_file | Global Slack webhook URL |
| api_url | api_url_file | Per-receiver Slack webhook URL |
| smtp_auth_password | smtp_auth_password_file | SMTP password |
| smtp_auth_identity | smtp_auth_identity_file | SMTP identity |
| smtp_auth_secret | smtp_auth_secret_file | SMTP secret |
| service_key | service_key_file | PagerDuty service key |
| routing_key | routing_key_file | PagerDuty routing key |
| token | token_file | Opsgenie/VictorOps token |
| url | url_file | Webhook URL |

Example: Complete Setup with External Secrets Operator

For organizations using External Secrets Operator (ESO) to sync secrets from external secret managers (AWS Secrets Manager, HashiCorp Vault, etc.):
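A sketch of an ExternalSecret that produces the alertmanager-secrets Secret; the store name and remote key path are placeholders:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: alertmanager-secrets
  namespace: <t4k-namespace>
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: <secret-store-name>         # placeholder
    kind: ClusterSecretStore
  target:
    name: alertmanager-secrets
  data:
    - secretKey: slack-webhook-url
      remoteRef:
        key: <remote-secret-path>     # placeholder
        property: slack-webhook-url
```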

Apply the ExternalSecret:
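Assuming the manifest was saved as alertmanager-externalsecret.yaml (a hypothetical filename):

```bash
kubectl apply -f alertmanager-externalsecret.yaml
```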

The External Secrets Operator will automatically create and sync the alertmanager-secrets Kubernetes Secret from your external secret manager.


Example: Complete Observability Stack with Alertmanager

The following is a comprehensive example enabling the full observability stack with Alertmanager, custom alerting rules, and persistent storage:
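The full example is not reproduced here; the following sketch combines the parameters documented in the tables above (custom alerting rules are omitted, since their exact key depends on the chart version):

```yaml
# values-observability.yaml (hypothetical filename)
observability:
  enabled: true
  logging:
    loki:
      enabled: true
      singleBinary:
        persistence:
          enabled: true
          size: 10Gi
    promtail:
      enabled: true
  monitoring:
    prometheus:
      enabled: true
      server:
        enabled: true
        persistentVolume:
          enabled: true
      alertmanager:
        enabled: true
        persistence:
          enabled: true
          size: 50Mi
  visualization:
    grafana:
      enabled: true
      adminPassword: <strong-password>   # placeholder; override the default
```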

Install with:
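Assuming the values were saved as values-observability.yaml (a hypothetical filename):

```bash
helm install tvm triliovault-operator/k8s-triliovault-operator \
  -f values-observability.yaml
```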

Verifying Alertmanager Installation

After installation, verify that Alertmanager is running:
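For example (the label selector is an assumption; adjust it to match your deployment's labels):

```bash
# Check that the Alertmanager pod is running
kubectl get pods -l app.kubernetes.io/name=alertmanager

# Confirm the persistent volume claim was bound (if persistence is enabled)
kubectl get pvc
```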

Access the Alertmanager UI:
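Using the service port from the configuration table above (the service name is a placeholder; look it up with kubectl get svc):

```bash
kubectl port-forward svc/<alertmanager-service> 9093:9093
```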

Then open your browser to http://localhost:9093 to view the Alertmanager UI.

T4K Metrics Reference

TrilioVault for Kubernetes (T4K) exports Prometheus metrics through the k8s-triliovault-exporter component. These metrics can be used for monitoring, alerting, and dashboarding.

Metric Value Conventions

For status-based metrics (*_info metrics), the numeric value indicates the status:

| Status | Metric Value | Description |
|---|---|---|
| Available / Completed | 1 | Resource is healthy/successful |
| Failed / Error | -1 | Resource has failed |
| InProgress | 0 | Operation is in progress |
| Empty/Unknown | -2 | Status not yet determined |

Available Metrics

Backup Metrics

| Metric Name | Description | Key Labels |
|---|---|---|
| trilio_backup_info | Backup status and metadata | backup, backupplan, resource_namespace, status, target, backup_type, start_ts, completion_ts, size, cluster, kind, hook, backupscope, applicationtype |
| trilio_backup_storage | Backup size in bytes | backup, backupplan, resource_namespace, status, target, backup_type, cluster, kind |
| trilio_backup_status_percentage | Backup progress (0-100) | backup, backupplan, resource_namespace, status, target, backup_type, cluster, kind |
| trilio_backup_completed_duration | Backup duration in minutes (only for completed backups) | backup, backupplan, resource_namespace, status, target, backup_type, cluster, kind |
| trilio_backup_metadata_info | Detailed backup object metadata | backup, backupplan, resource_namespace, status, objecttype, objectname, backupscope, applicationtype, apiversion, apigroup, object_resource |

Restore Metrics

| Metric Name | Description | Key Labels |
|---|---|---|
| trilio_restore_info | Restore status and metadata | restore, backup, backupplan, resource_namespace, status, target, size, start_ts, completion_ts, cluster, kind |
| trilio_restore_status_percentage | Restore progress (0-100) | restore, backup, resource_namespace, status, target, cluster, kind |
| trilio_restore_completed_duration | Restore duration in minutes (only for completed restores) | restore, backup, resource_namespace, status, target, cluster, kind |

Target Metrics

| Metric Name | Description | Key Labels |
|---|---|---|
| trilio_target_info | Target availability status (1=available, 0=unavailable) | target, resource_namespace, status, vendor, vendorType, browsing, eventTarget, size, threshold_capacity, creation_ts, cluster |
| trilio_target_storage | Storage used by target in bytes | target, resource_namespace, status, vendor, vendorType, threshold_capacity, creation_ts, cluster |

BackupPlan Metrics

| Metric Name | Description | Key Labels |
|---|---|---|
| trilio_backupplan_info | BackupPlan status and summary | backupplan, resource_namespace, status, target, protected, backup_count, lastprotected, backupscope, applicationtype, creation_ts, cluster, kind |
| trilio_backupplan_crstatus | BackupPlan continuous restore status | backupplan, continuousrestoreinstance, continuousrestore_enabled, continuousrestoreplan, consistentset_count, cr_status, cluster, kind |

Continuous Restore Metrics

| Metric Name | Description | Key Labels |
|---|---|---|
| trilio_continuousrestoreplan_info | ContinuousRestorePlan status | continuousrestoreplan, continuousrestorepolicy, target, consistentsetcount, sourcebackupplan, sourceinstanceinfo, status, creation_ts, cluster, kind |
| trilio_consistentset_info | ConsistentSet status and details | consistentset, consistentsetscope, continuousrestoreplan, sourcebackupplan, sourceinstanceinfo, backupName, backupNamespace, backupStatus, backupSize, status, size, cluster, kind |
| trilio_consistentset_status_percentage | ConsistentSet progress (0-100) | consistentset, consistentsetscope, continuousrestoreplan, sourcebackupplan, sourceinstanceinfo, backupName, status, cluster, kind |
| trilio_consistentset_completed_duration | ConsistentSet duration in minutes | consistentset, consistentsetscope, continuousrestoreplan, sourcebackupplan, sourceinstanceinfo, backupName, status, cluster, kind |

Example PromQL Queries
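The original examples are not shown here; the following queries are sketches built from the metric names and value conventions documented above:

```promql
# Failed backups (status value -1)
trilio_backup_info == -1

# Total backup storage consumed per target, in bytes
sum by (target) (trilio_backup_storage)

# Backups currently in progress
trilio_backup_status_percentage < 100

# Restore durations in minutes for completed restores
trilio_restore_completed_duration
```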

Viewing Alert Rules in Grafana

Once alert rules are configured, you can view and manage them directly from the Grafana UI. Navigate to Alerting > Alert rules to see all configured rules, their current state, and firing alerts.

The Alert rules page shows:

  • Data source-managed rules: Alert rules defined in Prometheus configuration (e.g., /etc/config/alerting_rules.yml)

  • State: Current state of each alert (Firing, Normal, Pending, Recovering)

  • Health: Health status of the alert rule

  • Summary: Brief description of what the alert monitors

You can filter alerts by data source, dashboard, state, rule type, health status, and contact point.

View Logs From T4K UI

  • Log in to the T4K UI using your preferred authentication method

  • Select "Launch Event Viewer" on any required service or application

Launch Event Viewer option

  • After clicking the "Launch Event Viewer" option, you are redirected to the log visibility page

Logs page

Accessing Grafana Dashboards

Note: If a custom path is configured, the Grafana endpoint is:

Grafana endpoint: http://<T4K_IP>/<custom-path>/grafana
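Without a custom path, one way to reach Grafana is to port-forward the service named by grafana.fullnameOverride (the remote port below assumes the Grafana chart's default service port of 80):

```bash
kubectl port-forward svc/grafana 3000:80
```

Then open http://localhost:3000 and log in as admin with the configured adminPassword.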

Additional Monitoring Components

Kube-State-Metrics

Kube-state-metrics generates metrics about the state of Kubernetes objects. Enable it to get comprehensive cluster metrics:
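For example, assuming the release name tvm:

```bash
helm upgrade tvm triliovault-operator/k8s-triliovault-operator \
  --reuse-values \
  --set observability.monitoring.prometheus.kube-state-metrics.enabled=true
```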

Node Exporter

Node Exporter exposes hardware and OS metrics from the host machines:
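For example, assuming the release name tvm:

```bash
helm upgrade tvm triliovault-operator/k8s-triliovault-operator \
  --reuse-values \
  --set observability.monitoring.prometheus.prometheus-node-exporter.enabled=true
```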

Pushgateway

Pushgateway allows ephemeral and batch jobs to expose metrics to Prometheus:
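For example, assuming the release name tvm:

```bash
helm upgrade tvm triliovault-operator/k8s-triliovault-operator \
  --reuse-values \
  --set observability.monitoring.prometheus.prometheus-pushgateway.enabled=true
```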
