Healthcheck of TrilioVault

TrilioVault is composed of multiple services, each of which can be checked in case of errors.

On the TrilioVault Cluster

wlm-workloads

This service runs on every TrilioVault node and should always be active.

[root@TVM1 ~]# systemctl status wlm-workloads
● wlm-workloads.service - workloadmanager workloads service
Loaded: loaded (/etc/systemd/system/wlm-workloads.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2020-06-10 13:42:42 UTC; 1 weeks 4 days ago
Main PID: 12779 (workloadmgr-wor)
Tasks: 17
CGroup: /system.slice/wlm-workloads.service
├─12779 /home/stack/myansible/bin/python /home/stack/myansible/bin/workloadmgr-workloads --config-file=/etc/workloadmgr/workloadmgr.conf
├─12982 /home/stack/myansible/bin/python /home/stack/myansible/bin/workloadmgr-workloads --config-file=/etc/workloadmgr/workloadmgr.conf
├─12983 /home/stack/myansible/bin/python /home/stack/myansible/bin/workloadmgr-workloads --config-file=/etc/workloadmgr/workloadmgr.conf
├─12984 /home/stack/myansible/bin/python /home/stack/myansible/bin/workloadmgr-workloads --config-file=/etc/workloadmgr/workloadmgr.conf
[...]

wlm-api

This service runs on every TrilioVault node and should always be active.

[root@TVM1 ~]# systemctl status wlm-api
● wlm-api.service - Cluster Controlled wlm-api
Loaded: loaded (/etc/systemd/system/wlm-api.service; disabled; vendor preset: disabled)
Drop-In: /run/systemd/system/wlm-api.service.d
└─50-pacemaker.conf
Active: active (running) since Thu 2020-04-16 22:30:11 UTC; 2 months 5 days ago
Main PID: 11815 (workloadmgr-api)
Tasks: 1
CGroup: /system.slice/wlm-api.service
└─11815 /home/stack/myansible/bin/python /home/stack/myansible/bin/workloadmgr-api --config-file=/etc/workloadmgr/workloadmgr.conf

wlm-scheduler

This service runs on every TrilioVault node and should always be active.

[root@TVM1 ~]# systemctl status wlm-scheduler
● wlm-scheduler.service - Cluster Controlled wlm-scheduler
Loaded: loaded (/etc/systemd/system/wlm-scheduler.service; disabled; vendor preset: disabled)
Drop-In: /run/systemd/system/wlm-scheduler.service.d
└─50-pacemaker.conf
Active: active (running) since Thu 2020-04-02 13:49:22 UTC; 2 months 20 days ago
Main PID: 29439 (workloadmgr-sch)
Tasks: 1
CGroup: /system.slice/wlm-scheduler.service
└─29439 /home/stack/myansible/bin/python /home/stack/myansible/bin/workloadmgr-scheduler --config-file=/etc/workloadmgr/workloadmgr.conf

wlm-cron

This service is controlled by Pacemaker and runs only on the master node.

[root@TVM1 ~]# systemctl status wlm-cron
● wlm-cron.service - Cluster Controlled wlm-cron
Loaded: loaded (/etc/systemd/system/wlm-cron.service; disabled; vendor preset: disabled)
Drop-In: /run/systemd/system/wlm-cron.service.d
└─50-pacemaker.conf
Active: active (running) since Wed 2021-01-27 19:59:26 UTC; 6 days ago
Main PID: 23071 (workloadmgr-cro)
CGroup: /system.slice/wlm-cron.service
├─23071 /home/stack/myansible/bin/python3 /home/stack/myansible/bin/workloadmgr-cron --config-file=/etc/workloadmgr/workloadmgr.conf
└─23248 /home/stack/myansible/bin/python3 /home/stack/myansible/bin/workloadmgr-cron --config-file=/etc/workloadmgr/workloadmgr.conf
Feb 03 19:28:43 tvm1-ansible-ussuri-ubuntu18-vagrant workloadmgr-cron[23071]: ● wlm-cron.service - Cluster Controlled wlm-cron
Feb 03 19:28:43 tvm1-ansible-ussuri-ubuntu18-vagrant workloadmgr-cron[23071]: Loaded: loaded (/etc/systemd/system/wlm-cron.service; disabled; vendor preset: disabled)
Feb 03 19:28:43 tvm1-ansible-ussuri-ubuntu18-vagrant workloadmgr-cron[23071]: Drop-In: /run/systemd/system/wlm-cron.service.d
Feb 03 19:28:43 tvm1-ansible-ussuri-ubuntu18-vagrant workloadmgr-cron[23071]: └─50-pacemaker.conf
Feb 03 19:28:43 tvm1-ansible-ussuri-ubuntu18-vagrant workloadmgr-cron[23071]: Active: active (running) since Wed 2021-01-27 19:59:26 UTC; 6 days ago
Feb 03 19:28:43 tvm1-ansible-ussuri-ubuntu18-vagrant workloadmgr-cron[23071]: Main PID: 23071 (workloadmgr-cro)
Feb 03 19:28:43 tvm1-ansible-ussuri-ubuntu18-vagrant workloadmgr-cron[23071]: CGroup: /system.slice/wlm-cron.service
Feb 03 19:28:43 tvm1-ansible-ussuri-ubuntu18-vagrant workloadmgr-cron[23071]: ├─23071 /home/stack/myansible/bin/python3 /home/stack/myansible/bin/workloadmgr-cron --config-file=/etc/workloadmgr/workloadmgr.conf
Feb 03 19:28:43 tvm1-ansible-ussuri-ubuntu18-vagrant workloadmgr-cron[23071]: ├─23248 /home/stack/myansible/bin/python3 /home/stack/myansible/bin/workloadmgr-cron --config-file=/etc/workloadmgr/workloadmgr.conf
Feb 03 19:28:43 tvm1-ansible-ussuri-ubuntu18-vagrant workloadmgr-cron[23071]: └─27145 /usr/bin/systemctl status wlm-cron
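
Since wlm-cron is a Pacemaker resource and must run only once in the cluster, a quick way to see which node currently holds it is to filter the cluster status, for example:

[root@TVM1 ~]# pcs status | grep wlm-cron
wlm-cron (systemd:wlm-cron): Started tvm1-ansible-ussuri-ubuntu18-vagrant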

Pacemaker Cluster Status

The Pacemaker cluster controls and monitors the VIPs of the TrilioVault cluster. It also controls on which node the wlm-api and wlm-scheduler services run.

[root@TVM1 ~]# pcs status
Cluster name: triliovault
WARNINGS:
Corosync and pacemaker node names do not match (IPs used in setup?)
Stack: corosync
Current DC: tvm1-ansible-ussuri-ubuntu18-vagrant (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Wed Feb 3 19:20:02 2021
Last change: Wed Jan 27 20:00:12 2021 by root via crm_resource on tvm1-ansible-ussuri-ubuntu18-vagrant
1 node configured
6 resource instances configured
Online: [ tvm1-ansible-ussuri-ubuntu18-vagrant ]
Full list of resources:
virtual_ip (ocf::heartbeat:IPaddr2): Started tvm1-ansible-ussuri-ubuntu18-vagrant
virtual_ip_public (ocf::heartbeat:IPaddr2): Started tvm1-ansible-ussuri-ubuntu18-vagrant
virtual_ip_admin (ocf::heartbeat:IPaddr2): Started tvm1-ansible-ussuri-ubuntu18-vagrant
virtual_ip_internal (ocf::heartbeat:IPaddr2): Started tvm1-ansible-ussuri-ubuntu18-vagrant
wlm-cron (systemd:wlm-cron): Started tvm1-ansible-ussuri-ubuntu18-vagrant
Clone Set: lb_nginx-clone [lb_nginx]
Started: [ tvm1-ansible-ussuri-ubuntu18-vagrant ]
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
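
If a single resource needs a closer look, its configuration can be displayed as well, e.g. with pcs resource show (newer pcs releases use pcs resource config instead):

[root@TVM1 ~]# pcs resource show wlm-cron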

Mount availability

The TrilioVault Cluster needs access to the Backup Target and should have the correct mount at all times.

[root@TVM1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 3.8G 0 3.8G 0% /dev
tmpfs 3.8G 38M 3.8G 1% /dev/shm
tmpfs 3.8G 427M 3.4G 12% /run
tmpfs 3.8G 0 3.8G 0% /sys/fs/cgroup
/dev/vda1 40G 8.8G 32G 22% /
tmpfs 773M 0 773M 0% /run/user/996
tmpfs 773M 0 773M 0% /run/user/0
10.10.2.20:/upstream 1008G 704G 254G 74% /var/triliovault-mounts/MTAuMTAuMi4yMDovdXBzdHJlYW0=
10.10.2.20:/upstream2 483G 22G 462G 5% /var/triliovault-mounts/MTAuMTAuMi4yMDovdXBzdHJlYW0y
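
The directory names under /var/triliovault-mounts are the base64 encodings of the NFS export paths, so it is easy to verify which share is mounted where, for example:

[root@TVM1 ~]# echo MTAuMTAuMi4yMDovdXBzdHJlYW0= | base64 -d
10.10.2.20:/upstream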

The dmapi service

The dmapi service has its own Keystone endpoints, which should be checked in addition to the actual service status. An unauthenticated request against the endpoint returns HTTP 401 (Unauthorized), which confirms the API is reachable and responding.

[root@upstreamcontroller ~(keystone_admin)]# openstack endpoint list | grep dmapi
| 47918c8df8854ed49c082e398a9572be | RegionOne | dmapi | datamover | True | public | http://10.10.2.10:8784/v2 |
| cca52aff6b2a4f47bcc84b34647fba71 | RegionOne | dmapi | datamover | True | internal | http://10.10.2.10:8784/v2 |
| e9aa6630bfb74a9bb7562d4161f4e07d | RegionOne | dmapi | datamover | True | admin | http://10.10.2.10:8784/v2 |
[root@upstreamcontroller ~(keystone_admin)]# curl http://10.10.2.10:8784/v2
{"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}
[root@upstreamcontroller ~(keystone_admin)]# systemctl status tvault-datamover-api.service
● tvault-datamover-api.service - TrilioData DataMover API service
Loaded: loaded (/etc/systemd/system/tvault-datamover-api.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2020-04-12 12:31:11 EDT; 2 months 9 days ago
Main PID: 11252 (python)
Tasks: 2
CGroup: /system.slice/tvault-datamover-api.service
├─11252 /usr/bin/python /usr/bin/dmapi-api
└─11280 /usr/bin/python /usr/bin/dmapi-api
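
To confirm that the API is actually listening on the port published in the Keystone endpoints (8784 in this example), the listening sockets on the node running dmapi can be checked as well, for example:

[root@upstreamcontroller ~(keystone_admin)]# ss -tnlp | grep 8784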

The datamover service

The datamover service runs on each compute node and is integrated as a nova compute service. It therefore appears in the compute service list as nova-contego entries, which should be enabled and up.

[root@upstreamcontroller ~(keystone_admin)]# openstack compute service list
+----+----------------------+--------------------+----------+----------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+----+----------------------+--------------------+----------+----------+-------+----------------------------+
| 7 | nova-conductor | upstreamcontroller | internal | enabled | up | 2020-06-22T11:01:05.000000 |
| 8 | nova-scheduler | upstreamcontroller | internal | enabled | up | 2020-06-22T11:01:04.000000 |
| 9 | nova-consoleauth | upstreamcontroller | internal | enabled | up | 2020-06-22T11:01:01.000000 |
| 10 | nova-compute | upstreamcompute1 | US-East | enabled | up | 2020-06-22T11:01:09.000000 |
| 11 | nova-compute | upstreamcompute2 | US-West | enabled | up | 2020-06-22T11:01:09.000000 |
| 16 | nova-contego_3.0.172 | upstreamcompute2 | internal | enabled | up | 2020-06-22T11:01:07.000000 |
| 17 | nova-contego_3.0.172 | upstreamcompute1 | internal | enabled | up | 2020-06-22T11:01:02.000000 |
+----+----------------------+--------------------+----------+----------+-------+----------------------------+
[root@upstreamcompute1 ~]# systemctl status tvault-contego
● tvault-contego.service - Tvault contego
Loaded: loaded (/etc/systemd/system/tvault-contego.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2020-06-10 10:07:28 EDT; 1 weeks 4 days ago
Main PID: 10384 (python)
Tasks: 21
CGroup: /system.slice/tvault-contego.service
└─10384 /usr/bin/python /usr/bin/tvault-contego --config-file=/etc/nova/nova.conf --config-file=/etc/tvault-contego/tvault-contego.conf
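
When an NFS backup target is used, the datamover mounts it on the compute nodes as well, so the /var/triliovault-mounts mount checked on the TrilioVault cluster should also be present on each compute node, for example:

[root@upstreamcompute1 ~]# df -h | grep triliovault-mounts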