Healthcheck of Trilio

Trilio is composed of multiple services, each of which can be checked when errors occur.

On the Trilio Cluster

wlm-workloads

This service runs on every Trilio node and should always be active.

[root@Upstream ~]# systemctl status wlm-workloads
● wlm-workloads.service - workloadmanager workloads service
   Loaded: loaded (/etc/systemd/system/wlm-workloads.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-06-10 13:42:42 UTC; 1 weeks 4 days ago
 Main PID: 12779 (workloadmgr-wor)
    Tasks: 17
   CGroup: /system.slice/wlm-workloads.service
           ├─12779 /home/stack/myansible/bin/python /home/stack/myansible/bin/workloadmgr-workloads --config-file=/etc/workloadmgr/workloadmgr.conf
           ├─12982 /home/stack/myansible/bin/python /home/stack/myansible/bin/workloadmgr-workloads --config-file=/etc/workloadmgr/workloadmgr.conf
           ├─12983 /home/stack/myansible/bin/python /home/stack/myansible/bin/workloadmgr-workloads --config-file=/etc/workloadmgr/workloadmgr.conf
           ├─12984 /home/stack/myansible/bin/python /home/stack/myansible/bin/workloadmgr-workloads --config-file=/etc/workloadmgr/workloadmgr.conf
           [...]
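
Should wlm-workloads be inactive or failed on a node, it can be restarted and its recent log output checked directly on that node. A minimal sketch, using standard systemd tooling:

[root@Upstream ~]# systemctl restart wlm-workloads
[root@Upstream ~]# journalctl -u wlm-workloads --since "1 hour ago"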

wlm-api

This service runs on the Master Node of the Trilio Cluster.

[root@Upstream ~]# systemctl status wlm-api
● wlm-api.service - Cluster Controlled wlm-api
   Loaded: loaded (/etc/systemd/system/wlm-api.service; disabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/wlm-api.service.d
           └─50-pacemaker.conf
   Active: active (running) since Thu 2020-04-16 22:30:11 UTC; 2 months 5 days ago
 Main PID: 11815 (workloadmgr-api)
    Tasks: 1
   CGroup: /system.slice/wlm-api.service
           └─11815 /home/stack/myansible/bin/python /home/stack/myansible/bin/workloadmgr-api --config-file=/etc/workloadmgr/workloadmgr.conf

wlm-scheduler

This service runs on the Master Node of the Trilio Cluster.

[root@Upstream ~]# systemctl status wlm-scheduler
● wlm-scheduler.service - Cluster Controlled wlm-scheduler
   Loaded: loaded (/etc/systemd/system/wlm-scheduler.service; disabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/wlm-scheduler.service.d
           └─50-pacemaker.conf
   Active: active (running) since Thu 2020-04-02 13:49:22 UTC; 2 months 20 days ago
 Main PID: 29439 (workloadmgr-sch)
    Tasks: 1
   CGroup: /system.slice/wlm-scheduler.service
           └─29439 /home/stack/myansible/bin/python /home/stack/myansible/bin/workloadmgr-scheduler --config-file=/etc/workloadmgr/workloadmgr.conf
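
Both wlm-api and wlm-scheduler are cluster-controlled; their unit files are disabled and the services are started through Pacemaker. A restart should therefore go through the cluster rather than systemctl. A minimal sketch, using the resource names shown in the Pacemaker status below:

[root@Upstream ~]# pcs resource restart wlm-api
[root@Upstream ~]# pcs resource restart wlm-scheduler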

Pacemaker Cluster Status

The Pacemaker cluster controls and monitors the VIP (virtual IP) of the Trilio Cluster. It also determines on which node the wlm-api and wlm-scheduler services run.

[root@Upstream ~]# pcs status
Cluster name: triliovault

Stack: corosync
Current DC: Upstream2 (version 1.1.20-5.el7_7.1-3c4c782f70) - partition with quorum
Last updated: Mon Jun 22 10:52:19 2020
Last change: Thu Apr 16 22:30:11 2020 by root via cibadmin on Upstream

3 nodes configured
9 resources configured

Online: [ Upstream Upstream2 Upstream3 ]

Full list of resources:

 virtual_ip     (ocf::heartbeat:IPaddr2):       Started Upstream
 virtual_ip_public      (ocf::heartbeat:IPaddr2):       Started Upstream
 virtual_ip_admin       (ocf::heartbeat:IPaddr2):       Started Upstream
 virtual_ip_internal    (ocf::heartbeat:IPaddr2):       Started Upstream
 wlm-api        (systemd:wlm-api):      Started Upstream
 wlm-scheduler  (systemd:wlm-scheduler):        Started Upstream
 Clone Set: lb_nginx-clone [lb_nginx]
     Started: [ Upstream ]
     Stopped: [ Upstream2 Upstream3 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
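
If a resource is shown as Stopped or has failed actions on the node it is expected to run on, its failure history can be cleared so that Pacemaker re-evaluates it. A minimal sketch, using the wlm-api resource as an example:

[root@Upstream ~]# pcs resource cleanup wlm-api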

Mount availability

The Trilio Cluster needs access to the Backup Target and must have the correct mount available at all times.

[root@Upstream ~]# df -h
Filesystem             Size  Used Avail Use% Mounted on
devtmpfs               3.8G     0  3.8G   0% /dev
tmpfs                  3.8G   38M  3.8G   1% /dev/shm
tmpfs                  3.8G  427M  3.4G  12% /run
tmpfs                  3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/vda1               40G  8.8G   32G  22% /
tmpfs                  773M     0  773M   0% /run/user/996
tmpfs                  773M     0  773M   0% /run/user/0
10.10.2.20:/upstream  1008G  704G  254G  74% /var/triliovault-mounts/MTAuMTAuMi4yMDovdXBzdHJlYW0=
10.10.2.20:/upstream2  483G   22G  462G   5% /var/triliovault-mounts/MTAuMTAuMi4yMDovdXBzdHJlYW0y
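
The directory name under /var/triliovault-mounts/ is the base64 encoding of the Backup Target share. If the expected share is not listed, checking the active mounts and decoding the directory name helps verify that the correct Backup Target is attached. A minimal sketch:

[root@Upstream ~]# mount | grep triliovault-mounts
[root@Upstream ~]# echo MTAuMTAuMi4yMDovdXBzdHJlYW0= | base64 -d
10.10.2.20:/upstream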

The dmapi service

The dmapi service has its own Keystone endpoints, which should be checked in addition to the actual service status. An unauthenticated request against the endpoint is expected to return 401 Unauthorized, which confirms that the API is reachable and responding.

[root@upstreamcontroller ~(keystone_admin)]# openstack endpoint list | grep dmapi
| 47918c8df8854ed49c082e398a9572be | RegionOne | dmapi          | datamover    | True    | public    | http://10.10.2.10:8784/v2                    |
| cca52aff6b2a4f47bcc84b34647fba71 | RegionOne | dmapi          | datamover    | True    | internal  | http://10.10.2.10:8784/v2                    |
| e9aa6630bfb74a9bb7562d4161f4e07d | RegionOne | dmapi          | datamover    | True    | admin     | http://10.10.2.10:8784/v2                    |

[root@upstreamcontroller ~(keystone_admin)]# curl http://10.10.2.10:8784/v2
{"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}
[root@upstreamcontroller ~(keystone_admin)]# systemctl status tvault-datamover-api.service
● tvault-datamover-api.service - TrilioData DataMover API service
   Loaded: loaded (/etc/systemd/system/tvault-datamover-api.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2020-04-12 12:31:11 EDT; 2 months 9 days ago
 Main PID: 11252 (python)
    Tasks: 2
   CGroup: /system.slice/tvault-datamover-api.service
           ├─11252 /usr/bin/python /usr/bin/dmapi-api
           └─11280 /usr/bin/python /usr/bin/dmapi-api
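
If the endpoint does not answer or the service is not running, the service log on the node hosting the dmapi is the first place to look. A minimal sketch, using standard systemd tooling:

[root@upstreamcontroller ~(keystone_admin)]# journalctl -u tvault-datamover-api.service --since "1 hour ago"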

The datamover service

The datamover service runs on every compute node and is integrated as a nova compute service.

[root@upstreamcontroller ~(keystone_admin)]# openstack compute service list
+----+----------------------+--------------------+----------+----------+-------+----------------------------+
| ID | Binary               | Host               | Zone     | Status   | State | Updated At                 |
+----+----------------------+--------------------+----------+----------+-------+----------------------------+
|  7 | nova-conductor       | upstreamcontroller | internal | enabled  | up    | 2020-06-22T11:01:05.000000 |
|  8 | nova-scheduler       | upstreamcontroller | internal | enabled  | up    | 2020-06-22T11:01:04.000000 |
|  9 | nova-consoleauth     | upstreamcontroller | internal | enabled  | up    | 2020-06-22T11:01:01.000000 |
| 10 | nova-compute         | upstreamcompute1   | US-East  | enabled  | up    | 2020-06-22T11:01:09.000000 |
| 11 | nova-compute         | upstreamcompute2   | US-West  | enabled  | up    | 2020-06-22T11:01:09.000000 |
| 16 | nova-contego_3.0.172 | upstreamcompute2   | internal | enabled  | up    | 2020-06-22T11:01:07.000000 |
| 17 | nova-contego_3.0.172 | upstreamcompute1   | internal | enabled  | up    | 2020-06-22T11:01:02.000000 |
+----+----------------------+--------------------+----------+----------+-------+----------------------------+
[root@upstreamcompute1 ~]# systemctl status tvault-contego
● tvault-contego.service - Tvault contego
   Loaded: loaded (/etc/systemd/system/tvault-contego.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-06-10 10:07:28 EDT; 1 weeks 4 days ago
 Main PID: 10384 (python)
    Tasks: 21
   CGroup: /system.slice/tvault-contego.service
           └─10384 /usr/bin/python /usr/bin/tvault-contego --config-file=/etc/nova/nova.conf --config-file=/etc/tvault-contego/tvault-contego.conf
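
Should the nova-contego entry for a compute node report down, or tvault-contego be inactive, the service can be restarted on that compute node and its log reviewed. A minimal sketch:

[root@upstreamcompute1 ~]# systemctl restart tvault-contego
[root@upstreamcompute1 ~]# journalctl -u tvault-contego --since "1 hour ago"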
