Extend healthcheck statuses #1631

Open

opened 2025-02-04 07:29:26 +00:00 by fyrchik · 4 comments

fyrchik commented

2025-02-04 07:29:26 +00:00

Owner

We have multiple mechanisms for checking whether something has failed:

1. `ControlService.HealthCheck` RPC
2. `systemctl status` (via `sdnotify`)
3. Metrics.

The first two only distinguish between started and not started. However, there are multiple events that we would like to know about:

1. Shards that failed to initialize (refs #169).
2. Network interfaces that failed to initialize (consider a service listening only on localhost).

In this task we discuss, extend and unify all these mechanisms.
My suggestion is to have separate healthcheck indicators: network, storage, connection to morph (something else?)

1. `sdnotify` can transmit this in a STATUS string (https://www.man7.org/linux/man-pages/man3/sd_notify.3.html)
2. For metrics, the combined healthcheck is omitted, so as not to duplicate vmalert rules or Grafana dashboards.
3. The control service may provide more verbose output with detailed info about recent failures.

cc @potyarkin @realloc

fyrchik added the

enhancement

discussion

labels 2025-02-04 07:29:26 +00:00

potyarkin commented

2025-02-04 08:27:05 +00:00

Member

I support suggestions 2 and 3 (not duplicating alert functionality in metrics and expanding control service RPC output).

I'm hesitant about the `sd_notify` STATUS string. Sure, it would be nice to have, but who will look at it, and how often? Automated systems will probably pick up the same message from logs at the moment the status change occurs, and any alerting will be based on that. How likely is it that human operators will have access to `systemctl` but no frostfs-specific tooling that can query the control service RPC? Is this unlikely scenario worth the effort?

👀 1

fyrchik commented

2025-02-04 09:45:47 +00:00

Author

Owner

> I'm hesitant about sd_notify STATUS string

My thinking was that it is nice to have a well-known interface. When I install a new app and it fails to start, `systemctl status` comes naturally, as opposed to reading man pages and learning app-specific commands. But I agree it looks clunky.

fyrchik added the

frostfs-node

label 2025-02-04 09:46:28 +00:00

realloc commented

2025-02-04 10:23:25 +00:00

Owner

Why do we distinguish between 1 and 2? It seems that both should provide similar information, allowing us to gather details from metrics or logs.

fyrchik commented

2025-02-04 10:25:09 +00:00

Author

Owner

@realloc wrote in #1631 (comment):

> Why do we distinguish between 1 and 2? It seems that both should provide similar information, allowing us to gather details from metrics or logs.

With `sdnotify` we are a bit more limited in scope; I don't see how we can add an arbitrary level of detail there.

👍 1