Extend healthcheck statuses #1631

Open
opened 2025-02-04 07:29:26 +00:00 by fyrchik · 4 comments
Owner

We have multiple mechanisms for checking whether something has failed:

  1. `ControlService.HealthCheck` RPC
  2. `systemctl status` (via `sdnotify`)
  3. Metrics.

The first two of them only distinguish between started/not started. However, there are multiple events that we would like to know about:

  1. Shards that failed to initialize (refs #169).
  2. Network interfaces that failed to initialize (consider service listening only on localhost).

In this task we discuss, extend and unify all these mechanisms.
My suggestion is to have separate healthcheck indicators: network, storage, connection to morph (something else?).

  1. `sdnotify` can transmit this in a STATUS string (https://www.man7.org/linux/man-pages/man3/sd_notify.3.html).
  2. For metrics, the combined healthcheck is omitted so as not to duplicate vmalert rules or Grafana dashboards.
  3. The control service may provide more verbose output with detailed info about recent failures.
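
To make the first point concrete, here is a minimal sketch of how per-component indicators could be folded into a single STATUS line. The `Health` type, the `StatusString` and `Notify` helpers, and the `name=ok` / `name=failed (reason)` format are all hypothetical and are not taken from frostfs-node; the only part that comes from the sd_notify protocol itself is sending a `STATUS=...` datagram to the socket named in `$NOTIFY_SOCKET`.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"sort"
	"strings"
)

// Health maps an indicator name (e.g. "network", "storage", "morph") to the
// reason of its latest failure; an empty string means the indicator is healthy.
// Hypothetical type, used only for this illustration.
type Health map[string]string

// StatusString renders all indicators as one line, sorted by name so the
// output does not depend on map iteration order.
func StatusString(h Health) string {
	names := make([]string, 0, len(h))
	for name := range h {
		names = append(names, name)
	}
	sort.Strings(names)

	parts := make([]string, 0, len(names))
	for _, name := range names {
		reason := h[name]
		if reason == "" {
			parts = append(parts, name+"=ok")
			continue
		}
		// The status must stay on one line: sd_notify separates
		// variable assignments with newlines.
		reason = strings.ReplaceAll(reason, "\n", " ")
		parts = append(parts, fmt.Sprintf("%s=failed (%s)", name, reason))
	}
	return strings.Join(parts, "; ")
}

// Notify sends a raw state string to the socket in $NOTIFY_SOCKET.
// It reports false without an error when the variable is not set,
// i.e. when the process is not supervised by systemd.
func Notify(state string) (bool, error) {
	path := os.Getenv("NOTIFY_SOCKET")
	if path == "" {
		return false, nil
	}
	addr := &net.UnixAddr{Name: path, Net: "unixgram"}
	conn, err := net.DialUnix("unixgram", nil, addr)
	if err != nil {
		return false, err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(state)); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	h := Health{
		"network": "",
		"storage": "shard failed to initialize",
		"morph":   "",
	}
	status := StatusString(h)
	fmt.Println(status)
	if _, err := Notify("STATUS=" + status); err != nil {
		fmt.Fprintln(os.Stderr, "sd_notify:", err)
	}
}
```

With a unit of `Type=notify`, the line sent this way would appear in the `Status:` field of `systemctl status`. The same `Health` value could also back the more verbose control service output from the third point.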

cc @potyarkin @realloc

fyrchik added the enhancement, discussion labels 2025-02-04 07:29:26 +00:00
Member

I support suggestions 2 and 3 (not duplicating alert functionality in metrics and expanding control service RPC output).

I'm hesitant about the sd_notify STATUS string. Sure, it would be a nice thing to have, but who will be looking at it, and how often? Automated systems will probably pick up the same message from logs the moment the status change occurs, and any alerting will be based on that. How likely is it that human operators will have access to systemctl but will not have any frostfs-specific tooling that would query the control service RPC? Is this unlikely scenario worth the effort?

Author
Owner

> I'm hesitant about sd_notify STATUS string

My thinking was that it is nice to have a well-known interface. For example, when I install a new app and it fails to start, `systemctl status` comes naturally, as opposed to reading man pages and learning app-specific commands. But I agree it looks clunky.

fyrchik added the frostfs-node label 2025-02-04 09:46:28 +00:00
Owner

Why do we distinguish between 1 and 2? It seems that both should provide similar information, allowing us to gather details from metrics or logs.

Author
Owner

@realloc wrote in #1631 (comment):

> Why do we distinguish between 1 and 2? It seems that both should provide similar information, allowing us to gather details from metrics or logs.

With `sdnotify` we are a bit more limited in scope: I don't see how we can add an arbitrary level of detail there.

Reference: TrueCloudLab/frostfs-node#1631