Extend healthcheck statuses #1631
Reference: TrueCloudLab/frostfs-node#1631
We have multiple mechanisms for checking whether something has failed:
- ControlService.HealthCheck RPC
- systemctl status (via sdnotify)

The first two of them only distinguish between started/not started. However, there are multiple events that we would like to know about:
In this task we discuss, extend and unify all these mechanisms.
My suggestion is to have separate healthcheck indicators: network, storage, connection to morph (something else?)
cc @potyarkin @realloc
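To make the per-indicator idea concrete, here is a minimal Go sketch. It is not taken from frostfs-node: the `Status` and `Health` types, the three status levels, and the "worst status wins" aggregation rule are all assumptions made for illustration. Only the component names (network, storage, morph) come from the suggestion above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Status is a hypothetical per-component health level (not a frostfs-node type).
type Status int

const (
	StatusOK Status = iota
	StatusDegraded
	StatusFailed
)

func (s Status) String() string {
	switch s {
	case StatusOK:
		return "OK"
	case StatusDegraded:
		return "DEGRADED"
	case StatusFailed:
		return "FAILED"
	default:
		return "UNKNOWN"
	}
}

// Health keeps one indicator per component and is safe for concurrent use.
type Health struct {
	mu         sync.RWMutex
	indicators map[string]Status
}

// NewHealth registers the given components, all starting as OK.
func NewHealth(components ...string) *Health {
	h := &Health{indicators: make(map[string]Status, len(components))}
	for _, c := range components {
		h.indicators[c] = StatusOK
	}
	return h
}

// Set updates the indicator of a single component.
func (h *Health) Set(component string, s Status) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.indicators[component] = s
}

// Overall returns the worst status among all indicators (assumed policy).
func (h *Health) Overall() Status {
	h.mu.RLock()
	defer h.mu.RUnlock()
	worst := StatusOK
	for _, s := range h.indicators {
		if s > worst {
			worst = s
		}
	}
	return worst
}

// Summary renders all indicators on one line, sorted by component name.
func (h *Health) Summary() string {
	h.mu.RLock()
	defer h.mu.RUnlock()
	names := make([]string, 0, len(h.indicators))
	for n := range h.indicators {
		names = append(names, n)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, n+"="+h.indicators[n].String())
	}
	return strings.Join(parts, " ")
}

func main() {
	h := NewHealth("network", "storage", "morph")
	h.Set("storage", StatusDegraded)
	fmt.Println(h.Overall(), "|", h.Summary())
}
```

A single aggregate keeps existing started/not started consumers working, while the per-component summary is what an extended RPC response or log line could carry.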
I support suggestions 2 and 3 (not duplicating alert functionality in metrics and expanding control service RPC output).
I'm hesitant about sd_notify STATUS string. Sure, it would be a nice thing to have but who and how often will be looking at it? Automated systems will probably pick up the same message from logs at the moment status change occurs, and any alerting will be based on that. How likely is it that human operators will have access to systemctl but will not have any frostfs-specific tooling that would query the control service RPC? Is this unlikely scenario worth the effort?
My thoughts were that it is nice to have a well-known interface. Like when I install a new app and it fails to start, systemctl status comes naturally, vs reading man pages and app-specific commands. But I agree it looks clunky.
Why do we distinguish between 1 and 2? It seems that both should provide similar information, allowing us to gather details from metrics or logs.
@realloc wrote in #1631 (comment):
With sdnotify we are a bit more limited in scope; I don't see how we can add an arbitrary level of detail here.