frostfs_node_grpc_server_health metric shows 1 for addresses on interfaces which are down #1102
Reference: TrueCloudLab/frostfs-node#1102
Expected Behavior
When an interface that the gRPC server listens on goes down, frostfs_node_grpc_server_health for that interface's address becomes 0.
Current Behavior
When an interface that the gRPC server listens on goes down, frostfs_node_grpc_server_health for that interface's address remains 1, even after several days.
Possible Solution
None suggested from the QA side; the fix is left to the developers.
Steps to Reproduce (for bugs)
networkctl down internal0 internal1
But the endpoint on 192.168.198.183 is not available:
Context
This prevents alerts based on this metric from firing when gRPC endpoints are expected to be available but in fact are not, because the interfaces they listen on are down.
Regression
Unknown.
Your Environment
Virtual CYP.
I want to discuss this a bit. Linux supports binding to non-local IP and bind is a separate syscall from accept, so I doubt we can receive an error when the interface goes down.
Let's check if after reassigning the IP to the service, new connections can still be accepted.
We could monitor sockets with netlink and manually close servers, but this is an entirely new feature.
Once the interfaces are up again, the endpoint seems to be working, although my command still failed, not sure if this problem is related: