frostfs_node_grpc_server_health metric shows 1 for addresses on interfaces which are down #1102

Open
opened 2024-04-22 09:55:01 +00:00 by an-nikitin · 2 comments

Expected Behavior

When an interface that the gRPC server listens on goes down, frostfs_node_grpc_server_health for that interface's address becomes 0.

Current Behavior

When an interface that the gRPC server listens on goes down, frostfs_node_grpc_server_health for that interface's address remains 1 even after several days.

Possible Solution

As a QA engineer I cannot suggest a fix; a solution is up to the developers.

Steps to Reproduce (for bugs)

  1. Make internal interfaces go down:
    networkctl down internal0 internal1
  2. Wait for several minutes, then check the metric:
service@annikitin-node2[alone_datacenter]:~$ sudo -i ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: mgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 86:5b:00:b9:41:f1 brd ff:ff:ff:ff:ff:ff
    altname enp0s18
    inet 10.78.132.183/23 metric 1024 brd 10.78.133.255 scope global mgmt
       valid_lft forever preferred_lft forever
    inet6 fe80::845b:ff:feb9:41f1/64 scope link
       valid_lft forever preferred_lft forever
3: data0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP group default qlen 1000
    link/ether ea:1f:30:3c:65:be brd ff:ff:ff:ff:ff:ff
    altname enp0s19
    inet 10.78.128.183/22 brd 10.78.131.255 scope global data0
       valid_lft forever preferred_lft forever
    inet6 fe80::e81f:30ff:fe3c:65be/64 scope link
       valid_lft forever preferred_lft forever
4: data1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP group default qlen 1000
    link/ether ce:2e:83:b6:f5:95 brd ff:ff:ff:ff:ff:ff
    altname enp0s20
    inet 10.78.129.183/22 brd 10.78.131.255 scope global data1
       valid_lft forever preferred_lft forever
    inet6 fe80::cc2e:83ff:feb6:f595/64 scope link
       valid_lft forever preferred_lft forever
5: internal0: <BROADCAST,MULTICAST> mtu 9000 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 8a:53:8c:c0:40:cf brd ff:ff:ff:ff:ff:ff
    altname enp0s21
6: internal1: <BROADCAST,MULTICAST> mtu 9000 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 26:b9:8b:02:da:7e brd ff:ff:ff:ff:ff:ff
    altname enp0s22
7: service: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 66:1f:8e:03:e5:ed brd ff:ff:ff:ff:ff:ff
    altname enp0s23
    inet 192.168.254.253/29 brd 192.168.254.255 scope global service
       valid_lft forever preferred_lft forever
    inet6 fe80::641f:8eff:fe03:e5ed/64 scope link
       valid_lft forever preferred_lft forever
8: ens1: <BROADCAST,MULTICAST> mtu 9000 qdisc noop state DOWN group default qlen 1000
    link/ether d2:c3:d6:b4:f7:86 brd ff:ff:ff:ff:ff:ff
    altname enp1s1
service@annikitin-node2[alone_datacenter]:~$ curl http://127.0.0.1:6672/metrics | grep frostfs_node_grpc_server_health
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP frostfs_node_grpc_server_health GRPC Server Endpoint health
# TYPE frostfs_node_grpc_server_health gauge
frostfs_node_grpc_server_health{endpoint="10.78.128.183:8080"} 1
frostfs_node_grpc_server_health{endpoint="10.78.128.183:8081"} 1
frostfs_node_grpc_server_health{endpoint="10.78.129.183:8080"} 1
frostfs_node_grpc_server_health{endpoint="10.78.129.183:8081"} 1
frostfs_node_grpc_server_health{endpoint="127.0.0.1:8080"} 1
frostfs_node_grpc_server_health{endpoint="192.168.198.183:8080"} 1
frostfs_node_grpc_server_health{endpoint="192.168.198.183:8081"} 1
frostfs_node_grpc_server_health{endpoint="192.168.199.183:8080"} 1
frostfs_node_grpc_server_health{endpoint="192.168.199.183:8081"} 1
100 1010k    0 1010k    0     0  47.8M      0 --:--:-- --:--:-- --:--:-- 49.3M

But the endpoint on 192.168.198.183 is not available:

service@annikitin-node1[alone_datacenter]:~$ sudo frostfs-cli --config /etc/frostfs/storage/control.yml container create --name test1 --policy "REP 2" --basic-acl "0FBFBFFF" --await --rpc-endpoint 192.168.198.183:8080
can't create API client: can't init SDK client: gRPC dial: context deadline exceeded

Context

This prevents alerts based on this metric from firing when gRPC endpoints are expected to be available but in fact are not, because the interfaces they listen on are down.

Regression

Unknown.

Your Environment

Virtual CYP.

an-nikitin added the bug and triage labels 2024-04-22 09:55:01 +00:00
Owner

I want to discuss this a bit. Linux supports [binding to non-local IP](https://sysctl-explorer.net/net/ipv4/ip_nonlocal_bind/) and `bind` is a separate syscall from `accept`, so I doubt we can receive an error when the interface goes down.

Let's check if after reassigning the IP to the service, new connections can still be accepted.

We could monitor sockets with netlink and manually close servers, but this is an entirely new feature.
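
As an illustration only, here is a minimal sketch of that netlink idea, assuming the `github.com/vishvananda/netlink` package and a hypothetical `watchEndpointAddr` helper (this is not the frostfs-node implementation). It subscribes to address updates and reports when the address a gRPC endpoint is bound to is removed from or re-added to the host, at which point the node could close the server or just flip the health gauge:

```go
// Sketch only: watch for the disappearance of the address a gRPC endpoint
// is bound to, using netlink address-update subscriptions.
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

// watchEndpointAddr (hypothetical helper) sends true/false on the returned
// channel when the given IP is added to or removed from any interface.
func watchEndpointAddr(ip net.IP, stop <-chan struct{}) (<-chan bool, error) {
	updates := make(chan netlink.AddrUpdate)
	if err := netlink.AddrSubscribe(updates, stop); err != nil {
		return nil, err
	}

	state := make(chan bool)
	go func() {
		defer close(state)
		for u := range updates {
			if u.LinkAddress.IP.Equal(ip) {
				state <- u.NewAddr // true = address added, false = deleted
			}
		}
	}()
	return state, nil
}

func main() {
	stop := make(chan struct{})
	defer close(stop)

	state, err := watchEndpointAddr(net.ParseIP("192.168.198.183"), stop)
	if err != nil {
		log.Fatal(err)
	}
	for up := range state {
		// Here the node could close/reopen the listener for this endpoint,
		// or simply set frostfs_node_grpc_server_health to 0/1.
		log.Printf("endpoint address present: %v", up)
	}
}
```

Whether to tear down the server or only update the metric would be a separate design decision; the subscription merely makes the state change observable.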

Author

Once the interfaces are up again, the endpoint seems to be working, although my command still failed; I am not sure whether this problem is related:

service@annikitin-node1[alone_datacenter]:~$ sudo frostfs-cli --config /etc/frostfs/storage/control.yml container create --name test1 --policy "REP 2" --basic-acl "0FBFBFFF" --await --rpc-endpoint 192.168.198.183:8080
CID: 2VL96DWPX8hersn9AAbx2k8EfjDv3e5bREsCCoJLxzqm
awaiting...
timeout: container has not been persisted on sidechain
fyrchik added this to the vNext milestone 2024-04-27 08:10:08 +00:00
fyrchik changed title from frostfs_node_grpc_server_health metric shows 1 for addresses on interfaces which are down to `frostfs_node_grpc_server_health` metric shows 1 for addresses on interfaces which are down 2024-05-16 08:38:45 +00:00
fyrchik added the frostfs-node label and removed the triage label 2024-05-16 08:38:51 +00:00