Control service doesn't start with only local interface is UP #963

Open
opened 2024-02-06 11:51:13 +00:00 by anikeev-yadro · 1 comment

Expected Behavior

Control service should start with only local interface is UP. As example, for run local commands and show healthcheck.

Current Behavior

Control service doesn't start with only local interface is UP.

Steps to Reproduce (for bugs)

Node has mgmt, internal0/1, data0/1 interfaces.

  1. Down internal0/1, data0/1
  2. Restart frostfs-storage
  3. See that service started and logged bind errors to internal0/1, data0/1
фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.058Z        info        frostfs-node/main.go:76        initializing gRPC service...
фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.059Z        error        frostfs-node/grpc.go:33        can't listen gRPC endpoint        {"error": "listen tcp 192.168.201.36:8080: bind: cannot assign requested address"}
фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.059Z        error        frostfs-node/grpc.go:33        can't listen gRPC endpoint        {"error": "listen tcp 192.168.200.36:8080: bind: cannot assign requested address"}
фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.059Z        error        frostfs-node/grpc.go:33        can't listen gRPC endpoint        {"error": "listen tcp 10.78.130.36:8080: bind: cannot assign requested address"}
фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.059Z        error        frostfs-node/grpc.go:33        can't listen gRPC endpoint        {"error": "listen tcp 10.78.131.36:8080: bind: cannot assign requested address"}
фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.061Z        error        frostfs-node/grpc.go:33        can't listen gRPC endpoint        {"error": "listen tcp 192.168.201.36:8081: bind: cannot assign requested address"}
фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.062Z        error        frostfs-node/grpc.go:33        can't listen gRPC endpoint        {"error": "listen tcp 192.168.200.36:8081: bind: cannot assign requested address"}
фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.064Z        error        frostfs-node/grpc.go:33        can't listen gRPC endpoint        {"error": "listen tcp 10.78.130.36:8081: bind: cannot assign requested address"}
фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.067Z        error        frostfs-node/grpc.go:33        can't listen gRPC endpoint        {"error": "listen tcp 10.78.131.36:8081: bind: cannot assign requested address"}
фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.067Z        info        frostfs-node/main.go:78        gRPC service has been successfully initialized

Then binded on local interface

фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.116Z        info        frostfs-node/grpc.go:190        start listening endpoint        {"service": "gRPC", "endpoint": "127.0.0.1:8080"}
  1. Control service doesn't available
root@aanikeev-node1:~# sudo frostfs-cli --endpoint 127.0.0.1:8091 -w /etc/frostfs/storage/wallet.json --config /tmp/storage.yaml control shards list
can't create API client: can't init SDK client: gRPC dial: context deadline exceeded
  1. Start data1 and see that node reconnected to it
ев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.070Z        info        frostfs-node/grpc.go:104        gRPC server reconnected successfully        {"endpoint": "10.78.131.36:8081"}
фев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.070Z        info        frostfs-node/grpc.go:79        reconnecting gRPC server...        {"endpoint": "10.78.130.36:8081"}
фев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.067Z        info        frostfs-node/grpc.go:79        reconnecting gRPC server...        {"endpoint": "10.78.131.36:8080"}
фев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.072Z        info        frostfs-node/grpc.go:190        start listening endpoint        {"service": "gRPC", "endpoint": "10.78.131.36:8081"}
фев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.072Z        error        frostfs-node/grpc.go:90        can't listen gRPC endpoint        {"error": "listen tcp 192.168.201.36:8081: bind: cannot assign requested address"}
фев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.072Z        warn        frostfs-node/grpc.go:91        failed to reconnect gRPC server        {"next_try_in": 60}
фев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.072Z        info        frostfs-node/grpc.go:104        gRPC server reconnected successfully        {"endpoint": "10.78.131.36:8080"}
root@aanikeev-node1:~# curl -s localhost:6672 | grep endpoint
frostfs_node_grpc_server_health{endpoint="10.78.130.36:8080"} 0
frostfs_node_grpc_server_health{endpoint="10.78.130.36:8081"} 0
frostfs_node_grpc_server_health{endpoint="10.78.131.36:8080"} 1
frostfs_node_grpc_server_health{endpoint="10.78.131.36:8081"} 1
frostfs_node_grpc_server_health{endpoint="127.0.0.1:8080"} 1
frostfs_node_grpc_server_health{endpoint="192.168.200.36:8080"} 0
frostfs_node_grpc_server_health{endpoint="192.168.200.36:8081"} 0
frostfs_node_grpc_server_health{endpoint="192.168.201.36:8080"} 0
frostfs_node_grpc_server_health{endpoint="192.168.201.36:8081"} 0
  1. Control service keep doesn't available

Regression

No

Version

FrostFS Storage node
Version: v0.37.0-rc.1-244-gd0eadf7e
GoVersion: go1.20.1

Your Environment

Cloud
4 nodes

## Expected Behavior Control service should start with only local interface is UP. As example, for run local commands and show healthcheck. ## Current Behavior Control service doesn't start with only local interface is UP. ## Steps to Reproduce (for bugs) Node has mgmt, internal0/1, data0/1 interfaces. 1. Down internal0/1, data0/1 2. Restart frostfs-storage 3. See that service started and logged bind errors to internal0/1, data0/1 ``` фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.058Z info frostfs-node/main.go:76 initializing gRPC service... фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.059Z error frostfs-node/grpc.go:33 can't listen gRPC endpoint {"error": "listen tcp 192.168.201.36:8080: bind: cannot assign requested address"} фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.059Z error frostfs-node/grpc.go:33 can't listen gRPC endpoint {"error": "listen tcp 192.168.200.36:8080: bind: cannot assign requested address"} фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.059Z error frostfs-node/grpc.go:33 can't listen gRPC endpoint {"error": "listen tcp 10.78.130.36:8080: bind: cannot assign requested address"} фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.059Z error frostfs-node/grpc.go:33 can't listen gRPC endpoint {"error": "listen tcp 10.78.131.36:8080: bind: cannot assign requested address"} фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.061Z error frostfs-node/grpc.go:33 can't listen gRPC endpoint {"error": "listen tcp 192.168.201.36:8081: bind: cannot assign requested address"} фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.062Z error frostfs-node/grpc.go:33 can't listen gRPC endpoint {"error": "listen tcp 192.168.200.36:8081: bind: cannot assign requested address"} фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.064Z error frostfs-node/grpc.go:33 can't listen gRPC endpoint {"error": "listen tcp 10.78.130.36:8081: bind: cannot assign requested address"} фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.067Z error frostfs-node/grpc.go:33 can't listen gRPC endpoint {"error": "listen tcp 10.78.131.36:8081: bind: cannot assign requested address"} фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.067Z info frostfs-node/main.go:78 gRPC service has been successfully initialized ``` Then binded on local interface ``` фев 06 10:41:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:41:23.116Z info frostfs-node/grpc.go:190 start listening endpoint {"service": "gRPC", "endpoint": "127.0.0.1:8080"} ``` 4. Control service doesn't available ``` root@aanikeev-node1:~# sudo frostfs-cli --endpoint 127.0.0.1:8091 -w /etc/frostfs/storage/wallet.json --config /tmp/storage.yaml control shards list can't create API client: can't init SDK client: gRPC dial: context deadline exceeded ``` 5. Start data1 and see that node reconnected to it ``` ев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.070Z info frostfs-node/grpc.go:104 gRPC server reconnected successfully {"endpoint": "10.78.131.36:8081"} фев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.070Z info frostfs-node/grpc.go:79 reconnecting gRPC server... {"endpoint": "10.78.130.36:8081"} фев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.067Z info frostfs-node/grpc.go:79 reconnecting gRPC server... {"endpoint": "10.78.131.36:8080"} фев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.072Z info frostfs-node/grpc.go:190 start listening endpoint {"service": "gRPC", "endpoint": "10.78.131.36:8081"} фев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.072Z error frostfs-node/grpc.go:90 can't listen gRPC endpoint {"error": "listen tcp 192.168.201.36:8081: bind: cannot assign requested address"} фев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.072Z warn frostfs-node/grpc.go:91 failed to reconnect gRPC server {"next_try_in": 60} фев 06 10:57:23 aanikeev-node1 frostfs-node[1648909]: 2024-02-06T10:57:23.072Z info frostfs-node/grpc.go:104 gRPC server reconnected successfully {"endpoint": "10.78.131.36:8080"} ``` ``` root@aanikeev-node1:~# curl -s localhost:6672 | grep endpoint frostfs_node_grpc_server_health{endpoint="10.78.130.36:8080"} 0 frostfs_node_grpc_server_health{endpoint="10.78.130.36:8081"} 0 frostfs_node_grpc_server_health{endpoint="10.78.131.36:8080"} 1 frostfs_node_grpc_server_health{endpoint="10.78.131.36:8081"} 1 frostfs_node_grpc_server_health{endpoint="127.0.0.1:8080"} 1 frostfs_node_grpc_server_health{endpoint="192.168.200.36:8080"} 0 frostfs_node_grpc_server_health{endpoint="192.168.200.36:8081"} 0 frostfs_node_grpc_server_health{endpoint="192.168.201.36:8080"} 0 frostfs_node_grpc_server_health{endpoint="192.168.201.36:8081"} 0 ``` 6. Control service keep doesn't available ## Regression No ## Version ``` FrostFS Storage node Version: v0.37.0-rc.1-244-gd0eadf7e GoVersion: go1.20.1 ``` ## Your Environment Cloud 4 nodes
108 KiB
anikeev-yadro added the
bug
triage
labels 2024-02-06 11:51:13 +00:00
fyrchik added the
frostfs-node
label 2024-02-07 16:13:52 +00:00
fyrchik added this to the v0.38.0 milestone 2024-02-07 16:13:54 +00:00
aarifullin was assigned by fyrchik 2024-02-12 06:26:30 +00:00
Collaborator

How do I reproduce the bug?

  1. I shut down interfaces
sudo -i ip link set data0 down
sudo -i ip link set data1 down
sudo -i ip link set internal0 down
sudo -i ip link set internal1 down
  1. Restart frostfs-storage
sudo systemctl restart frostfs-storage
  1. Try to
sudo frostfs-cli --endpoint 127.0.0.1:8091 -w /etc/frostfs/storage/wallet.json --config /tmp/storage.yaml control shards list`
  1. Get can't create API client: can't init SDK client: gRPC dial: context deadline exceeded

Potential problems that cause the bug

The control server create a tcp-socket 127.0.0.1:8091 but, actually, gRPC service does not bind with the socket yet, despite a storage node logs control service has been successfully initialized. It makes it asynchronously.

So, when it tries to make and wait for notary deposit, it hangs here. Other jobs cannot be completed because of this hang -> control's gRPC server does not bind with 127.0.0.1:8091.

I suppose the hang is happening, because local neo-go instance gets isolated from others after interfaces are shut down:

WARN        peer disconnected        {"addr": "192.168.201.204:40333", "error": "ping/pong timeout", "peerCount": 0}
WARN        peer disconnected        {"addr": "192.168.200.182:40333", "error": "ping/pong timeout", "peerCount": 1}
WARN        peer disconnected        {"addr": "192.168.200.184:40333", "error": "ping/pong timeout", "peerCount": 2}

Wait sees that like no blocks are produced.

Possible solution

neo-go should be also restarted to get it failed. The failure must cause frostfs-storage to switch RPC to available endpoint and that could fix the problem. But with shutdown data*, internal* interfaces switch is barely possible:

could not create RPC client for endpoint        {"error": "WS client creation: dial tcp 192.168.201.184:40332: i/o timeout", "endpoint": "ws://192.168.201.184:40332/ws"}

Here a client is trying to iterate over other endpoints

### How do I reproduce the bug? 1. I shut down interfaces ```bash sudo -i ip link set data0 down sudo -i ip link set data1 down sudo -i ip link set internal0 down sudo -i ip link set internal1 down ``` 2. Restart `frostfs-storage` ```bash sudo systemctl restart frostfs-storage ``` 3. Try to ``` sudo frostfs-cli --endpoint 127.0.0.1:8091 -w /etc/frostfs/storage/wallet.json --config /tmp/storage.yaml control shards list` ``` 4. Get `can't create API client: can't init SDK client: gRPC dial: context deadline exceeded` ### Potential problems that cause the bug The control server create a tcp-socket `127.0.0.1:8091` but, actually, gRPC service does not bind with the socket yet, despite a storage node logs `control service has been successfully initialized`. It makes it asynchronously. So, when it tries to make and wait for notary deposit, it hangs [here](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/client.go#L371-L411). Other jobs cannot be completed because of this hang -> control's gRPC server does not bind with `127.0.0.1:8091`. I suppose the hang is happening, because local neo-go instance gets isolated from others after interfaces are shut down: ``` WARN peer disconnected {"addr": "192.168.201.204:40333", "error": "ping/pong timeout", "peerCount": 0} WARN peer disconnected {"addr": "192.168.200.182:40333", "error": "ping/pong timeout", "peerCount": 1} WARN peer disconnected {"addr": "192.168.200.184:40333", "error": "ping/pong timeout", "peerCount": 2} ``` `Wait` sees that like no blocks are produced. ### Possible solution `neo-go` should be also restarted to get it failed. The failure must cause frostfs-storage to switch RPC to available endpoint and that could fix the problem. But with shutdown data*, internal* interfaces switch is barely possible: ``` could not create RPC client for endpoint {"error": "WS client creation: dial tcp 192.168.201.184:40332: i/o timeout", "endpoint": "ws://192.168.201.184:40332/ws"} ``` Here a client is trying to iterate over other endpoints
fyrchik modified the milestone from v0.38.0 to v0.39.0 2024-03-29 12:28:15 +00:00
fyrchik modified the milestone from v0.39.0 to v0.38.0 2024-03-29 12:28:35 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: TrueCloudLab/frostfs-node#963
There is no content yet.