Restarting all services on random nodes leads to a cycling restart of the frostfs-ir service #241

Closed
opened 2023-04-12 13:50:44 +00:00 by anikeev-yadro · 2 comments
Member

Steps to Reproduce (for bugs)

Steps to reproduce:

1. Ran the autotest suite `testsuites.failover.test_failover_services.TestFailoverServices`. See the attached test log and Allure report.
2. At the end of the suite, the `test_frostfs_node` test failed with the following error:

```
COMMAND: frostfs-cli --config /jenkins/workspace/sbercloud_tatlin_object_test/tmp.YgKknX0jg5/tatlin-object-testcases/wallet_config.yml container create --rpc-endpoint '172.26.160.7:8080' --wallet '/jenkins/workspace/sbercloud_tatlin_object_test/tmp.YgKknX0jg5/tatlin-object-testcases/TemporaryDir/49e2774c-a354-4ca4-8c44-5d6a4db3f036.json' --basic-acl '0FBFBFFF' --await --policy 'REP 1 IN X CBF 1 SELECT 1 FROM * AS X'
RETCODE: 1

STDOUT:
syncing container's settings rpc error: network info call: status: code = 1024 message = connection lost before registering response channel

STDERR:

Start / End / Elapsed	 12:09:40.694062 / 12:09:41.078162 / 0:00:00.384100
```

3. gRPC operations also returned errors:

```
2023-04-12 15:33:32 [DEBUG] Command: frostfs-cli --config /home/bereza/src/tatlin-object-testsetup/.setup/wallets/node1-storage-config.yml netmap epoch --rpc-endpoint '172.26.160.7:8080' --wallet '/home/bereza/src/tatlin-object-testsetup/.setup/wallets/node1-storage.json'
Error:
return code: 1
output: can't create API client: can't init SDK client: gRPC dial: context deadline exceeded
```

4. Restarted all nodes to resolve the errors.
5. After that, the frostfs-ir service on node1 entered a restart loop:

```
Apr 12 13:05:11 aberezin-node1 systemd[1]: frostfs-ir.service: Scheduled restart job, restart counter is at 416.
Apr 12 13:05:11 aberezin-node1 systemd[1]: Stopped FrostFS InnerRing node.
Apr 12 13:05:11 aberezin-node1 systemd[1]: Started FrostFS InnerRing node.
Apr 12 13:05:11 aberezin-node1 frostfs-ir[16283]: could not create RPC client: WS client creation: dial tcp [::1]:40332: connect: connection refused
Apr 12 13:05:11 aberezin-node1 systemd[1]: frostfs-ir.service: Main process exited, code=exited, status=1/FAILURE
Apr 12 13:05:11 aberezin-node1 systemd[1]: frostfs-ir.service: Failed with result 'exit-code'.
```

Versions

```
FrostFS Storage node
Version: v0.0.1-393-g6bf11f7c
GoVersion: go1.18.4
```

Your Environment

Virtual

anikeev-yadro added the triage label 2023-04-12 13:50:44 +00:00
Owner

frostfs-ir restarts because it cannot connect to neo-go.
Not a bug.
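
A minimal way to confirm this diagnosis on the affected node, assuming neo-go runs there as a local systemd unit (the unit name below is an assumption and may differ in your setup):

```
# Is anything listening on the WS port that frostfs-ir tries to dial ([::1]:40332)?
ss -ltn 'sport = :40332'

# State of the local neo-go sidechain node (unit name is an assumption).
systemctl status neo-go.service --no-pager
```

If nothing is listening on that port, frostfs-ir will keep exiting with "connection refused" until neo-go is reachable again.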

Owner

While it's generally not ideal for a critical service like the IR service to die, in certain cases this can be considered acceptable. IR is designed to be stateless: it doesn't store data or maintain sessions, so it can be restarted without causing data loss or system instability, and it may therefore be safe to let systemd or another supervisor automatically restart the service after it crashes. However, it's important to configure the supervisor appropriately, setting thresholds for the number of restart attempts and the delays between them to avoid resource exhaustion or other issues.
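
As an illustration, here is a minimal sketch of such supervisor tuning via a systemd drop-in; the specific limits are assumptions, not recommended project settings:

```
# Rate-limit restarts of frostfs-ir instead of letting it cycle indefinitely.
sudo mkdir -p /etc/systemd/system/frostfs-ir.service.d
sudo tee /etc/systemd/system/frostfs-ir.service.d/restart-limits.conf <<'EOF' >/dev/null
[Unit]
# Give up if the service fails 5 times within 10 minutes.
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
Restart=on-failure
# Wait 10 seconds between restart attempts instead of restarting immediately.
RestartSec=10
EOF

sudo systemctl daemon-reload
sudo systemctl restart frostfs-ir.service
```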

However, while automating the restart of a service can be convenient, it's still essential to investigate the root cause of the failures that caused the service to crash in the first place. Implementing monitoring and logging can provide visibility into the service's health and allow timely resolution of issues that may cause the service to fail repeatedly. Therefore, while it may be safe for some services to die and restart automatically, proactive monitoring and troubleshooting are still critical to ensure a reliable and robust system.
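
For the monitoring side, the restart loop is already visible in the journal (note the "restart counter is at 416" line above); a couple of standard commands are usually enough to spot the loop and find the underlying error:

```
# Current state of the unit and the reason for the last failure.
systemctl status frostfs-ir.service --no-pager

# Recent logs for the unit; here they point at the real cause:
# "could not create RPC client: WS client creation: dial tcp [::1]:40332: connect: connection refused"
journalctl -u frostfs-ir.service --since "1 hour ago" --no-pager
```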

Reference: TrueCloudLab/frostfs-node#241