Restarting all services on random nodes leads to a cycling restart of the frostfs-ir service #241

Closed
opened 2023-04-12 13:50:44 +00:00 by anikeev-yadro · 2 comments
Member

Steps to Reproduce (for bugs)

Steps to reproduce:

1. Ran the autotest suite `testsuites.failover.test_failover_services.TestFailoverServices`. See the attached test log and Allure report.
2. At the end of the suite, the `test_frostfs_node` test failed with the following error:

```
COMMAND: frostfs-cli --config /jenkins/workspace/sbercloud_tatlin_object_test/tmp.YgKknX0jg5/tatlin-object-testcases/wallet_config.yml container create --rpc-endpoint '172.26.160.7:8080' --wallet '/jenkins/workspace/sbercloud_tatlin_object_test/tmp.YgKknX0jg5/tatlin-object-testcases/TemporaryDir/49e2774c-a354-4ca4-8c44-5d6a4db3f036.json' --basic-acl '0FBFBFFF' --await --policy 'REP 1 IN X CBF 1 SELECT 1 FROM * AS X'
RETCODE: 1

STDOUT:
syncing container's settings rpc error: network info call: status: code = 1024 message = connection lost before registering response channel

STDERR:

Start / End / Elapsed	 12:09:40.694062 / 12:09:41.078162 / 0:00:00.384100
```

3. gRPC operations also returned errors:

```
2023-04-12 15:33:32 [DEBUG] Command: frostfs-cli --config /home/bereza/src/tatlin-object-testsetup/.setup/wallets/node1-storage-config.yml netmap epoch --rpc-endpoint '172.26.160.7:8080' --wallet '/home/bereza/src/tatlin-object-testsetup/.setup/wallets/node1-storage.json'
Error:
return code: 1
output: can't create API client: can't init SDK client: gRPC dial: context deadline exceeded
```

4. Restarted all nodes to resolve the errors.
5. After that, the frostfs-ir service on node1 entered a restart loop:

```
Apr 12 13:05:11 aberezin-node1 systemd[1]: frostfs-ir.service: Scheduled restart job, restart counter is at 416.
Apr 12 13:05:11 aberezin-node1 systemd[1]: Stopped FrostFS InnerRing node.
Apr 12 13:05:11 aberezin-node1 systemd[1]: Started FrostFS InnerRing node.
Apr 12 13:05:11 aberezin-node1 frostfs-ir[16283]: could not create RPC client: WS client creation: dial tcp [::1]:40332: connect: connection refused
Apr 12 13:05:11 aberezin-node1 systemd[1]: frostfs-ir.service: Main process exited, code=exited, status=1/FAILURE
Apr 12 13:05:11 aberezin-node1 systemd[1]: frostfs-ir.service: Failed with result 'exit-code'.
```

Versions

```
FrostFS Storage node
Version: v0.0.1-393-g6bf11f7c
GoVersion: go1.18.4
```

Your Environment

Virtual

anikeev-yadro added the triage label 2023-04-12 13:50:44 +00:00
Owner

frostfs-ir restarts because it cannot connect to neo-go.
Not a bug.
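
A minimal way to confirm this diagnosis on the affected node, assuming neo-go runs there as a local systemd unit (the unit name below is an assumption and may differ in your setup):

```
# Is anything listening on the WS port that frostfs-ir tries to dial ([::1]:40332)?
ss -ltn 'sport = :40332'

# State of the local neo-go sidechain node (unit name is an assumption).
systemctl status neo-go.service --no-pager
```

If nothing is listening on that port, frostfs-ir will keep exiting with "connection refused" until neo-go is reachable again.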

Owner

While it's generally not ideal for a critical service like the IR service to die, in certain cases this can be considered acceptable. IR is designed to be stateless: it doesn't store data or maintain sessions, so it can be restarted without causing data loss or system instability, and it may therefore be safe to let systemd or another supervisor automatically restart the service after it crashes. However, it's important to configure the supervisor appropriately, setting thresholds for the number of restart attempts and the delays between them to avoid resource exhaustion or other issues.
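
As an illustration, here is a minimal sketch of such supervisor tuning via a systemd drop-in; the specific limits are assumptions, not recommended project settings:

```
# Rate-limit restarts of frostfs-ir instead of letting it cycle indefinitely.
sudo mkdir -p /etc/systemd/system/frostfs-ir.service.d
sudo tee /etc/systemd/system/frostfs-ir.service.d/restart-limits.conf <<'EOF' >/dev/null
[Unit]
# Give up if the service fails 5 times within 10 minutes.
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
Restart=on-failure
# Wait 10 seconds between restart attempts instead of restarting immediately.
RestartSec=10
EOF

sudo systemctl daemon-reload
sudo systemctl restart frostfs-ir.service
```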

However, while automating the restart of a service can be convenient, it's still essential to investigate the root cause of the failures that caused the service to crash in the first place. Implementing monitoring and logging can provide visibility into the service's health and allow timely resolution of issues that may cause the service to fail repeatedly. Therefore, while it may be safe for some services to die and restart automatically, proactive monitoring and troubleshooting are still critical to ensure a reliable and robust system.
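
For the monitoring side, the restart loop is already visible in the journal (note the "restart counter is at 416" line above); a couple of standard commands are usually enough to spot the loop and find the underlying error:

```
# Current state of the unit and the reason for the last failure.
systemctl status frostfs-ir.service --no-pager

# Recent logs for the unit; here they point at the real cause:
# "could not create RPC client: WS client creation: dial tcp [::1]:40332: connect: connection refused"
journalctl -u frostfs-ir.service --since "1 hour ago" --no-pager
```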

Reference: TrueCloudLab/frostfs-node#241