morph: Fail if there is no events #1015

Closed
dstepanov-yadro wants to merge 2 commits from dstepanov-yadro/frostfs-node:fix/morph_reconnect into master

There was the following situation:

  1. neo-go was restarted on node 2
  2. frostfs-storage on node 2 switched to the neo-go instance on node 1 and received events from it for two minutes
  3. two minutes later, frostfs-storage on node 2 tried, by timer, to reconnect to the neo-go instance on its own node 2, which has a higher priority
  4. the connection from frostfs-storage on node 2 to neo-go on node 2 was somehow only partially established (neo-go restarted 8 times at 60-second intervals)
  5. as a result, we ended up in a state where no events were received

Only logs from node 2 and metrics are left.

I haven't found any code issues, and I can't reproduce this bug in a virtual or hardware environment. So I decided on the following fix: if a node (or IR) doesn't get any neo-go events for 20 minutes, this is treated as a connection failure and the node fails.

dstepanov-yadro force-pushed fix/morph_reconnect from b114542e9f to bb78b96830 2024-03-01 07:41:00 +00:00 Compare
dstepanov-yadro added 1 commit 2024-03-01 07:57:34 +00:00
[#1015] morph: Resolve funlen linter
All checks were successful
Vulncheck / Vulncheck (pull_request) Successful in 2m58s
DCO action / DCO (pull_request) Successful in 2m46s
Build / Build Components (1.21) (pull_request) Successful in 3m50s
Build / Build Components (1.20) (pull_request) Successful in 3m59s
Tests and linters / Staticcheck (pull_request) Successful in 5m32s
Tests and linters / Lint (pull_request) Successful in 5m58s
Tests and linters / Tests (1.20) (pull_request) Successful in 8m17s
Tests and linters / Tests with -race (pull_request) Successful in 8m34s
Tests and linters / Tests (1.21) (pull_request) Successful in 8m47s
35bdb69b78
Signed-off-by: Dmitrii Stepanov <d.stepanov@yadro.com>
fyrchik requested changes 2024-03-01 08:03:30 +00:00
fyrchik left a comment
Owner

We can't just stop: no events is an expected situation in case of split brain.
WebSocket supports some form of healthchecks (https://www.rfc-editor.org/rfc/rfc6455#section-5.5.2); can we try using something like this here?
Author
Member

wrong fix
dstepanov-yadro closed this pull request 2024-03-01 09:38:47 +00:00

Pull request closed

Reference: TrueCloudLab/frostfs-node#1015