Introduce retry mechanism for event subscriber #685

Open
opened 2023-09-12 19:40:42 +00:00 by aarifullin · 1 comment
Member

Consider the [subscriber](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/subscriber/subscriber.go#L150) that is used to get specific events from the blockchain. The subscriber wraps the [morph client](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/constructor.go#L95).

The subscriber is able to [reconnect](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/constructor.go#L95) to the notificator endpoint if the connection has been lost or reset.

The problem arises when the endpoint list contains only one entry (`len(c.endpoints.list) == 1`): in that case [SwitchRPC](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/multi.go#L33) fails if the single endpoint is unavailable for a while. By the way, it **_may be_** fine if there are a _few_ endpoints, because there is a good chance of switching to a working endpoint.

This happens because the websocket client [constructor](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/constructor.go#L170) does not attempt to reconnect after a failure. `DialTimeout` for the WS client is used as `HandshakeTimeout`, which does not help at all, because the connection cannot be established until the peer is up.

There are two ways to fix this problem:

1. Fix the dialer for the websocket client in neo-go by passing new parameters.
2. Use backoff within frostfs-node in [newCli](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/constructor.go#L167) with a retry count and a retry interval. It is not obvious that the ws client creation can be retried after a failure, but this will work.
aarifullin added the bug, discussion, triage labels 2023-09-12 19:40:42 +00:00
aarifullin changed title from Introduce retry mechanism for morph client used by subscriber to Introduce retry mechanism for event subscriber 2023-09-12 19:41:11 +00:00
fyrchik added enhancement and removed bug labels 2023-09-13 06:40:03 +00:00
fyrchik added this to the v0.38.0 milestone 2023-09-13 08:52:20 +00:00
fyrchik modified the milestone from v0.38.0 to v0.39.0 2024-02-12 06:34:50 +00:00
fyrchik modified the milestone from v0.39.0 to v0.40.0 2024-05-14 14:12:29 +00:00
fyrchik modified the milestone from v0.40.0 to v0.41.0 2024-06-01 09:19:46 +00:00
fyrchik modified the milestone from v0.41.0 to v0.42.0 2024-06-14 07:06:52 +00:00
fyrchik modified the milestone from v0.42.0 to v0.43.0 2024-07-23 06:34:43 +00:00
fyrchik modified the milestone from v0.43.0 to v0.44.0 2024-09-30 11:51:35 +00:00
Owner

The neo-go client shouldn't be changed; retry logic should be implemented in morph only.
fyrchik modified the milestone from v0.44.0 to v0.45.0 2024-11-25 10:46:49 +00:00
Reference: TrueCloudLab/frostfs-node#685