Introduce retry mechanism for event subscriber #685


Open

opened 2023-09-12 19:40:42 +00:00 by aarifullin · 1 comment

aarifullin commented

2023-09-12 19:40:42 +00:00

Member

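Consider the [subscriber](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/subscriber/subscriber.go#L150) that is used to get specific events from the blockchain. The subscriber wraps the [morph client](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/constructor.go#L95).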

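The subscriber is able to [reconnect](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/constructor.go#L95) to the notificator endpoint if the connection has been lost/reset.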

The problem arises when the endpoint list has length 1 (`len(c.endpoints.list) == 1`): [SwitchRPC](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/multi.go#L33) fails if the single endpoint is unavailable for a while. With several endpoints this **_may be_** fine, because there is a good chance of switching to a working endpoint.
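The failure mode can be illustrated with a minimal sketch. It is not the frostfs-node code: `switchEndpoint`, `dialFunc`, `errUnavailable` and the endpoint addresses are hypothetical names used only for illustration, and the sketch assumes each endpoint is tried exactly once per switch.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable is a hypothetical error for an endpoint that is down.
var errUnavailable = errors.New("endpoint is unavailable")

// dialFunc is a hypothetical stand-in for creating a websocket client.
type dialFunc func(addr string) error

// switchEndpoint tries every endpoint exactly once and returns the first
// address that could be dialed. It never retries an endpoint.
func switchEndpoint(endpoints []string, dial dialFunc) (string, bool) {
	for _, addr := range endpoints {
		if err := dial(addr); err != nil {
			continue // move on to the next endpoint, if there is one
		}
		return addr, true
	}
	return "", false
}

func main() {
	// Assumed state: the first node is temporarily down.
	down := map[string]bool{"ws://node1:30333/ws": true}
	dial := func(addr string) error {
		if down[addr] {
			return errUnavailable
		}
		return nil
	}

	// A single endpoint that is down: the switch fails immediately.
	_, ok := switchEndpoint([]string{"ws://node1:30333/ws"}, dial)
	fmt.Println("single endpoint switched:", ok)

	// Two endpoints: the second one can take over.
	addr, ok := switchEndpoint([]string{"ws://node1:30333/ws", "ws://node2:30333/ws"}, dial)
	fmt.Println("two endpoints switched:", ok, addr)
}
```

With one endpoint there is nothing to fall back to, so a temporary outage turns into a failed switch unless the dial itself is retried.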

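This happens because the websocket client [constructor](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/constructor.go#L170) does not attempt to reconnect after a failure. `DialTimeout` for the WS client is only used as `HandshakeTimeout`, which does not help here, because the connection cannot be established until the peer is back online.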

There are two ways to fix this problem:

1. Fix the dialer for the websocket client in neo-go by passing new parameters.
2. Use backoff within frostfs-node in [newCli](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/constructor.go#L167) with a retry count and a retry interval. It is not obvious that the ws client creation can be retried after a failure, but this should work.
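The second option could look roughly like the sketch below. It is only an illustration under assumptions: `retryWithBackoff`, its parameters and `errEndpointDown` are hypothetical, the `create` callback stands in for the ws client creation done in `newCli`, and doubling the interval after each failure is one possible backoff policy, not something the issue prescribes.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errEndpointDown is a hypothetical error returned while the peer is offline.
var errEndpointDown = errors.New("endpoint is unavailable")

// retryWithBackoff calls create up to maxAttempts times. After each failed
// attempt except the last one it sleeps for interval and then doubles it.
func retryWithBackoff(maxAttempts int, interval time.Duration, create func() error) error {
	if maxAttempts < 1 {
		return errors.New("maxAttempts must be positive")
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = create(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break // no point in sleeping after the final attempt
		}
		time.Sleep(interval)
		interval *= 2
	}
	return fmt.Errorf("all %d attempts failed: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Assumed scenario: the peer comes back online on the third attempt.
	err := retryWithBackoff(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errEndpointDown
		}
		return nil
	})
	fmt.Println("attempts:", calls, "error:", err)
}
```

Bounding the number of attempts keeps a permanently dead endpoint from blocking the caller forever, while the growing interval avoids hammering a peer that is still starting up.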


👍 1

aarifullin added the bug, discussion, triage labels 2023-09-12 19:40:42 +00:00

aarifullin changed title from ~~Introduce retry mechanism for morph client used by subscriber~~ to Introduce retry mechanism for event subscriber

2023-09-12 19:41:11 +00:00