Subscriber's Close() seems to work incorrectly #710

Closed
opened 2023-09-28 09:27:42 +00:00 by aarifullin · 1 comment
Member

Expected Behavior

  1. [(*subscriber).Close()](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/subscriber/subscriber.go#L116) is synchronous (but we can discuss this point).
  2. [(*Client).UnsubscribeAll()](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/notifications.go#L95) does not invoke [c.client.UnsubscribeAll()](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/notifications.go#L103) while an RPC switch is in progress.

Current Behavior

Consider the situation when a subscriber instance fails to subscribe for notifications and invokes `Close()` after checking the error:

```golang
sub, err := subscriber.New(context.Background(), params)
if err != nil {
	panic("failed to create a subscriber") // or return and exit; or just exit
}

if err := sub.BlockNotifications(); err != nil {
	sub.Close()                  // it is asynchronous
	panic("failed to subscribe") // or return and exit; or just exit
}
```

There is a chance that the panic occurs earlier than `UnsubscribeAll()`:
`sub.Close()` -> `c.client.Close()` -> `close(c.closeChan)`, while

```golang
func (c *Client) closeWaiter(ctx context.Context) {
	select {
	case <-ctx.Done():
	case <-c.closeChan:
	}
	_ = c.UnsubscribeAll() // may be skipped or interrupted if we do not wait for it to finish
	c.close()
}
```

In this case neo-go will never get the unsubscribe request, but it will fail to "pong" the client once it is off and will unsubscribe it on its own after some time.

Also, we have a problem within `_ = c.UnsubscribeAll()`. If `sub.BlockNotifications()` fails because of a broken/unstable connection, the subscriber can try to resubscribe for events by [switching rpc](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/subscriber/subscriber.go#L233). This leads to the point where the subscriber uses a newly established websocket connection "under the hood", and `c.client.UnsubscribeAll()` may be sent to _nowhere_. That happens because `c.client.UnsubscribeAll()` does not check whether the connection is being re-established at that moment.

Possible Solution

For `Close()`: introduce a `closeDone` channel and wait for it to be closed (see [#707](https://git.frostfs.info/TrueCloudLab/frostfs-node/pulls/707/files)).

For `UnsubscribeAll`: add a loop like `for c.switchIsActive.Load() {}` before this [invocation](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/notifications.go#L103).

Your Environment

This can be reproduced in devenv - one just needs to set up such a scenario.

aarifullin added the bug, discussion, triage labels 2023-09-28 09:27:42 +00:00
fyrchik added this to the v0.38.0 milestone 2023-10-02 12:46:26 +00:00
Author
Member

I have reconsidered the problem and came to the following conclusion:

  1. I was right about this point:

> In this case neo-go will never get the unsubscribe request

But neo-go really deals with that very well - it [unsubscribes](https://github.com/nspcc-dev/neo-go/blob/master/pkg/rpcclient/wsclient.go#L565) _dead_ subscribers. So, a hanging subscriber only exists for the pong interval, which lasts a few seconds. If neo-go is full of event subscribers, the client can request a subscription again after some time (once neo-go has wiped out the hanging ones).

  2. About

> add some loop like for c.switchIsActive.Load() {} before this

I missed the fact that the method locks the `switch` [mutex](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/morph/client/notifications.go#L99).

So, the problem is no longer relevant.

Reference: TrueCloudLab/frostfs-node#710