Make rpc client stream initialization get cancelled by dial timeout #304

Merged
fyrchik merged 1 commit from aarifullin/frostfs-sdk-go:fix/grpc_client into master 2024-12-06 09:49:45 +00:00
Member
  • c.conn may be already invalidated but the rpc client can't detect this. NewStream may hang trying to open a stream with invalidated connection. Using timer with dialTimeout for NewStream fixes this problem.

Close #301

* `c.conn` may be already invalidated but the rpc client can't detect this. `NewStream` may hang trying to open a stream with invalidated connection. Using timer with `dialTimeout` for `NewStream` fixes this problem. Close #301
aarifullin added 1 commit 2024-12-04 14:06:13 +00:00
[#301] api: Call dialer for open connections too
All checks were successful
DCO / DCO (pull_request) Successful in 3m27s
Tests and linters / Tests (pull_request) Successful in 3m43s
Tests and linters / Lint (pull_request) Successful in 4m19s
e295f3afa5
* If a client instance already sets `conn`, then `dialer` should be
  called for this `conn` anyway. `conn` may turn to invalidated state
  and thus we can check it with `dialer`.

Signed-off-by: Airat Arifullin <a.arifullin@yadro.com>
aarifullin requested review from storage-core-committers 2024-12-04 14:06:37 +00:00
aarifullin requested review from storage-core-developers 2024-12-04 14:06:41 +00:00

It turns out that an additional RPC call will always be made for existing connections. Moreover, this will happen in 99% of cases. At first glance it looks very expensive.

It turns out that an additional RPC call will always be made for existing connections. Moreover, this will happen in 99% of cases. At first glance it looks very expensive.
aarifullin force-pushed fix/grpc_client from e295f3afa5 to 720dc2a1f1 2024-12-04 15:07:00 +00:00 Compare
Author
Member

It turns out that an additional RPC call will always be made for existing connections. Moreover, this will happen in 99% of cases. At first glance it looks very expensive.

I've introduced dial timeout for NewStream

> It turns out that an additional RPC call will always be made for existing connections. Moreover, this will happen in 99% of cases. At first glance it looks very expensive. I've introduced dial timeout for `NewStream`
aarifullin requested review from storage-services-committers 2024-12-04 15:09:33 +00:00
aarifullin requested review from storage-services-developers 2024-12-04 15:09:33 +00:00
aarifullin changed title from Call dialer for open connections too to Make rpc client stream initialization get cancelled by dial timeout 2024-12-04 15:14:09 +00:00
a-savchuk reviewed 2024-12-05 06:36:33 +00:00
@ -51,3 +51,3 @@
}
ctx, cancel := context.WithCancel(prm.ctx)
ctx, cancel := context.WithTimeout(prm.ctx, c.dialTimeout)
Member

I believe, the gRPC client has one context for both dialing and streaming, and there's no separate dialing step in gRPC-Go v1.63+. Could you please double-check it?

I believe, the gRPC client has one context for both [dialing](https://github.com/grpc/grpc-go/blob/317271b232677b7869576a49855b01b9f4775d67/stream.go#L352-L367) and [streaming](https://github.com/grpc/grpc-go/blob/317271b232677b7869576a49855b01b9f4775d67/stream.go#L388-L402), and there's no separate dialing step in gRPC-Go v1.63+. Could you please double-check it?
Author
Member

You was right. I didn't have a clue that this value also cancels streaming. The approach has been changed, please, check this out

You was right. I didn't have a clue that this value also cancels streaming. The approach has been changed, please, check this out
a-savchuk marked this conversation as resolved
aarifullin force-pushed fix/grpc_client from 720dc2a1f1 to 8f2c8007fd 2024-12-05 12:54:11 +00:00 Compare
dstepanov-yadro approved these changes 2024-12-05 13:05:46 +00:00
Dismissed
Member

I like it

If we incorporate this approach, we should consider whether we still need the Dial method. We currently use it to check if the target node is alive before making a request (example). As far as I know, it was another way to cope with the lack of distinction between the dialing and streaming steps

I like it If we incorporate this approach, we should consider whether we still need the [Dial](https://git.frostfs.info/TrueCloudLab/frostfs-sdk-go/src/commit/0352b5b191ad110070043ada4f8d1f0a998bb74c/client/client.go#L82-L128) method. We currently use it to check if the target node is alive before making a request ([example](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/7df3520d486555a0211f7e37ee3e0fa9a96cf92c/pkg/network/cache/multi.go#L93)). As far as I know, it was another way to cope with the lack of distinction between the dialing and streaming steps
a-savchuk approved these changes 2024-12-05 13:16:52 +00:00
Dismissed
fyrchik reviewed 2024-12-05 13:24:17 +00:00
@ -59,0 +60,4 @@
// We use a separate timer instead of context timeout because the latter
// would propagate to all subsequent read/write operations on the opened stream,
// which is not desired for the stream's lifecycle management.
dialTimeoutTimer := time.NewTimer(c.dialTimeout)
Owner

Do we also need dialTimeoutTimer.Stop() call?

Do we also need `dialTimeoutTimer.Stop()` call?
Author
Member

Indeed! Fixed

Indeed! Fixed
@ -59,0 +61,4 @@
// would propagate to all subsequent read/write operations on the opened stream,
// which is not desired for the stream's lifecycle management.
dialTimeoutTimer := time.NewTimer(c.dialTimeout)
newStreamCh := make(chan interface{})
Owner

please, use struct{} if you do nothing with the value -- it better communicates the intent and could be optimized by compiler as zero-size type.

please, use `struct{}` if you do nothing with the value -- it better communicates the intent and could be optimized by compiler as zero-size type.
Author
Member

t better communicates the intent and could be optimized by compiler as zero-size type.

Good to know! Fixed

> t better communicates the intent and could be optimized by compiler as zero-size type. Good to know! Fixed
aarifullin force-pushed fix/grpc_client from 8f2c8007fd to 2352156ee9 2024-12-05 13:31:08 +00:00 Compare
aarifullin dismissed dstepanov-yadro's review 2024-12-05 13:31:08 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

aarifullin dismissed a-savchuk's review 2024-12-05 13:31:08 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

fyrchik reviewed 2024-12-05 13:45:08 +00:00
@ -59,0 +75,4 @@
close(newStreamCh)
}()
select {
case <-dialTimeoutTimer.C:
Owner

I think we still have a race-condition here and a possible goroutine leak.

  1. times fires
  2. stream, nil is returned from NewStream
  3. context is cancelled

It is true that stream from (2) should be closed?
To prevent this we must wait <-newStreamCh and then check the stream.

I think we still have a race-condition here and a possible goroutine leak. 1. times fires 2. `stream, nil` is returned from `NewStream` 3. context is cancelled It is true that `stream` from (2) should be closed? To prevent this we must wait `<-newStreamCh` and then check the stream.
Author
Member

I have fixed something

  1. If conn successfully/unsuccessfully opens a stream: newStreamCh is closed, select goes further and the call returns err or streamWrapper
  2. If timeout expires, we check if a stream has been successfully/unsuccessfully opened at the same time. We can't wait for the channel gets closed because then this PR doesn't make sense. If it's opened, then we process like in (1). If it's still opening, then we return an error and cancel context (see cancel())

This must work out

I have fixed something 1. If conn successfully/unsuccessfully opens a stream: `newStreamCh` is closed, select goes further and the call returns `err` or `streamWrapper` 2. If timeout expires, we check if a stream has been successfully/unsuccessfully **opened** _at the same time_. We can't wait for the channel gets closed because then this PR doesn't make sense. If it's opened, then we process like in (1). If it's still _opening_, then we return an error and cancel context (see `cancel()`) This must work out
Author
Member

It is true that stream from (2) should be closed?

I suppose we don't need to close if it's successfully/unsuccessfully opened . The goal of this fix is to prevent hung open_ing_. So, if we have the result of NewStream, then we can return either opened stream or an error

> It is true that stream from (2) should be closed? I suppose we don't need to close if it's successfully/unsuccessfully opened . The goal of this fix is to prevent hung open_ing_. So, if we have the result of `NewStream`, then we can return either opened stream or an error
Owner

I suppose

Is there any way you can transform this supposition into knowledge? Please, do.

>I suppose Is there any way you can transform this supposition into knowledge? Please, do.
Author
Member

Check the changes out, please. Now it's closed if timer has expired

Check the changes out, please. Now it's closed if timer has expired
aarifullin force-pushed fix/grpc_client from 2352156ee9 to a5a399f748 2024-12-05 14:08:46 +00:00 Compare
dkirillov reviewed 2024-12-06 06:29:51 +00:00
@ -56,3 +55,1 @@
ServerStreams: info.ServerStream(),
ClientStreams: info.ClientStream(),
}, toMethodName(info))
defer cancel()
Member

Should we save cancel as field of sttreamWrapper below if invoke it in defer?

Should we save `cancel` as field of `sttreamWrapper` below if invoke it in `defer`?
Author
Member

Sure we don't. Fixed

Sure we don't. Fixed
dkirillov marked this conversation as resolved
aarifullin force-pushed fix/grpc_client from a5a399f748 to cbc59f82fb 2024-12-06 07:23:00 +00:00 Compare
fyrchik requested changes 2024-12-06 07:29:27 +00:00
Dismissed
@ -59,0 +77,4 @@
select {
case <-dialTimeoutTimer.C:
select {
case <-newStreamCh:
Owner

You have just entered <-dialTimeoutTimer.C branch (newStreamCh could be closed or not closed)
goroutine leak can still happen in the default branch, because that's where newStreamCh

You have just entered `<-dialTimeoutTimer.C` branch (`newStreamCh` could be closed or not closed) goroutine leak can still happen in the `default` branch, because that's where `newStreamCh`
Owner

I mean the only question is whether we need to close the stream.
This inner select is useless.

I mean the only question is whether we need to close the stream. This inner select is useless.
Author
Member

Sorry. When I read your comment first time:

  • times fires
  • stream, nil is returned from NewStream
  • context is cancelled

I mistook these points for "any of this" case. So, now I understand you mean what if timer has expired, but NewStream is fine.
So, check the changes out: if stream has been opened but timer - expired, we close the stream

Sorry. When I read your comment first time: > - times fires > - stream, nil is returned from NewStream > - context is cancelled I mistook these points for "any of this" case. So, now I understand you mean what if timer has expired, but `NewStream` is fine. So, check the changes out: if stream has been opened but timer - expired, we close the stream
aarifullin force-pushed fix/grpc_client from cbc59f82fb to c3bb283cce 2024-12-06 07:47:51 +00:00 Compare
aarifullin force-pushed fix/grpc_client from c3bb283cce to 7d53887170 2024-12-06 08:15:37 +00:00 Compare
aarifullin force-pushed fix/grpc_client from 7d53887170 to e08403faba 2024-12-06 08:32:45 +00:00 Compare
fyrchik approved these changes 2024-12-06 08:34:35 +00:00
dkirillov approved these changes 2024-12-06 09:06:13 +00:00
fyrchik merged commit 81c423e709 into master 2024-12-06 09:49:45 +00:00
Sign in to join this conversation.
No reviewers
TrueCloudLab/storage-services-developers
No milestone
No project
No assignees
5 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: TrueCloudLab/frostfs-sdk-go#304
No description provided.