RPC client may use invalidated/expired gRPC connection #301

Closed
opened 2024-12-03 14:24:41 +00:00 by aarifullin · 2 comments
Member

Expected Behavior

If [c.conn](https://git.frostfs.info/TrueCloudLab/frostfs-sdk-go/src/branch/master/api/rpc/client/connect.go#L16) is actually invalidated/expired/broken, the client [initialization](https://git.frostfs.info/TrueCloudLab/frostfs-sdk-go/src/branch/master/api/rpc/client/init.go#L42) should fail immediately.

Current Behavior

If the RPC client is retrieved from a pool/cache and the connection has just been broken, then the client initialization [reuses](https://git.frostfs.info/TrueCloudLab/frostfs-sdk-go/src/branch/master/api/rpc/client/init.go#L49) the invalidated connection. So the client tries to open a stream over such a "zombie" connection.

Long story short: stream initialization hangs indefinitely [here](https://github.com/grpc/grpc-go/blob/master/picker_wrapper.go#L120), and the only cancellation comes from the client side's cancelled context. Thus an RPC call (like `Search`) waits for context cancellation, although it could return an error immediately since the connection is invalidated.
The `streamTimeout` parameter can't handle this, as it only manages an already-opened stream; neither can `dialTimeout`.

Possible Solution

  1. A light prefetching request: if `conn != nil` ([see this](https://git.frostfs.info/TrueCloudLab/frostfs-sdk-go/src/branch/master/api/rpc/client/connect.go#L16)), then we should also check that this connection is dialable, i.e. call the dialer on this connection. I don't think we can use

```go
dialer func(ctx context.Context, cc grpcstd.ClientConnInterface) error
```

because its prototype looks barely helpful for this purpose:

```go
// ClientConnInterface defines the functions clients need to perform unary and
// streaming RPCs. It is implemented by *ClientConn, and is only intended to
// be referenced by generated code.
type ClientConnInterface interface {
	// Invoke performs a unary RPC and returns after the response is received
	// into reply.
	Invoke(ctx context.Context, method string, args any, reply any, opts ...CallOption) error
	// NewStream begins a streaming RPC.
	NewStream(ctx context.Context, desc *StreamDesc, method string, opts ...CallOption) (ClientStream, error)
}
```
  2. The server side should launch the [healthcheck service](https://github.com/grpc/grpc/blob/master/doc/health-checking.md) and the client should check whether the server is healthy.

Context

This problem was found while researching an `object.search` issue. The scenario: a `REP 4` container with many objects (object size doesn't matter). If we disable one of the container nodes and try to perform `object.search` on this container, the request hangs until the deadline is exceeded: with, for instance, `--timeout 100s`, we eventually get a `deadline exceeded` error.
This happens because `search` handles the incoming request asynchronously, requesting the other container nodes to collect object IDs. To request a container node it uses `multiClient`, which uses the client cache. So a cached client for the disabled node causes the hang while opening a server stream.

Probably, the same problem may also cause unwanted context cancellation during `object.put` to a container with an `EC X.Y` policy, as it uses roughly the same approach to put encoded object chunks.

Regression

Perhaps not.

Your Environment

`frostfs-dev-env`

aarifullin added the
bug
discussion
labels 2024-12-03 14:25:01 +00:00
aarifullin was assigned by fyrchik 2024-12-04 06:05:25 +00:00
Member

I believe this is likely to occur after [TrueCloudLab/frostfs-node#1441](https://git.frostfs.info/TrueCloudLab/frostfs-node/pulls/1441).

Before [this fix](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/281d65435e05c6b6b37224c46c755a939b486f99/pkg/network/cache/multi.go#L73), requests would fail due to the default gRPC timeout, which was a result of gRPC's default backoff strategy. Now a request honestly waits for the timeout we specify, which is what exposes the problem.

To summarize:

  • Before the fix, a client didn't behave as expected, i.e. it waited for another timeout.
  • After the fix, a client waits for the timeout we specify, which is the desired behavior.
  • This is now a problem because we use cached connections, which may become "zombie" connections.
Member

Additionally, I think that when referencing code snippets, it's better to use permanent links, as regular links might become invalid later

Reference: TrueCloudLab/frostfs-sdk-go#301