RPC client may use invalidated/expired gRPC connection #301
Expected Behavior
If `c.conn` is actually invalidated/expired/broken, the client initialization should fail immediately.
Current Behavior
If the RPC client is retrieved from a pool/cache at the very moment the connection has just been broken, the client initialization reuses the invalidated connection, and the client then tries to open a stream over this "zombie" connection.
Long story short: stream initialization hangs indefinitely here, and the only cancellation comes from the client-side context being cancelled. Thus, an RPC call (like `Search`) waits for the context cancellation although it could return an error immediately, because the connection is invalidated. The `streamTimeout` parameter can't handle this, as it only governs an already opened stream; `dialTimeout` doesn't help either.
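For illustration, here is a minimal, standalone sketch of the same failure mode (not the frostfs-sdk-go code path): a lazily created grpc-go connection to an address whose packets are silently dropped (standing in for a disabled node), where opening a server stream only fails once the caller's context expires. The address, the use of the standard health `Watch` stream, and the timeout are all illustrative; `grpc.NewClient` requires grpc-go ≥ 1.63.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// grpc.NewClient does not dial eagerly, so no error is reported here even
	// though the target (a stand-in for a disabled storage node) is unreachable.
	conn, err := grpc.NewClient("203.0.113.1:8080",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Opening a server stream blocks while gRPC keeps trying to establish the
	// transport; the only thing that ends the wait is the caller's context.
	_, err = healthpb.NewHealthClient(conn).Watch(ctx, &healthpb.HealthCheckRequest{})
	fmt.Println(err) // expect a DeadlineExceeded status after ~5s
}
```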
Possible Solution
It is not enough to check that `conn != nil` (see this): we should also check that this connection is dialable, i.e. call the dialer on this connection. I don't think we can use it, because its prototype looks barely helpful for such a purpose.
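One possible shape of such a check, sketched with the plain grpc-go connectivity API (`GetState`, `Connect`, `WaitForStateChange`, all marked experimental upstream); the function name and the wait budget are made up for illustration and are not part of frostfs-sdk-go:

```go
package rpcutil

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// connUsable reports whether conn can realistically serve a new stream,
// giving the channel at most wait to (re)connect.
func connUsable(ctx context.Context, conn *grpc.ClientConn, wait time.Duration) bool {
	if conn == nil {
		return false
	}
	ctx, cancel := context.WithTimeout(ctx, wait)
	defer cancel()
	for {
		s := conn.GetState()
		switch s {
		case connectivity.Ready:
			return true
		case connectivity.Shutdown:
			return false
		case connectivity.Idle:
			// Kick an idle channel so it actually attempts to dial.
			conn.Connect()
		}
		// CONNECTING / TRANSIENT_FAILURE / just-kicked IDLE: wait for the next
		// state change, or give up when the short deadline fires.
		if !conn.WaitForStateChange(ctx, s) {
			return false
		}
	}
}
```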
Context
This problem was found while investigating an `object.search` issue. The scenario: a container with a `REP 4` policy and many objects (object size doesn't matter). If we disable one of the container nodes and perform `object.search` on this container, the request hangs; the hang is only interrupted by the request timeout (e.g. `--timeout 100s`), and we get a `deadline exceeded` error.
This happens because `search` handles the incoming request asynchronously, requesting the other container nodes to collect object IDs. To request a container node it uses `multiClient`, which relies on the client cache. So a cached client for the disabled node causes the hang while opening a server stream.
Probably, the same problem may also cause unwanted context cancellation during `object.put` to a container with an `EC X.Y` policy, as it uses roughly the same approach to put encoded object chunks.
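For illustration only, here is how a cache in the spirit of `multiClient`'s client cache could refuse to hand out a broken connection, assuming a helper like `connUsable` from the sketch above; all names, the locking scheme, and the dial options are hypothetical:

```go
package rpcutil

import (
	"context"
	"sync"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// connCache is a toy stand-in for the client cache used by multiClient.
type connCache struct {
	mu    sync.Mutex
	conns map[string]*grpc.ClientConn
}

// get returns a usable connection for addr, redialing if the cached one
// turned into a "zombie" while sitting in the cache.
func (c *connCache) get(ctx context.Context, addr string) (*grpc.ClientConn, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if conn, ok := c.conns[addr]; ok {
		if connUsable(ctx, conn, time.Second) { // see the sketch above
			return conn, nil
		}
		// Broken connection: drop it instead of letting the caller hang on
		// stream creation until its context expires.
		_ = conn.Close()
		delete(c.conns, addr)
	}

	conn, err := grpc.NewClient(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}
	if c.conns == nil {
		c.conns = make(map[string]*grpc.ClientConn)
	}
	c.conns[addr] = conn
	return conn, nil
}
```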
Regression
Perhaps not.
Your Environment
frostfs-dev-env
I believe this is likely to occur after TrueCloudLab/frostfs-node#1441.
Before this fix, requests would fail due to the default gRPC timeout, which came from gRPC's default backoff strategy. Now the requests honestly wait for the timeout we specify, which is what causes the problem.
To summarize:
Additionally, I think that when referencing code snippets, it's better to use permanent links, as regular links might become invalid later.
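As a side note on the backoff remark above: grpc-go lets a caller bound each individual connect attempt via `grpc.WithConnectParams`, which roughly restores the old "fail early" feel, although it only shortens dial attempts and does not replace a health check on the cached connection. The values below are illustrative, not the SDK's or gRPC's defaults:

```go
package rpcutil

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

// dialOpts bounds how long a single (re)connect attempt and its backoff may
// take, so a dead node is reported sooner than the per-request timeout.
var dialOpts = []grpc.DialOption{
	grpc.WithConnectParams(grpc.ConnectParams{
		Backoff: backoff.Config{
			BaseDelay:  100 * time.Millisecond,
			Multiplier: 1.6,
			Jitter:     0.2,
			MaxDelay:   5 * time.Second,
		},
		// Each individual connection attempt is abandoned after this long.
		MinConnectTimeout: 3 * time.Second,
	}),
}
```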