Reconnect gRPC servers #836
Now gRPC servers will try to reconnect in case of an unavailable endpoint.

Tested by adding a new IP address to an existing interface.

On dev-env, add a new gRPC endpoint:
- FROSTFS_GRPC_2_ENDPOINT=100.100.100.1:8080

After startup, add the address to the interface:
ip a add 100.100.100.1 dev eth0

Closes #835
Force-pushed from 00fe695c7b to ccf55bd659
Force-pushed from ccf55bd659 to 52aa926dcd
Force-pushed from 52aa926dcd to 992d125421
Force-pushed from 992d125421 to df916520aa
Title changed from "WIP: grpc reconnect" to "Reconnect gRPC servers"

@ -449,18 +449,79 @@ func (c *cfg) ReadCurrentNetMap(msg *netmapV2.NetMap) error {
return nil
}
type grpcConnection struct {
Is it really a connection?
renamed to server
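For context, a plausible shape for the renamed type, assuming it bundles one endpoint's listener and server (all field names here are guesses based on this thread, not the PR's code):

```go
package srv

import (
	"net"

	"google.golang.org/grpc"
)

// grpcServer groups everything needed to (re)serve one configured
// endpoint; it is a server-side bundle, not a client connection.
type grpcServer struct {
	endpoint string
	listener net.Listener
	server   *grpc.Server
}
```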
@ -464,0 +519,4 @@
return
}
c.connections = append(c.connections[0:pos], c.connections[pos:]...)
Don't we need to clean up all resources? It seems the listener can stay open, which will prevent us from using this IP on the next reconnection. Also, c[:pos] + c[pos:] seems equal to c; do we forget a +-1 somewhere?

grpc.Server.Serve will close the listener. But ok, added an explicit Stop call.

Fixed
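For reference, a minimal sketch of the removal this thread converges on, assuming the slice holds *grpc.Server values (the real element type differs); note the pos+1 that the original append was missing:

```go
package srv

import "google.golang.org/grpc"

// cfg is a stand-in for the node's config object; the real field layout differs.
type cfg struct {
	connections []*grpc.Server
}

// removeGRPCServer stops the server at pos (grpc.Server.Stop also closes the
// listener previously handed to Serve) and removes it from the slice.
// Note the pos+1: append(c.connections[:pos], c.connections[pos:]...)
// would leave the slice unchanged, which is the off-by-one noted above.
func (c *cfg) removeGRPCServer(pos int) {
	c.connections[pos].Stop()
	c.connections = append(c.connections[:pos], c.connections[pos+1:]...)
}
```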
@ -93,1 +51,4 @@
}
for _, endpoint := range endpointsToReconnect {
scheduleReconnect(endpoint, c)
Why have you decided to use separate goroutines to handle each reconnection?
Each address tries to reconnect independently so that waiting for one address does not affect the speed of reconnecting to the others.
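A minimal sketch of that idea, with hypothetical names and retry interval (not the PR's actual code): each endpoint retries on its own timer, so a dead address cannot delay the others.

```go
package srv

import "time"

// scheduleReconnect retries a single endpoint on its own timer. One
// goroutine per endpoint keeps a slow or dead address from delaying
// retries of the healthy ones.
func scheduleReconnect(endpoint string, tryListen func(string) error, done <-chan struct{}) {
	go func() {
		t := time.NewTicker(5 * time.Second) // the interval is an assumption
		defer t.Stop()
		for {
			select {
			case <-t.C:
				if tryListen(endpoint) == nil {
					return // endpoint is serving again, stop retrying
				}
			case <-done:
				return // node is shutting down
			}
		}
	}()
}
```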
@ -94,0 +67,4 @@
case <-t.C:
c.log.Info(logs.FrostFSNodeGRPCReconnecting, zap.String("endpoint", endpoint))
var success bool
grpcconfig.IterateEndpoints(c.appCfg, func(sc *grpcconfig.Config) {
Will there be a leak if we remove the endpoint from the configuration and send SIGHUP while this goroutine is running?
Also, this can be executed concurrently with SIGHUP; how can we easily see that no data race occurs?

Well, I used ostrich tactics in this case: as far as I can see, SIGHUP handling is not thread-safe in general right now. If there is no current endpoint after SIGHUP, then the reconnect goroutine will return. Or did I misunderstand the question?

What do you mean? Do we already have places like this (non-thread-safe)? E.g. for the engine it is thread-safe.

That's true (and it is a bug), but initApp is called only during startup, so there is much less probability of receiving SIGHUP there. With gRPC servers constantly re-listening we introduce the race under normal operation. If it looks hard to add now, we can do it in #855, but IMO this is a high-priority task.

Ok, fixed

We execute reloadConfig only in state control.HealthStatus_READY; in other cases reloading is skipped. The status becomes READY only when all initAndLog calls have completed. Also, we have a special status HealthStatus_RECONFIGURING. So if we need to reload gRPC, the first step is to stop the connection and wait for completion of all background tasks. Looks like the mutex here is redundant; this functionality is already implemented on top of internals.healthStatus *atomic.Int32.

Thanks, my statement about SIGHUP thread-safety was incorrect. But the situation when the node recreates a gRPC server is not RECONFIGURING: the node is healthy. Also, a gRPC reconnect should not have to wait for a full SIGHUP. The health status CompareAndSwap strategy allows either a SIGHUP or a single gRPC reconnect to run; that is too strict. The mutex locks only the appConfig reload and allows many goroutines to read it.
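A rough sketch of the CompareAndSwap strategy that is being called too strict here; the numeric values and function names are assumptions, only control.HealthStatus_READY, HealthStatus_RECONFIGURING and the atomic.Int32 field come from the thread:

```go
package srv

import "sync/atomic"

// Hypothetical numeric values mirroring control.HealthStatus_* constants.
const (
	statusReady         int32 = 2
	statusReconfiguring int32 = 3
)

// tryReconfigure models the CompareAndSwap strategy discussed above: only
// the goroutine that flips READY -> RECONFIGURING may proceed, so a SIGHUP
// reload and a gRPC reconnect can never run at the same time.
func tryReconfigure(health *atomic.Int32, reconfigure func()) bool {
	if !health.CompareAndSwap(statusReady, statusReconfiguring) {
		return false // someone else is already reloading or reconnecting
	}
	defer health.Store(statusReady)
	reconfigure()
	return true
}
```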
Force-pushed from df916520aa to 1ae43f80f0
@ -1180,3 +1242,3 @@
c.ctxCancel()
c.done <- struct{}{}
close(c.done)
To have many waiters: closing the channel wakes every goroutine blocked on it, whereas a single send would wake only one.
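A tiny illustration (not code from the PR) of why closing the channel is the right tool when several goroutines wait on it:

```go
package srv

import "sync"

// waitAll shows why close(done) suits multiple waiters: every goroutine
// blocked on <-done is released at once, whereas done <- struct{}{}
// would release exactly one of them.
func waitAll(n int) {
	done := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-done // unblocks in all n goroutines when done is closed
		}()
	}
	close(done)
	wg.Wait()
}
```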
Force-pushed from 1ae43f80f0 to efebe91369
Force-pushed from efebe91369 to f77afa643b
Force-pushed from f77afa643b to b398d6ca88
Force-pushed from b398d6ca88 to 026c2507d4
Force-pushed from 026c2507d4 to bdf3c02b94
Force-pushed from bdf3c02b94 to fde08db5ad
@ -1122,6 +1200,8 @@ func (c *cfg) signalWatcher(ctx context.Context) {
}
func (c *cfg) reloadConfig(ctx context.Context) {
unlock := c.LockAppConfigExclusive()
Looks like the mutex here is redundant; this functionality is already implemented on top of internals.healthStatus *atomic.Int32.
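A minimal sketch of how such a lock could look; LockAppConfigExclusive is the name from the diff, while the shared variant and the field are assumptions:

```go
package srv

import "sync"

// internals is a stand-in for the node's shared state; only the lock
// around the application config is shown here.
type internals struct {
	appCfgMtx sync.RWMutex
}

// LockAppConfigExclusive is what reloadConfig takes on SIGHUP: no readers
// or other reloads run until the returned unlock is called.
func (i *internals) LockAppConfigExclusive() (unlock func()) {
	i.appCfgMtx.Lock()
	return i.appCfgMtx.Unlock
}

// LockAppConfigShared is a hypothetical read-side counterpart for the
// reconnect goroutines: many readers may hold it concurrently.
func (i *internals) LockAppConfigShared() (unlock func()) {
	i.appCfgMtx.RLock()
	return i.appCfgMtx.RUnlock
}
```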
Force-pushed from 4e3ba1e7ec to a86299ceae
Force-pushed from a86299ceae to 737466f578
Force-pushed from 737466f578 to 034b884088
Force-pushed from 034b884088 to 61da7dca24