Reconnect gRPC servers #836

Merged
fyrchik merged 2 commits from dstepanov-yadro/frostfs-node:feat/grpc_init_lazy into master 2023-12-18 07:46:09 +00:00

Now gRPC servers will try to reconnect if an endpoint is unavailable.

Tested by adding a new IP address to an existing interface:

  1. On dev-env add a new gRPC endpoint: `- FROSTFS_GRPC_2_ENDPOINT=100.100.100.1:8080`

  2. On startup:

```
2023-12-04T08:31:49.403Z        info    frostfs-node/main.go:75 initializing gRPC service...
2023-12-04T08:31:49.404Z        error   frostfs-node/grpc.go:33 can't listen gRPC endpoint      {"error": "listen tcp 100.100.100.1:8080: bind: cannot assign requested address"}
```

  3. After `ip a add 100.100.100.1 dev eth0`:

```
2023-12-04T08:32:49.404Z        warn    frostfs-node/grpc.go:100        failed to reconnect gRPC server {"next_try_in": 60}
2023-12-04T08:33:49.404Z        info    frostfs-node/grpc.go:68 reconnecting gRPC server...     {"endpoint": "100.100.100.1:8080"}
2023-12-04T08:33:49.404Z        info    frostfs-node/netmap.go:168      register object service {"endpoint": "100.100.100.1:8080"}
2023-12-04T08:33:49.404Z        info    frostfs-node/accounting.go:35   register accounting service     {"endpoint": "100.100.100.1:8080"}
2023-12-04T08:33:49.404Z        info    frostfs-node/container.go:43    register container service      {"endpoint": "100.100.100.1:8080"}
2023-12-04T08:33:49.404Z        info    frostfs-node/session.go:64      register session service        {"endpoint": "100.100.100.1:8080"}
2023-12-04T08:33:49.404Z        info    frostfs-node/object.go:212      register object service {"endpoint": "100.100.100.1:8080"}
2023-12-04T08:33:49.404Z        info    frostfs-node/tree.go:69 register tree service   {"endpoint": "100.100.100.1:8080"}
2023-12-04T08:33:49.405Z        info    frostfs-node/grpc.go:97 gRPC server reconnected successfully    {"endpoint": "100.100.100.1:8080"}
2023-12-04T08:33:49.405Z        info    frostfs-node/grpc.go:170        start listening endpoint        {"service": "gRPC", "endpoint": "100.100.100.1:8080"}
```
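
For illustration, a minimal sketch of the retry behaviour shown in the logs above (hypothetical names such as `retryListen`; the real implementation lives in frostfs-node/grpc.go per the log paths, and retries every 60 seconds): keep trying to bind the endpoint on a timer, and only start serving once the bind succeeds.

```go
package main

import (
	"log"
	"net"
	"time"
)

// retryListen keeps trying to bind endpoint on a fixed interval until it
// succeeds or done is closed; on success it hands the listener to serve.
func retryListen(endpoint string, every time.Duration, done <-chan struct{}, serve func(net.Listener)) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			log.Printf("reconnecting gRPC server... endpoint=%s", endpoint)
			lis, err := net.Listen("tcp", endpoint)
			if err != nil {
				log.Printf("failed to reconnect gRPC server, next_try_in=%s: %v", every, err)
				continue
			}
			log.Printf("gRPC server reconnected successfully, endpoint=%s", endpoint)
			go serve(lis)
			return
		case <-done:
			return
		}
	}
}

func main() {
	done := make(chan struct{})
	// 100.100.100.1:8080 mirrors the test above; binding fails until the
	// address is added to an interface. The node uses a 60 s interval; a
	// shorter one is used here only for the demo.
	go retryListen("100.100.100.1:8080", 2*time.Second, done, func(lis net.Listener) {
		defer lis.Close()
		// In the real node, all services are registered on a grpc.Server
		// and grpc.Server.Serve(lis) is called here.
	})
	time.Sleep(10 * time.Second)
	close(done)
}
```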

Close #835

dstepanov-yadro force-pushed feat/grpc_init_lazy from 00fe695c7b to ccf55bd659 2023-12-01 09:55:03 +00:00 Compare
dstepanov-yadro force-pushed feat/grpc_init_lazy from ccf55bd659 to 52aa926dcd 2023-12-01 11:05:02 +00:00 Compare
dstepanov-yadro force-pushed feat/grpc_init_lazy from 52aa926dcd to 992d125421 2023-12-01 12:40:27 +00:00 Compare
dstepanov-yadro force-pushed feat/grpc_init_lazy from 992d125421 to df916520aa 2023-12-01 13:05:41 +00:00 Compare
dstepanov-yadro changed title from WIP: grpc reconnect to Reconnect gRPC servers 2023-12-01 13:10:34 +00:00
dstepanov-yadro requested review from storage-core-committers 2023-12-01 13:10:41 +00:00
dstepanov-yadro requested review from storage-core-developers 2023-12-01 13:10:43 +00:00
acid-ant approved these changes 2023-12-01 13:41:34 +00:00
fyrchik reviewed 2023-12-07 14:45:04 +00:00
@@ -449,18 +449,79 @@ func (c *cfg) ReadCurrentNetMap(msg *netmapV2.NetMap) error {
return nil
}
type grpcConnection struct {

Is it really a _connection_?
dstepanov-yadro (Poster, Collaborator):

renamed to server
fyrchik marked this conversation as resolved
@@ -464,0 +519,4 @@
return
}
c.connections = append(c.connections[0:pos], c.connections[pos:]...)

Don't we need to clean up all resources (it seems `listener` can be open, this will prevent us from using this IP on the next reconnection)?

Also `c[:pos] + c[pos:]` seems to equal `c`, do we forget a +-1 somewhere?
dstepanov-yadro (Poster, Collaborator):

> Don't we need to clean up all resources (it seems `listener` can be open, this will prevent us from using this IP on the next reconnection)?

`grpc.Server.Serve` will close the listener:

```
func (s *Server) Serve(lis net.Listener) error {
	// ...
	ls := &listenSocket{Listener: lis}
	s.lis[ls] = true

	defer func() {
		s.mu.Lock()
		if s.lis != nil && s.lis[ls] {
			ls.Close() // <-- closes the listener
			delete(s.lis, ls)
		}
		s.mu.Unlock()
	}()
```

But ok, added an explicit `Stop` call.
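
As an aside, a hedged sketch of what an explicit teardown could look like (hypothetical helper, not the PR's exact code): `grpc.Server.Stop` closes the listeners it is serving, and an extra `Close` on the raw listener covers the case where `Serve` was never reached.

```go
package grpcutil

import (
	"net"

	"google.golang.org/grpc"
)

// teardown releases the server's resources before a reconnect is scheduled:
// Stop closes all listeners passed to Serve and all open connections, and the
// explicit Close covers a listener that was never handed to Serve.
func teardown(srv *grpc.Server, lis net.Listener) {
	if srv != nil {
		srv.Stop()
	}
	if lis != nil {
		_ = lis.Close() // ignore the error if Serve already closed it
	}
}
```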
dstepanov-yadro (Poster, Collaborator):

> Also `c[:pos] + c[pos:]` seems to equal `c`, do we forget a +-1 somewhere?

Fixed
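
For reference, a minimal sketch of the corrected removal as a generic helper (illustrative only): the `+1` on the right-hand side is exactly what the original line was missing.

```go
package grpcutil

// removeAt drops the element at index pos. The line under review,
// append(s[:pos], s[pos:]...), rebuilds the same slice; skipping the
// element requires pos+1 on the right-hand side.
func removeAt[T any](s []T, pos int) []T {
	return append(s[:pos], s[pos+1:]...)
}
```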
fyrchik marked this conversation as resolved
@@ -93,1 +51,4 @@
}
for _, endpoint := range endpointsToReconnect {
scheduleReconnect(endpoint, c)

Why have you decided to use separate goroutines to handle each reconnection?
dstepanov-yadro (Poster, Collaborator):

Each address tries to reconnect independently so that waiting for one address does not affect the speed of reconnecting to the others.
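
A small sketch of that design (hypothetical helper names; the diff above calls `scheduleReconnect(endpoint, c)` for each endpoint): every unavailable endpoint gets its own goroutine, so a long wait on one address never delays the others.

```go
package grpcutil

// reconnectAll retries each unavailable endpoint in its own goroutine.
// retry is assumed to block until the endpoint is serving again (or the
// node is shutting down), like scheduleReconnect in the diff above.
func reconnectAll(endpoints []string, retry func(endpoint string)) {
	for _, endpoint := range endpoints {
		endpoint := endpoint // capture the per-iteration value (pre-Go 1.22)
		go retry(endpoint)
	}
}
```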
fyrchik marked this conversation as resolved
@@ -94,0 +67,4 @@
case <-t.C:
c.log.Info(logs.FrostFSNodeGRPCReconnecting, zap.String("endpoint", endpoint))
var success bool
grpcconfig.IterateEndpoints(c.appCfg, func(sc *grpcconfig.Config) {

Will there be a leak if we remove the endpoint from the configuration and send SIGHUP while this goroutine is running?

Also, this can be executed concurrently with SIGHUP, how can we easily see that no data race occurs?
dstepanov-yadro (Poster, Collaborator):

Well, I used ostrich tactics in this case: as I see it, SIGHUP handling is not thread-safe in general right now.
dstepanov-yadro (Poster, Collaborator):

> Will there be a leak if we remove the endpoint from the configuration and send SIGHUP while this goroutine is running?

If there is no current endpoint after SIGHUP, then the reconnect goroutine will return. Or did I misunderstand the question?

> as I see it, SIGHUP handling is not thread-safe in general right now

What do you mean? Do we already have places like this (non-thread-safe)? E.g. for the engine it is thread-safe.
dstepanov-yadro (Poster, Collaborator):
```
func initApp(ctx context.Context, c *cfg) {
	c.wg.Add(1)
	go func() {
		c.signalWatcher(ctx) // <-- here the app starts to accept signals; the config may be updated concurrently with the initAndLog calls below
		c.wg.Done()
	}()

	initAndLog(c, ...)
	initAndLog(c, ...)
	initAndLog(c, ...)
	initAndLog(c, ...)
```

That's true (and it is a bug), but `initApp` is called only during startup, so there is much less probability of receiving SIGHUP there.
With grpc servers constantly relistening we introduce the race under normal operation.

If it looks hard to add now, we can do it in #855, but IMO this is a high-priority task.
dstepanov-yadro (Poster, Collaborator):

Ok, fixed
Collaborator:

We execute `reloadConfig` only for state control.HealthStatus_READY; in other cases we skip reloading. The status will be READY only when all `initAndLog` calls have completed. Also, we have a special status HealthStatus_RECONFIGURING. So if we need to reload grpc, the first step is to stop the connection and wait for completion of all background tasks. Looks like the mutex here is redundant. This functionality is already implemented on top of `internals.healthStatus *atomic.Int32`.
dstepanov-yadro (Poster, Collaborator):

Thanks, my statement about SIGHUP thread-safety was incorrect.

But the situation when the node recreates a gRPC server is not RECONFIGURING: the node is healthy. Also a gRPC reconnect should not have to wait for a full SIGHUP.

The health status `CompareAndSwap` strategy allows running either SIGHUP or one gRPC reconnect. It is too strict.

The mutex locks only the appConfig reload and allows many goroutines to read it.
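
A minimal sketch of that locking scheme, assuming an RWMutex underneath (`LockAppConfigExclusive` appears in the diff further down; `LockAppConfigShared` is a hypothetical read-side counterpart): SIGHUP takes the write lock for the config reload, while reconnect goroutines take only the read lock and can run in parallel.

```go
package main

import (
	"fmt"
	"sync"
)

// appConfigGuard sketches the idea: the config reload on SIGHUP is exclusive,
// while readers such as reconnect goroutines may run concurrently.
type appConfigGuard struct {
	mu sync.RWMutex
}

// LockAppConfigExclusive is taken by the SIGHUP handler around reloadConfig.
func (g *appConfigGuard) LockAppConfigExclusive() (unlock func()) {
	g.mu.Lock()
	return g.mu.Unlock
}

// LockAppConfigShared is taken by goroutines that only read the config,
// e.g. a reconnect loop iterating over the configured gRPC endpoints.
func (g *appConfigGuard) LockAppConfigShared() (unlock func()) {
	g.mu.RLock()
	return g.mu.RUnlock
}

func main() {
	var g appConfigGuard

	unlock := g.LockAppConfigShared()
	fmt.Println("reading gRPC endpoints from the app config...")
	unlock()

	unlock = g.LockAppConfigExclusive()
	fmt.Println("reloading the app config on SIGHUP...")
	unlock()
}
```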
dstepanov-yadro force-pushed feat/grpc_init_lazy from df916520aa to 1ae43f80f0 2023-12-07 16:48:16 +00:00 Compare
dstepanov-yadro reviewed 2023-12-07 16:58:27 +00:00
@@ -1180,3 +1242,3 @@
c.ctxCancel()
c.done <- struct{}{}
close(c.done)
dstepanov-yadro (Poster, Collaborator):

To have many waiters.
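
This is the standard Go idiom sketched below: a send on `c.done` would wake a single receiver (and block if nobody is receiving), while `close(c.done)` releases every goroutine blocked on `<-c.done`.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	done := make(chan struct{})
	var wg sync.WaitGroup

	// Several independent waiters, e.g. shutdown paths of different services.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			<-done // every waiter unblocks once done is closed
			fmt.Println("waiter", id, "finished")
		}(i)
	}

	close(done) // releases all waiters at once; a single send would wake only one
	wg.Wait()
}
```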
dstepanov-yadro force-pushed feat/grpc_init_lazy from 1ae43f80f0 to efebe91369 2023-12-07 17:03:01 +00:00 Compare
dstepanov-yadro force-pushed feat/grpc_init_lazy from efebe91369 to f77afa643b 2023-12-07 17:04:11 +00:00 Compare
dstepanov-yadro force-pushed feat/grpc_init_lazy from f77afa643b to b398d6ca88 2023-12-07 17:05:17 +00:00 Compare
dstepanov-yadro force-pushed feat/grpc_init_lazy from b398d6ca88 to 026c2507d4 2023-12-11 10:39:23 +00:00 Compare
dstepanov-yadro force-pushed feat/grpc_init_lazy from 026c2507d4 to bdf3c02b94 2023-12-11 15:32:55 +00:00 Compare
dstepanov-yadro force-pushed feat/grpc_init_lazy from bdf3c02b94 to fde08db5ad 2023-12-11 15:38:52 +00:00 Compare
aarifullin approved these changes 2023-12-12 08:50:53 +00:00
acid-ant reviewed 2023-12-12 10:11:24 +00:00
@@ -1122,6 +1200,8 @@ func (c *cfg) signalWatcher(ctx context.Context) {
}
func (c *cfg) reloadConfig(ctx context.Context) {
unlock := c.LockAppConfigExclusive()
Collaborator:

Looks like the mutex here is redundant. This functionality is already implemented on top of `internals.healthStatus *atomic.Int32`.
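
For context, a sketch of the gate referred to here (hypothetical numeric values standing in for the control.HealthStatus_* codes): a `CompareAndSwap` from READY to RECONFIGURING admits exactly one reconfiguration at a time, which is also why the poster above considers it too strict for per-endpoint gRPC reconnects.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical values standing in for the control.HealthStatus_* codes.
const (
	statusReady         int32 = 2
	statusReconfiguring int32 = 3
)

// tryReconfigure admits at most one reconfiguration at a time: it flips the
// health status READY -> RECONFIGURING, runs fn, then restores READY.
func tryReconfigure(health *atomic.Int32, fn func()) bool {
	if !health.CompareAndSwap(statusReady, statusReconfiguring) {
		return false // another reload or reconnect is already in progress
	}
	defer health.Store(statusReady)
	fn()
	return true
}

func main() {
	var health atomic.Int32
	health.Store(statusReady)
	ok := tryReconfigure(&health, func() { fmt.Println("reloading config...") })
	fmt.Println("reconfigured:", ok)
}
```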
dstepanov-yadro force-pushed feat/grpc_init_lazy from 4e3ba1e7ec to a86299ceae 2023-12-12 14:09:55 +00:00 Compare
dstepanov-yadro force-pushed feat/grpc_init_lazy from a86299ceae to 737466f578 2023-12-12 14:11:18 +00:00 Compare
dstepanov-yadro force-pushed feat/grpc_init_lazy from 737466f578 to 034b884088 2023-12-12 14:58:20 +00:00 Compare
elebedeva approved these changes 2023-12-13 12:52:50 +00:00
dstepanov-yadro force-pushed feat/grpc_init_lazy from 034b884088 to 61da7dca24 2023-12-14 13:39:32 +00:00 Compare
aarifullin approved these changes 2023-12-14 13:45:59 +00:00
fyrchik merged commit 61da7dca24 into master 2023-12-18 07:46:09 +00:00
Reference: TrueCloudLab/frostfs-node#836