Reconnect gRPC servers #836
Now gRPC servers will try to reconnect in case of an unavailable endpoint.

Tested by adding a new IP address to an existing interface.

On dev-env, add a new gRPC endpoint:
- FROSTFS_GRPC_2_ENDPOINT=100.100.100.1:8080

After startup, add the address to the interface:
ip a add 100.100.100.1 dev eth0

Closes #835
Force-pushed from 00fe695c7b to ccf55bd659
Force-pushed from ccf55bd659 to 52aa926dcd
Force-pushed from 52aa926dcd to 992d125421
Force-pushed from 992d125421 to df916520aa
Title changed from "WIP: grpc reconnect" to "Reconnect gRPC servers"

@ -449,18 +449,79 @@ func (c *cfg) ReadCurrentNetMap(msg *netmapV2.NetMap) error {
return nil
}
type grpcConnection struct {
Is it really a connection?
renamed to server
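For context, a plausible shape for the renamed type, assuming it bundles one endpoint's listener and server (all field names here are guesses based on this thread, not the PR's code):

```go
package srv

import (
	"net"

	"google.golang.org/grpc"
)

// grpcServer groups everything needed to (re)serve one configured
// endpoint; it is a server-side bundle, not a client connection.
type grpcServer struct {
	endpoint string
	listener net.Listener
	server   *grpc.Server
}
```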
@ -464,0 +519,4 @@
return
}
c.connections = append(c.connections[0:pos], c.connections[pos:]...)
Don't we need to clean up all resources? It seems the listener can stay open, which will prevent us from using this IP on the next reconnection. Also, c[:pos] + c[pos:] seems equal to c; do we forget a +-1 somewhere?

grpc.Server.Serve will close the listener. But ok, added an explicit Stop call.

Fixed
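For reference, a minimal sketch of the removal this thread converges on, assuming the slice holds *grpc.Server values (the real element type differs); note the pos+1 that the original append was missing:

```go
package srv

import "google.golang.org/grpc"

// cfg is a stand-in for the node's config object; the real field layout differs.
type cfg struct {
	connections []*grpc.Server
}

// removeGRPCServer stops the server at pos (grpc.Server.Stop also closes the
// listener previously handed to Serve) and removes it from the slice.
// Note the pos+1: append(c.connections[:pos], c.connections[pos:]...)
// would leave the slice unchanged, which is the off-by-one noted above.
func (c *cfg) removeGRPCServer(pos int) {
	c.connections[pos].Stop()
	c.connections = append(c.connections[:pos], c.connections[pos+1:]...)
}
```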
@ -93,1 +51,4 @@
}
for _, endpoint := range endpointsToReconnect {
scheduleReconnect(endpoint, c)
Why have you decided to use separate goroutines to handle each reconnection?
Each address tries to reconnect independently so that waiting for one address does not affect the speed of reconnecting to the others.
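A minimal sketch of that idea, with hypothetical names and retry interval (not the PR's actual code): each endpoint retries on its own timer, so a dead address cannot delay the others.

```go
package srv

import "time"

// scheduleReconnect retries a single endpoint on its own timer. One
// goroutine per endpoint keeps a slow or dead address from delaying
// retries of the healthy ones.
func scheduleReconnect(endpoint string, tryListen func(string) error, done <-chan struct{}) {
	go func() {
		t := time.NewTicker(5 * time.Second) // the interval is an assumption
		defer t.Stop()
		for {
			select {
			case <-t.C:
				if tryListen(endpoint) == nil {
					return // endpoint is serving again, stop retrying
				}
			case <-done:
				return // node is shutting down
			}
		}
	}()
}
```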
@ -94,0 +67,4 @@
case <-t.C:
c.log.Info(logs.FrostFSNodeGRPCReconnecting, zap.String("endpoint", endpoint))
var success bool
grpcconfig.IterateEndpoints(c.appCfg, func(sc *grpcconfig.Config) {
Will there be a leak if we remove the endpoint from the configuration and send SIGHUP while this goroutine is running?
Also, this can be executed concurrently with SIGHUP; how can we easily see that no data race occurs?

Well, I used ostrich tactics in this case: as far as I can see, SIGHUP handling is not thread-safe in general right now. If there is no current endpoint after SIGHUP, then the reconnect goroutine will return. Or did I misunderstand the question?

What do you mean? Do we already have places like this (non-thread-safe)? E.g. for the engine it is thread-safe.

That's true (and it is a bug), but initApp is called only during startup, so there is much less probability of receiving SIGHUP there. With gRPC servers constantly re-listening we introduce the race under normal operation. If it looks hard to add now, we can do it in #855, but IMO this is a high-priority task.

Ok, fixed

We execute reloadConfig only in state control.HealthStatus_READY; in other cases reloading is skipped. The status becomes READY only when all initAndLog calls have completed. Also, we have a special status HealthStatus_RECONFIGURING. So if we need to reload gRPC, the first step is to stop the connection and wait for completion of all background tasks. Looks like the mutex here is redundant; this functionality is already implemented on top of internals.healthStatus *atomic.Int32.

Thanks, my statement about SIGHUP thread-safety was incorrect. But the situation when the node recreates a gRPC server is not RECONFIGURING: the node is healthy. Also, a gRPC reconnect should not have to wait for a full SIGHUP. The health status CompareAndSwap strategy allows either a SIGHUP or a single gRPC reconnect to run; that is too strict. The mutex locks only the appConfig reload and allows many goroutines to read it.
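A rough sketch of the CompareAndSwap strategy that is being called too strict here; the numeric values and function names are assumptions, only control.HealthStatus_READY, HealthStatus_RECONFIGURING and the atomic.Int32 field come from the thread:

```go
package srv

import "sync/atomic"

// Hypothetical numeric values mirroring control.HealthStatus_* constants.
const (
	statusReady         int32 = 2
	statusReconfiguring int32 = 3
)

// tryReconfigure models the CompareAndSwap strategy discussed above: only
// the goroutine that flips READY -> RECONFIGURING may proceed, so a SIGHUP
// reload and a gRPC reconnect can never run at the same time.
func tryReconfigure(health *atomic.Int32, reconfigure func()) bool {
	if !health.CompareAndSwap(statusReady, statusReconfiguring) {
		return false // someone else is already reloading or reconnecting
	}
	defer health.Store(statusReady)
	reconfigure()
	return true
}
```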
Force-pushed from df916520aa to 1ae43f80f0
@ -1180,3 +1242,3 @@
c.ctxCancel()
c.done <- struct{}{}
close(c.done)
To have many waiters: closing the channel wakes every goroutine blocked on it, whereas a single send would wake only one.
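A tiny illustration (not code from the PR) of why closing the channel is the right tool when several goroutines wait on it:

```go
package srv

import "sync"

// waitAll shows why close(done) suits multiple waiters: every goroutine
// blocked on <-done is released at once, whereas done <- struct{}{}
// would release exactly one of them.
func waitAll(n int) {
	done := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-done // unblocks in all n goroutines when done is closed
		}()
	}
	close(done)
	wg.Wait()
}
```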
Force-pushed from 1ae43f80f0 to efebe91369
Force-pushed from efebe91369 to f77afa643b
Force-pushed from f77afa643b to b398d6ca88
Force-pushed from b398d6ca88 to 026c2507d4
Force-pushed from 026c2507d4 to bdf3c02b94
Force-pushed from bdf3c02b94 to fde08db5ad
@ -1122,6 +1200,8 @@ func (c *cfg) signalWatcher(ctx context.Context) {
}
func (c *cfg) reloadConfig(ctx context.Context) {
unlock := c.LockAppConfigExclusive()
Looks like the mutex here is redundant; this functionality is already implemented on top of internals.healthStatus *atomic.Int32.
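A minimal sketch of how such a lock could look; LockAppConfigExclusive is the name from the diff, while the shared variant and the field are assumptions:

```go
package srv

import "sync"

// internals is a stand-in for the node's shared state; only the lock
// around the application config is shown here.
type internals struct {
	appCfgMtx sync.RWMutex
}

// LockAppConfigExclusive is what reloadConfig takes on SIGHUP: no readers
// or other reloads run until the returned unlock is called.
func (i *internals) LockAppConfigExclusive() (unlock func()) {
	i.appCfgMtx.Lock()
	return i.appCfgMtx.Unlock
}

// LockAppConfigShared is a hypothetical read-side counterpart for the
// reconnect goroutines: many readers may hold it concurrently.
func (i *internals) LockAppConfigShared() (unlock func()) {
	i.appCfgMtx.RLock()
	return i.appCfgMtx.RUnlock
}
```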
Force-pushed from 4e3ba1e7ec to a86299ceae
Force-pushed from a86299ceae to 737466f578
Force-pushed from 737466f578 to 034b884088
Force-pushed from 034b884088 to 61da7dca24