[#291] server auto re-binding #309

Merged
alexvanin merged 1 commit from pogpp/frostfs-s3-gw:feature/291-rebinding into master 2024-03-27 16:08:29 +00:00
Member

Signed-off-by: Pavel Pogodaev p.pogodaev@yadro.com

Signed-off-by: Pavel Pogodaev <p.pogodaev@yadro.com>
pogpp force-pushed feature/291-rebinding from 3a66494805 to 07c5d6a15b 2024-02-12 09:04:15 +00:00 Compare
alexvanin reviewed 2024-02-13 13:15:05 +00:00
alexvanin left a comment
Owner

Can we look at one more thing, that is not implemented in storage, as far as I see.

What happens when link is down on one of the connected endpoints?

  1. Can application detect it by looking at the Server structure?
  2. Is it required to reschedule reconnect when link is up again?
  3. Can we do that in runtime?
Can we look at one more thing, that **is not** implemented in storage, as far as I see. What happens when link is down on one of the connected endpoints? 1) Can application detect it by looking at the `Server` structure? 2) Is it required to reschedule reconnect when link is up again? 3) Can we do that in runtime?
cmd/s3-gw/app.go Outdated
@ -983,1 +989,4 @@
}
func (a *App) scheduleReconnect(ctx context.Context) {
a.wg.Add(1)
Owner

waitgroup is used when application awaits several jobs to be done, e.g.

for job := range jobs {
    wg.Add(1)
    go func() {
        wg.Done()
        // do work
    }
}
wg.Wait() // wait until all jobs are done in go-routines to continue

Storage node awaits all routines in main function.

S3 gateway, on the other hand, does not use wait group, but awaits channel closing.

I suggest to remove waitgroup here and do not await for reconnection in case of closing applications.

waitgroup is used when application awaits several jobs to be done, e.g. ```go for job := range jobs { wg.Add(1) go func() { wg.Done() // do work } } wg.Wait() // wait until all jobs are done in go-routines to continue ``` Storage node [awaits](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/d69d318cb0d3ba8ce2d652606fa1192ac9a42b97/cmd/frostfs-node/main.go#L171) all routines in main function. S3 gateway, on the other hand, does not use wait group, but awaits [channel closing](https://git.frostfs.info/TrueCloudLab/frostfs-s3-gw/src/commit/cc34f659d17c314cc3273023e874b6c73d1744fe/cmd/s3-gw/app.go#L656). I suggest to remove waitgroup here and do not await for reconnection in case of closing applications.
alexvanin marked this conversation as resolved
dkirillov reviewed 2024-02-14 08:09:43 +00:00
cmd/s3-gw/app.go Outdated
@ -984,0 +993,4 @@
go func() {
defer a.wg.Done()
t := time.NewTicker(a.cfg.GetDuration(cfgReconnectInterval))
Member

Why do we use a.cfg? It seems we have already got reconnect interval and save it to settings before.

Now if we don't specify this param in config we get:

panic: non-positive interval for NewTicker

goroutine 296 [running]:
time.NewTicker(0xc001aa5c20?)
        /home/denis/go/go1.20.4/src/time/tick.go:24 +0x10f
main.(*App).scheduleReconnect.func1()
        /home/denis/github/tcl/neofs-s3-gw/cmd/s3-gw/app.go:996 +0x9e
created by main.(*App).scheduleReconnect
        /home/denis/github/tcl/neofs-s3-gw/cmd/s3-gw/app.go:993 +0xaa

Why do we use `a.cfg`? It seems we have already got reconnect interval and save it to settings before. Now if we don't specify this param in config we get: ``` panic: non-positive interval for NewTicker goroutine 296 [running]: time.NewTicker(0xc001aa5c20?) /home/denis/go/go1.20.4/src/time/tick.go:24 +0x10f main.(*App).scheduleReconnect.func1() /home/denis/github/tcl/neofs-s3-gw/cmd/s3-gw/app.go:996 +0x9e created by main.(*App).scheduleReconnect /home/denis/github/tcl/neofs-s3-gw/cmd/s3-gw/app.go:993 +0xaa ```
Member

Let's also use

defer t.Stop()

after ticker initialization

and

t.Reset()

after unsuccessful reconnect

Let's also use ```golang defer t.Stop() ``` after ticker initialization and ```golang t.Reset() ``` after unsuccessful reconnect
dkirillov marked this conversation as resolved
cmd/s3-gw/app.go Outdated
@ -984,0 +1016,4 @@
zap.String("tls cert", serverInfo.TLS.CertFile), zap.String("tls key", serverInfo.TLS.KeyFile),
}
srv, err := newServer(ctx, serverInfo)
Member

It's not enough just create new server listener. We have to serve this listener, see 07c5d6a15b/cmd/s3-gw/app.go (L707-L716)

It's not enough just create new server listener. We have to serve this listener, see https://git.frostfs.info/pogpp/frostfs-s3-gw/src/commit/07c5d6a15be87848d84495e18a83291e5b206003/cmd/s3-gw/app.go#L707-L716
Member

Still valid

Still valid
dkirillov marked this conversation as resolved
cmd/s3-gw/app.go Outdated
@ -984,0 +1020,4 @@
if err != nil {
a.log.Error(logs.ServerReconnectFailed, zap.Error(err))
return false
Member

Why we exit function on first error? I think it's better to try other servers reconnect too

Why we exit function on first error? I think it's better to try other servers reconnect too
dkirillov marked this conversation as resolved
cmd/s3-gw/app.go Outdated
@ -984,0 +1024,4 @@
}
a.metrics.MarkHealthy(serverInfo.Address)
a.servers = append(a.servers, srv)
Member

It seems we have to guard a.servers by mutex, because we can read in separate goroutine

It seems we have to guard `a.servers` by mutex, because we [can read in separate goroutine](https://git.frostfs.info/pogpp/frostfs-s3-gw/src/commit/07c5d6a15be87848d84495e18a83291e5b206003/cmd/s3-gw/app.go#L886)
Member

a.serverIndex method still reads a.server without guard. The same for e831c5de82/cmd/s3-gw/app.go (L706)

`a.serverIndex` method still reads `a.server` without guard. The same for https://git.frostfs.info/pogpp/frostfs-s3-gw/src/commit/e831c5de824bd64595afa0a355bb6ffa694d8a38/cmd/s3-gw/app.go#L706
dkirillov marked this conversation as resolved
@ -25,6 +25,8 @@ peers:
priority: 2
weight: 0.9
reconnect_interval: 60000000000
Member

I believe we can write reconnect_interval: 1m.
Also we should add this param to config.env

I believe we can write `reconnect_interval: 1m`. Also we should add this param to `config.env`
Member

We also have to update docs/configuration.md and changelog

We also have to update `docs/configuration.md` and changelog
Member

config.env still missing

`config.env` still missing
Member

still valid

still valid
dkirillov marked this conversation as resolved
pogpp force-pushed feature/291-rebinding from 07c5d6a15b to 4e5c1905d4 2024-02-14 11:45:05 +00:00 Compare
pogpp force-pushed feature/291-rebinding from 4e5c1905d4 to e831c5de82 2024-02-14 17:40:05 +00:00 Compare
pogpp closed this pull request 2024-02-19 07:23:56 +00:00
pogpp reopened this pull request 2024-02-19 07:24:05 +00:00
pogpp changed title from WIP: [#291] server auto re-binding to [#291] server auto re-binding 2024-02-19 07:24:34 +00:00
dkirillov reviewed 2024-02-20 13:28:08 +00:00
cmd/s3-gw/app.go Outdated
@ -877,6 +885,8 @@ func (a *App) updateServers() error {
}
func (a *App) serverIndex(address string) int {
a.settings.mu.RLock()
Member

I don't like using mutex from settings. Can we add separate mutext?

I don't like using mutex from settings. Can we add separate mutext?
dkirillov marked this conversation as resolved
dkirillov reviewed 2024-02-20 13:29:01 +00:00
cmd/s3-gw/app.go Outdated
@ -969,0 +983,4 @@
}
func (a *App) getServer(index int) Server {
a.settings.mu.RLock()
Member

Let's accept address here and return nil in case of missing. Then we don't need a.serverIndex method

Let's accept address here and return nil in case of missing. Then we don't need `a.serverIndex` method
dkirillov marked this conversation as resolved
dkirillov reviewed 2024-02-20 13:30:04 +00:00
cmd/s3-gw/app.go Outdated
@ -984,0 +1047,4 @@
a.log.Info(logs.StartingServer, zap.String("address", srv.Address()))
if err = sr.Serve(srv.Listener()); err != nil && !errors.Is(err, http.ErrServerClosed) {
a.metrics.MarkUnhealthy(srv.Address())
Member

Ineffective setting unhealthy because below we invoke a.metrics.MarkHealthy anyway

Ineffective setting unhealthy because below we invoke `a.metrics.MarkHealthy` anyway
Member

Should we invoke srv.Listener().Close() and try again init server on next attempt?

Should we invoke `srv.Listener().Close()` and try again init server on next attempt?
dkirillov marked this conversation as resolved
pogpp force-pushed feature/291-rebinding from 9fd42c1c9e to efc7de7413 2024-02-21 12:35:38 +00:00 Compare
pogpp force-pushed feature/291-rebinding from efc7de7413 to 4b333f5d04 2024-02-21 12:36:30 +00:00 Compare
pogpp force-pushed feature/291-rebinding from 4b333f5d04 to b924d69d83 2024-02-21 12:41:23 +00:00 Compare
dkirillov reviewed 2024-02-26 08:13:08 +00:00
dkirillov left a comment
Member

Please, rebase

Please, rebase
cmd/s3-gw/app.go Outdated
@ -707,1 +708,3 @@
a.metrics.MarkUnhealthy(a.servers[i].Address())
for i := range servs {
go func(i int) {
a.log.Info(logs.StartingServer, zap.String("address", servs[i].Address()))
Member

Actually, in general such using servs[i] isn't safe because in the same time we can update server at i index from other goroutine. But in our current case this cannot happen because in other goroutine we do only append and simple read.

cc @alexvanin

Actually, in general such using `servs[i]` isn't safe because in the same time we can update server at `i` index from other goroutine. But in our current case this cannot happen because in other goroutine we do only append and simple read. cc @alexvanin
Owner

It is similar to passing pointer of servs[i] structure to goroutine. So I think we are okay with this here.

It is similar to passing pointer of `servs[i]` structure to goroutine. So I think we are okay with this here.
dkirillov marked this conversation as resolved
cmd/s3-gw/app.go Outdated
@ -863,3 +868,2 @@
if serverInfo.TLS.Enabled {
if err := a.servers[index].UpdateCert(serverInfo.TLS.CertFile, serverInfo.TLS.KeyFile); err != nil {
if err := a.getServer(serverInfo.Address).UpdateCert(serverInfo.TLS.CertFile, serverInfo.TLS.KeyFile); err != nil {
Member

Here we must check a.getServer for nil otherwise we get panic:

2024-02-26T10:59:15.575+0300    info    s3-gw/app.go:781        SIGHUP config reload started
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xec2ad9]

goroutine 281 [running]:
main.(*App).updateServers(0xc000021900)
        /home/denis/github/tcl/neofs-s3-gw/cmd/s3-gw/app.go:869 +0x99
main.(*App).configReload(0xc000021900, {0x156bc58, 0xc00012efc0})
        /home/denis/github/tcl/neofs-s3-gw/cmd/s3-gw/app.go:796 +0x365
main.(*App).Serve(0xc000021900, {0x156bc58, 0xc00012efc0})
        /home/denis/github/tcl/neofs-s3-gw/cmd/s3-gw/app.go:740 +0x68f
created by main.main in goroutine 1
        /home/denis/github/tcl/neofs-s3-gw/cmd/s3-gw/main.go:17 +0x105


Here we must check `a.getServer` for `nil` otherwise we get `panic`: ``` 2024-02-26T10:59:15.575+0300 info s3-gw/app.go:781 SIGHUP config reload started panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xec2ad9] goroutine 281 [running]: main.(*App).updateServers(0xc000021900) /home/denis/github/tcl/neofs-s3-gw/cmd/s3-gw/app.go:869 +0x99 main.(*App).configReload(0xc000021900, {0x156bc58, 0xc00012efc0}) /home/denis/github/tcl/neofs-s3-gw/cmd/s3-gw/app.go:796 +0x365 main.(*App).Serve(0xc000021900, {0x156bc58, 0xc00012efc0}) /home/denis/github/tcl/neofs-s3-gw/cmd/s3-gw/app.go:740 +0x68f created by main.main in goroutine 1 /home/denis/github/tcl/neofs-s3-gw/cmd/s3-gw/main.go:17 +0x105 ```
dkirillov marked this conversation as resolved
cmd/s3-gw/app.go Outdated
@ -877,6 +881,8 @@ func (a *App) updateServers() error {
}
func (a *App) serverIndex(address string) int {
Member

This method isn't being used anymore. Can be dropped

This method isn't being used anymore. Can be dropped
dkirillov marked this conversation as resolved
pogpp force-pushed feature/291-rebinding from b924d69d83 to 8a87d9b15b 2024-02-27 10:46:08 +00:00 Compare
pogpp force-pushed feature/291-rebinding from 8a87d9b15b to 777ae514bc 2024-02-27 10:46:58 +00:00 Compare
dkirillov approved these changes 2024-02-27 12:24:00 +00:00
dkirillov left a comment
Member

LGTM, but we should decide something about #309 (comment)

LGTM, but we should decide something about https://git.frostfs.info/TrueCloudLab/frostfs-s3-gw/pulls/309#issuecomment-34098
Owner

I am gonna merge it after v0.29.0-rc.1 tag, because I want to have this feature in the same HTTP and S3 releases. HTTP is already passed across v0.29.0-rc.1.

I am gonna merge it after `v0.29.0-rc.1` tag, because I want to have this feature in the same HTTP and S3 releases. HTTP is already passed across `v0.29.0-rc.1`.
dkirillov requested changes 2024-02-29 09:08:48 +00:00
dkirillov left a comment
Member

The scenario I want to pass:

  1. Start s3-gw with the following config:
server:
  - address: 0.0.0.0:8084
    tls:
      enabled: true
      cert_file: /valid/path/to/cert.pem
      key_file: /valid/path/to/key.pem
  - address: 0.0.0.0:8185
    tls:
      enabled: true
      cert_file: /not/existing/path/to/cert.pem
      key_file: /not/existing/path/to/key.pem
  1. Change config and send SIGHUP
server:
  - address: 0.0.0.0:8084
    tls:
      enabled: true
      cert_file: /valid/path/to/cert.pem
      key_file: /valid/path/to/key.pem
  - address: 0.0.0.0:8185
    tls:
      enabled: true
      cert_file: /valid/path/to/cert.pem
      key_file: /valid/path/to/key.pem
  1. Reconnect must be succesfull
The scenario I want to pass: 1. Start s3-gw with the following config: ```yaml server: - address: 0.0.0.0:8084 tls: enabled: true cert_file: /valid/path/to/cert.pem key_file: /valid/path/to/key.pem - address: 0.0.0.0:8185 tls: enabled: true cert_file: /not/existing/path/to/cert.pem key_file: /not/existing/path/to/key.pem ``` 2. Change config and send `SIGHUP` ```yaml server: - address: 0.0.0.0:8084 tls: enabled: true cert_file: /valid/path/to/cert.pem key_file: /valid/path/to/key.pem - address: 0.0.0.0:8185 tls: enabled: true cert_file: /valid/path/to/cert.pem key_file: /valid/path/to/key.pem ``` 3. Reconnect must be succesfull
cmd/s3-gw/app.go Outdated
@ -73,2 +73,3 @@
servers []Server
servers []Server
unbindServers []ServerInfo
Member

It seems we have to update unbindServers on SIGHUP, similar to a.updateServer.

cc @alexvanin

It seems we have to update `unbindServers` on `SIGHUP`, similar to `a.updateServer`. cc @alexvanin
dkirillov marked this conversation as resolved
cmd/s3-gw/app.go Outdated
@ -984,0 +1031,4 @@
zap.String("tls cert", serverInfo.TLS.CertFile), zap.String("tls key", serverInfo.TLS.KeyFile),
}
srv, err := newServer(ctx, serverInfo)
Member

We will get error here in the following scenario.
Let's say we have the following config:

server:
  - address: 0.0.0.0:8084
    tls:
      enabled: true
      cert_file: /valid/path/to/cert.pem
      key_file: /valid/path/to/key.pem
  - address: 0.0.0.0:8185
    tls:
      enabled: true
      cert_file: /not/existing/path/to/cert.pem
      key_file: /not/existing/path/to/key.pem

Then we successfully start one server and will try reconnect to another:

2024-02-29T11:57:45.180+0300    info    s3-gw/app.go:855        add server      {"address": "0.0.0.0:8084", "tls enabled": true, "tls cert": "/valid/path/to/cert.pem", "tls key": "/valid/path/to/key.pem"}
2024-02-29T11:57:45.180+0300    warn    s3-gw/app.go:849        failed to add server    {"address": "0.0.0.0:8185", "tls enabled": true, "tls cert": "/not/existing/path/to/cert.pem", "tls key": "/not/existing/path/to/key.pem", "error": "failed to update cert: cannot load TLS key pair from certFile '/not/existing/path/to/cert.pem' and keyFile '/not/existing/path/to/key.pem': open /not/existing/path/to/cert.pem: no such file or directory"}

but on reconnect we get:

2024-02-29T11:58:00.181+0300    info    s3-gw/app.go:1025       reconnecting server...
2024-02-29T11:58:00.181+0300    warn    s3-gw/app.go:1036       failed to reconnect server      {"error": "could not prepare listener: listen tcp 0.0.0.0:8185: bind: address already in use"}

This is because we don't close listerer on error here.

After some trying listener be closed (by timeout I suppose, or by GC):

2024-02-29T12:01:45.193+0300    warn    s3-gw/app.go:1036       failed to reconnect server      {"error": "failed to update cert: cannot load TLS key pair from certFile '/not/existing/path/to/cert.pem' and keyFile '/not/existing/path/to/key.pem': open /not/existing/path/to/cert.pem: no such file or directory"}

but it would be nice explicitly close the listener. (Thank Svace)

We will get error here in the following scenario. Let's say we have the following config: ```yaml server: - address: 0.0.0.0:8084 tls: enabled: true cert_file: /valid/path/to/cert.pem key_file: /valid/path/to/key.pem - address: 0.0.0.0:8185 tls: enabled: true cert_file: /not/existing/path/to/cert.pem key_file: /not/existing/path/to/key.pem ``` Then we successfully start one server and will try reconnect to another: ``` 2024-02-29T11:57:45.180+0300 info s3-gw/app.go:855 add server {"address": "0.0.0.0:8084", "tls enabled": true, "tls cert": "/valid/path/to/cert.pem", "tls key": "/valid/path/to/key.pem"} 2024-02-29T11:57:45.180+0300 warn s3-gw/app.go:849 failed to add server {"address": "0.0.0.0:8185", "tls enabled": true, "tls cert": "/not/existing/path/to/cert.pem", "tls key": "/not/existing/path/to/key.pem", "error": "failed to update cert: cannot load TLS key pair from certFile '/not/existing/path/to/cert.pem' and keyFile '/not/existing/path/to/key.pem': open /not/existing/path/to/cert.pem: no such file or directory"} ``` but on reconnect we get: ``` 2024-02-29T11:58:00.181+0300 info s3-gw/app.go:1025 reconnecting server... 2024-02-29T11:58:00.181+0300 warn s3-gw/app.go:1036 failed to reconnect server {"error": "could not prepare listener: listen tcp 0.0.0.0:8185: bind: address already in use"} ``` This is because we don't close [listerer](https://git.frostfs.info/pogpp/frostfs-s3-gw/src/commit/777ae514bcfc4af035de4cfebcfa413e4a2bc14c/cmd/s3-gw/server.go#L60) on error [here](https://git.frostfs.info/pogpp/frostfs-s3-gw/src/commit/777ae514bcfc4af035de4cfebcfa413e4a2bc14c/cmd/s3-gw/server.go#L71). After some trying listener be closed (by timeout I suppose, or by GC): ``` 2024-02-29T12:01:45.193+0300 warn s3-gw/app.go:1036 failed to reconnect server {"error": "failed to update cert: cannot load TLS key pair from certFile '/not/existing/path/to/cert.pem' and keyFile '/not/existing/path/to/key.pem': open /not/existing/path/to/cert.pem: no such file or directory"} ``` but it would be nice explicitly close the listener. (Thank Svace)
dkirillov marked this conversation as resolved
pogpp force-pushed feature/291-rebinding from 777ae514bc to d3e1753438 2024-03-25 12:00:22 +00:00 Compare
pogpp force-pushed feature/291-rebinding from d3e1753438 to e1ae3f3c29 2024-03-26 07:52:05 +00:00 Compare
dkirillov reviewed 2024-03-26 13:02:48 +00:00
cmd/s3-gw/app.go Outdated
@ -977,0 +1045,4 @@
a.log.Info(logs.ServerReconnecting)
var failedServers []ServerInfo
for _, serverInfo := range a.unbindServers {
Member

Since we update a.unbindServers on SIGHUP we must protect them here by mutex

Since we update `a.unbindServers` on SIGHUP we must protect them here by mutex
dkirillov marked this conversation as resolved
pogpp force-pushed feature/291-rebinding from e1ae3f3c29 to b1ad21edc7 2024-03-26 15:27:16 +00:00 Compare
dkirillov reviewed 2024-03-27 06:38:39 +00:00
cmd/s3-gw/app.go Outdated
@ -977,0 +1066,4 @@
a.log.Info(logs.ServerReconnectedSuccessfully, fields...)
}
a.unbindServers = failedServers
Member

We must protect this too

We must protect this too
dkirillov marked this conversation as resolved
dkirillov reviewed 2024-03-27 06:39:41 +00:00
cmd/s3-gw/app.go Outdated
@ -977,0 +1040,4 @@
a.log.Info(logs.ServerReconnecting)
var failedServers []ServerInfo
servers := a.getUnbindServers()
Member

We cannot just get all servers by mutex and then read them because in updateUnbindServerInfo we update single i-th item (not whole slice)

We cannot just get all servers by mutex and then read them because in `updateUnbindServerInfo` we update single i-th item (not whole slice)
dkirillov marked this conversation as resolved
pogpp force-pushed feature/291-rebinding from b1ad21edc7 to 937407a6da 2024-03-27 08:28:13 +00:00 Compare
pogpp force-pushed feature/291-rebinding from 937407a6da to 7d9ba1a87f 2024-03-27 08:44:14 +00:00 Compare
pogpp force-pushed feature/291-rebinding from 7d9ba1a87f to 573fe68fe2 2024-03-27 09:22:26 +00:00 Compare
pogpp force-pushed feature/291-rebinding from 573fe68fe2 to 32cd3d0b92 2024-03-27 10:34:10 +00:00 Compare
dkirillov approved these changes 2024-03-27 11:18:05 +00:00
pogpp force-pushed feature/291-rebinding from 32cd3d0b92 to bfcde09f07 2024-03-27 11:29:03 +00:00 Compare
dkirillov approved these changes 2024-03-27 14:27:34 +00:00
dkirillov requested review from storage-services-developers 2024-03-27 14:27:40 +00:00
dkirillov requested review from storage-services-committers 2024-03-27 14:27:41 +00:00
alexvanin approved these changes 2024-03-27 16:08:22 +00:00
alexvanin merged commit bfcde09f07 into master 2024-03-27 16:08:29 +00:00
alexvanin deleted branch feature/291-rebinding 2024-03-27 16:08:29 +00:00
Sign in to join this conversation.
No reviewers
TrueCloudLab/storage-services-developers
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: TrueCloudLab/frostfs-s3-gw#309
No description provided.