[#291] server auto re-binding #309
Reference: TrueCloudLab/frostfs-s3-gw#309
No description provided.
Signed-off-by: Pavel Pogodaev p.pogodaev@yadro.com
3a66494805 to 07c5d6a15b
Can we look at one more thing that is not implemented in storage, as far as I see.
What happens when the link is down on one of the connected endpoints?
Server structure?
@ -983,1 +989,4 @@
}
func (a *App) scheduleReconnect(ctx context.Context) {
a.wg.Add(1)
A waitgroup is used when the application awaits several jobs to be done, e.g. the storage node awaits all routines in its main function.
The S3 gateway, on the other hand, does not use a wait group but awaits channel closing.
I suggest removing the waitgroup here and not waiting for reconnection when the application is shutting down.
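A minimal sketch of that suggestion, assuming the reconnect work lives in a helper (tryReconnect and the settings field are hypothetical names); the goroutine simply exits when the application context is cancelled, so shutdown does not wait for an in-flight attempt:

func (a *App) scheduleReconnect(ctx context.Context) {
	go func() {
		t := time.NewTicker(a.settings.reconnectInterval) // hypothetical settings field, see the next comment
		defer t.Stop()
		for {
			select {
			case <-t.C:
				if a.tryReconnect(ctx) { // hypothetical helper that re-binds the failed endpoints
					return
				}
			case <-ctx.Done():
				return
			}
		}
	}()
}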
@ -984,0 +993,4 @@
go func() {
defer a.wg.Done()
t := time.NewTicker(a.cfg.GetDuration(cfgReconnectInterval))
Why do we use
a.cfg
? It seems we have already read the reconnect interval and saved it to settings before.
Now, if we don't specify this param in the config, we get:
Let's also use it after ticker initialization and after an unsuccessful reconnect.
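The exact calls are elided in the comment above; one plausible reading is defer t.Stop() right after the ticker is created and t.Reset(...) after a failed attempt, with the interval taken from settings rather than re-read from a.cfg (the accessor name below is an assumption):

interval := a.settings.ReconnectInterval() // parsed once on start-up, hypothetical accessor
t := time.NewTicker(interval)
defer t.Stop() // after ticker initialization

for {
	select {
	case <-t.C:
		if !a.tryReconnect(ctx) { // hypothetical helper
			t.Reset(interval) // after an unsuccessful reconnect
			continue
		}
		return
	case <-ctx.Done():
		return
	}
}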
@ -984,0 +1016,4 @@
zap.String("tls cert", serverInfo.TLS.CertFile), zap.String("tls key", serverInfo.TLS.KeyFile),
}
srv, err := newServer(ctx, serverInfo)
It's not enough to just create a new server listener. We also have to serve this listener, see
07c5d6a15b/cmd/s3-gw/app.go (L707-L716)
Still valid
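A hedged sketch of what serving the re-created listener could look like, mirroring the start-up loop referenced above (sr is assumed to be the same HTTP server used there, and the log constant is reused from the quoted code for illustration):

srv, err := newServer(ctx, serverInfo)
if err != nil {
	a.log.Error(logs.ServerReconnectFailed, zap.Error(err))
	continue
}

go func() {
	a.log.Info(logs.StartingServer, zap.String("address", srv.Address()))
	if err := sr.Serve(srv.Listener()); err != nil && !errors.Is(err, http.ErrServerClosed) {
		a.metrics.MarkUnhealthy(srv.Address())
		a.log.Error(logs.ServerReconnectFailed, zap.Error(err))
	}
}()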
@ -984,0 +1020,4 @@
if err != nil {
a.log.Error(logs.ServerReconnectFailed, zap.Error(err))
return false
Why do we exit the function on the first error? I think it's better to try to reconnect the other servers too.
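Something along these lines, collecting the endpoints that failed so the next tick retries only those (this failedServers approach shows up later in the diff):

var failedServers []ServerInfo
for _, serverInfo := range a.unbindServers {
	srv, err := newServer(ctx, serverInfo)
	if err != nil {
		a.log.Error(logs.ServerReconnectFailed, zap.Error(err))
		failedServers = append(failedServers, serverInfo) // keep it for the next attempt
		continue
	}
	a.metrics.MarkHealthy(serverInfo.Address)
	a.servers = append(a.servers, srv)
}
a.unbindServers = failedServers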
@ -984,0 +1024,4 @@
}
a.metrics.MarkHealthy(serverInfo.Address)
a.servers = append(a.servers, srv)
It seems we have to guard
a.servers
with a mutex, because we can read it in a separate goroutine.
The a.serverIndex method still reads a.servers without a guard. The same for e831c5de82/cmd/s3-gw/app.go (L706)
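A sketch of guarded access, assuming a dedicated RWMutex field (mu) is added to App; every other read of a.servers would go through the same mutex:

func (a *App) appendServer(srv Server) {
	a.mu.Lock()
	a.servers = append(a.servers, srv)
	a.mu.Unlock()
}

func (a *App) getServer(address string) Server {
	a.mu.RLock()
	defer a.mu.RUnlock()
	for i := range a.servers {
		if a.servers[i].Address() == address {
			return a.servers[i]
		}
	}
	return nil
}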
@ -25,6 +25,8 @@ peers:
priority: 2
weight: 0.9
reconnect_interval: 60000000000
I believe we can write
reconnect_interval: 1m
. Also we should add this param to
config.env
We also have to update
docs/configuration.md
and the changelog.
config.env is still missing. Still valid.
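For example, in the YAML config and in config.env (the environment variable name follows the usual S3_GW_ prefix and is an assumption):

# config.yaml
reconnect_interval: 1m

# config.env
S3_GW_RECONNECT_INTERVAL=1m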
07c5d6a15b to 4e5c1905d4
4e5c1905d4 to e831c5de82
Title changed from "WIP: [#291] server auto re-binding" to "[#291] server auto re-binding"
@ -877,6 +885,8 @@ func (a *App) updateServers() error {
}
func (a *App) serverIndex(address string) int {
a.settings.mu.RLock()
I don't like using the mutex from settings. Can we add a separate mutex?
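For instance, a dedicated field next to the slice it guards (the name is an assumption):

type App struct {
	// ...
	mu      sync.RWMutex // guards servers (and, later, unbindServers)
	servers []Server
	// ...
}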
@ -969,0 +983,4 @@
}
func (a *App) getServer(index int) Server {
a.settings.mu.RLock()
Let's accept an address here and return nil if it's missing. Then we don't need the
a.serverIndex
method.
@ -984,0 +1047,4 @@
a.log.Info(logs.StartingServer, zap.String("address", srv.Address()))
if err = sr.Serve(srv.Listener()); err != nil && !errors.Is(err, http.ErrServerClosed) {
a.metrics.MarkUnhealthy(srv.Address())
Setting unhealthy here is ineffective, because below we invoke
a.metrics.MarkHealthy
anyway.
Should we invoke
srv.Listener().Close()
and try to init the server again on the next attempt?
9fd42c1c9e to efc7de7413
efc7de7413 to 4b333f5d04
4b333f5d04 to b924d69d83
Please rebase.
@ -707,1 +708,3 @@
a.metrics.MarkUnhealthy(a.servers[i].Address())
for i := range servs {
go func(i int) {
a.log.Info(logs.StartingServer, zap.String("address", servs[i].Address()))
Actually, in general such use of
servs[i]
isn't safe, because at the same time we can update the server at index i from another goroutine. But in our current case this cannot happen, because in the other goroutine we only append and do a simple read. cc @alexvanin
It is similar to passing a pointer to the
servs[i]
structure into the goroutine. So I think we are okay with this here.
@ -863,3 +868,2 @@
if serverInfo.TLS.Enabled {
if err := a.servers[index].UpdateCert(serverInfo.TLS.CertFile, serverInfo.TLS.KeyFile); err != nil {
if err := a.getServer(serverInfo.Address).UpdateCert(serverInfo.TLS.CertFile, serverInfo.TLS.KeyFile); err != nil {
Here we must check
a.getServer
for nil, otherwise we get a panic:
@ -877,6 +881,8 @@ func (a *App) updateServers() error {
}
func (a *App) serverIndex(address string) int {
This method isn't used anymore and can be dropped.
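A sketch of the address-based lookup in updateServers that makes serverIndex unnecessary and avoids the panic mentioned above (the error handling is a placeholder):

srv := a.getServer(serverInfo.Address)
if srv == nil {
	continue // the endpoint is not bound yet, nothing to update
}
if serverInfo.TLS.Enabled {
	if err := srv.UpdateCert(serverInfo.TLS.CertFile, serverInfo.TLS.KeyFile); err != nil {
		a.log.Warn("failed to update cert", zap.Error(err)) // placeholder message; the real code may use a logs constant
	}
}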
b924d69d83 to 8a87d9b15b
8a87d9b15b to 777ae514bc
LGTM, but we should decide something about #309 (comment)
I am going to merge it after the
v0.29.0-rc.1
tag, because I want to have this feature in the same HTTP and S3 releases. HTTP has already passed v0.29.0-rc.1.
The scenario I want to pass:
SIGHUP
@ -73,2 +73,3 @@
servers []Server
servers []Server
unbindServers []ServerInfo
It seems we have to update
unbindServers
on SIGHUP, similar to a.updateServers. cc @alexvanin
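A rough sketch of what that could mean: when the config is reloaded on SIGHUP, rebuild the list of endpoints that are not bound so the reconnect loop picks them up (the config iteration, helper names, and locking below are assumptions):

// inside the SIGHUP handler, after updateServers
var unbind []ServerInfo
for _, serverInfo := range serversInfo { // hypothetical: endpoints parsed from the reloaded config
	if a.getServer(serverInfo.Address) != nil {
		continue // already bound and serving
	}
	unbind = append(unbind, serverInfo)
}

a.mu.Lock()
a.unbindServers = unbind
a.mu.Unlock()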
@ -984,0 +1031,4 @@
zap.String("tls cert", serverInfo.TLS.CertFile), zap.String("tls key", serverInfo.TLS.KeyFile),
}
srv, err := newServer(ctx, serverInfo)
We will get an error here in the following scenario.
Let's say we have the following config:
Then we successfully start one server and try to reconnect to another:
but on reconnect we get:
This is because we don't close the listener on error here.
After a few attempts the listener does get closed (by timeout I suppose, or by the GC):
but it would be nice to close the listener explicitly. (Thanks, Svace)
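A sketch of the explicit close, shown inside a hypothetical newServer-style helper: if anything fails after the listener has been created, close it so the next reconnect attempt can bind the address again instead of hitting "address already in use":

ln, err := net.Listen("tcp", serverInfo.Address)
if err != nil {
	return nil, err
}

if serverInfo.TLS.Enabled {
	cert, err := tls.LoadX509KeyPair(serverInfo.TLS.CertFile, serverInfo.TLS.KeyFile)
	if err != nil {
		ln.Close() // otherwise the port stays bound until the runtime releases the listener
		return nil, fmt.Errorf("load TLS certs: %w", err)
	}
	ln = tls.NewListener(ln, &tls.Config{Certificates: []tls.Certificate{cert}})
}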
777ae514bc to d3e1753438
d3e1753438 to e1ae3f3c29
@ -977,0 +1045,4 @@
a.log.Info(logs.ServerReconnecting)
var failedServers []ServerInfo
for _, serverInfo := range a.unbindServers {
Since we update
a.unbindServers
on SIGHUP, we must protect them here with a mutex.
e1ae3f3c29 to b1ad21edc7
@ -977,0 +1066,4 @@
a.log.Info(logs.ServerReconnectedSuccessfully, fields...)
}
a.unbindServers = failedServers
We must protect this too
@ -977,0 +1040,4 @@
a.log.Info(logs.ServerReconnecting)
var failedServers []ServerInfo
servers := a.getUnbindServers()
We cannot just take all servers under the mutex once and then read them, because in
updateUnbindServerInfo
we update a single i-th item (not the whole slice).
b1ad21edc7 to 937407a6da
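A sketch of the per-index read the comment above describes, assuming the same dedicated mutex guards a.unbindServers:

a.mu.RLock()
n := len(a.unbindServers)
a.mu.RUnlock()

for i := 0; i < n; i++ {
	a.mu.RLock()
	serverInfo := a.unbindServers[i] // re-read under the lock: updateUnbindServerInfo may have replaced this item
	a.mu.RUnlock()
	// ... try to re-bind serverInfo as before ...
}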
937407a6da to 7d9ba1a87f
7d9ba1a87f to 573fe68fe2
573fe68fe2 to 32cd3d0b92
32cd3d0b92 to bfcde09f07