Fix connection leak and panic at close operation #171

Merged
alexvanin merged 3 commits from alexvanin/frostfs-sdk-go:fix/pool-panic-leak into master 2024-09-04 19:51:15 +00:00
Owner

This PR fixes two issues.

Connection leak at restarts

Consider endpoint which accepts pool connections, but returns error on all RPCs like getting node info or session token.

After successful dial(), client keeps an open connection, but it is marked as unhealthy. After rebalance interval, unhealthy connection is re-established in restartIfUnhealthy function by creating new client and calling dial() once again. While this is correct behaviour, pool should close unhealthy connection before that.

To do that, pool should maintain dialing status. The easiest way is to modify health status to store three states:

  • dial failed,
  • dial ok, but then connection is failed,
  • everything is ok.

Then restartIfUnhealthy may close all dialed connection and avoid connection leak during restarts.

Panic at connection closing

If some pool connection are dead at initialization or during rebalance, then dial() returns error and connection is not established. Pool can't call Close() on such connections, because it leads to panics.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x8aa088]
goroutine 365 [running]:
google.golang.org/grpc.(*ClientConn).Close(0x0)
        google.golang.org/grpc@v1.56.1/clientconn.go:1195 +0x48    
git.frostfs.info/TrueCloudLab/frostfs-sdk-go/client.(*Client).Close(...)
        git.frostfs.info/TrueCloudLab/frostfs-sdk-go@v0.0.0-20230920091613-99c273f4993b/client/client.go:140
git.frostfs.info/TrueCloudLab/frostfs-sdk-go/pool.(*clientWrapper).close(0x13bfb20?)
        git.frostfs.info/TrueCloudLab/frostfs-sdk-go@v0.0.0-20230920091613-99c273f4993b/pool/pool.go:1087 +0x29
git.frostfs.info/TrueCloudLab/frostfs-sdk-go/pool.(*Pool).Close(0xc0005b6480)

New health status allows to close only dialed connections therefore avoid panic.

This PR fixes two issues. ## Connection leak at restarts Consider endpoint which accepts pool connections, but returns error on all RPCs like getting node info or session token. After successful `dial()`, client keeps an open connection, but it is marked as _unhealthy_. After rebalance interval, _unhealthy_ connection is re-established in `restartIfUnhealthy` function by creating new client and calling `dial()` once again. While this is correct behaviour, pool should close unhealthy connection before that. To do that, pool should maintain dialing status. The easiest way is to modify health status to store three states: - dial failed, - dial ok, but then connection is failed, - everything is ok. Then `restartIfUnhealthy` may close all dialed connection and avoid connection leak during restarts. ## Panic at connection closing If some pool connection are dead at initialization or during rebalance, then `dial()` returns error and connection is not established. Pool can't call `Close()` on such connections, because it leads to panics. ``` panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x8aa088] goroutine 365 [running]: google.golang.org/grpc.(*ClientConn).Close(0x0) google.golang.org/grpc@v1.56.1/clientconn.go:1195 +0x48 git.frostfs.info/TrueCloudLab/frostfs-sdk-go/client.(*Client).Close(...) git.frostfs.info/TrueCloudLab/frostfs-sdk-go@v0.0.0-20230920091613-99c273f4993b/client/client.go:140 git.frostfs.info/TrueCloudLab/frostfs-sdk-go/pool.(*clientWrapper).close(0x13bfb20?) git.frostfs.info/TrueCloudLab/frostfs-sdk-go@v0.0.0-20230920091613-99c273f4993b/pool/pool.go:1087 +0x29 git.frostfs.info/TrueCloudLab/frostfs-sdk-go/pool.(*Pool).Close(0xc0005b6480) ``` New health status allows to close only dialed connections therefore avoid panic.
alexvanin added the
bug
pool
labels 2023-10-03 14:00:19 +00:00
alexvanin self-assigned this 2023-10-03 14:00:19 +00:00
alexvanin force-pushed fix/pool-panic-leak from 9a950b4195 to de65d27636 2023-10-03 14:06:14 +00:00 Compare
alexvanin requested review from fyrchik 2023-10-03 14:07:23 +00:00
alexvanin requested review from dstepanov-yadro 2023-10-03 14:07:24 +00:00
alexvanin requested review from dkirillov 2023-10-03 14:07:24 +00:00
alexvanin requested review from pogpp 2023-10-03 14:07:25 +00:00
alexvanin requested review from mbiryukova 2023-10-03 14:07:25 +00:00
alexvanin requested review from r.loginov 2023-10-03 14:07:26 +00:00
fyrchik approved these changes 2023-10-03 14:13:57 +00:00
pool/pool.go Outdated
@ -2780,3 +2812,3 @@
for _, pools := range p.innerPools {
for _, cli := range pools.clients {
_ = cli.close()
if cli.isDialed() {
Owner

Wouldn't it be better from the client POV to just allow closing even undialed clients?

Wouldn't it be better from the client POV to just allow closing even undialed clients?
Author
Owner

It would, but simple solution down below didn't work, so I suppose changes should be done on api-go level as well.

func (c *Client) Close() error {
	if c.c.Conn() != nil {
		return c.c.Conn().Close()
	}
	return nil

Function comment clearly tells us MUST NOT be called before successful Dial. So invoker (pool) should check dial state, which it does now.

It would, but simple solution down below didn't work, so I suppose changes should be done on api-go level as well. ```go func (c *Client) Close() error { if c.c.Conn() != nil { return c.c.Conn().Close() } return nil ``` Function comment clearly tells us `MUST NOT be called before successful Dial`. So invoker (pool) should check dial state, which it does now.
dkirillov approved these changes 2023-10-03 14:31:57 +00:00
dstepanov-yadro requested changes 2023-10-03 16:07:55 +00:00
pool/pool.go Outdated
@ -86,6 +86,8 @@ type client interface {
type clientStatus interface {
// isHealthy checks if the connection can handle requests.
isHealthy() bool
// isDialled checks if the connection was created.

isDialled -> isDialed

isDialled -> isDialed
alexvanin force-pushed fix/pool-panic-leak from de65d27636 to 60463871db 2023-10-03 16:47:29 +00:00 Compare
alexvanin requested review from dstepanov-yadro 2023-10-04 06:24:33 +00:00
dstepanov-yadro approved these changes 2023-10-04 06:59:57 +00:00
alexvanin merged commit 60463871db into master 2023-10-04 07:32:18 +00:00
alexvanin deleted branch fix/pool-panic-leak 2023-10-04 07:32:21 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
4 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: TrueCloudLab/frostfs-sdk-go#171
No description provided.