Add pool Update #204
Add the `Update` function to refresh the connection list in the `pool`.

Close #30
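For context, a minimal caller-side sketch of how the proposed method is meant to be used, assuming the existing `InitParameters`/`NewNodeParam` helpers and the `Update(ctx, []NodeParam) error` signature discussed below; the import path and helper calls are assumptions, not taken verbatim from this diff.

```go
package main

import (
	"context"
	"log"

	"git.frostfs.info/TrueCloudLab/frostfs-sdk-go/pool" // assumed import path
)

func main() {
	// The context must be long-lived: it also drives the rebalance routine.
	ctx := context.Background()

	var prm pool.InitParameters
	// Key, timeouts and other parameters are omitted for brevity.
	prm.AddNode(pool.NewNodeParam(1, "grpc://node1.example:8080", 1)) // assumed helper

	p, err := pool.NewPool(prm)
	if err != nil {
		log.Fatal(err)
	}
	if err := p.Dial(ctx); err != nil {
		log.Fatal(err)
	}

	// Later: refresh the node list without recreating the pool.
	newNodes := []pool.NodeParam{
		pool.NewNodeParam(1, "grpc://node2.example:8080", 1),
		pool.NewNodeParam(2, "grpc://node3.example:8080", 1),
	}
	if err := p.Update(ctx, newNodes); err != nil {
		log.Fatal(err)
	}
}
```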
feat/pool_update to Add pool Update

@ -1868,0 +1869,4 @@
p.cancel = cancel
p.closedCh = make(chan struct{})
err := p.dial(ctx, nil)
Hm, does it work? You pass `nil` as `existClients`, but in `p.dial` there is no nil validation.

If `existClients` is `nil`, then `ok` will be `false`.

d37d313460/pool/pool.go (L1893)
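For reference, the behaviour the author is relying on: a lookup in a nil Go map never panics, it just returns the zero value with `ok == false`. A tiny standalone illustration (simplified types, not the pool code):

```go
package main

import "fmt"

func main() {
	var existClients map[string]int // nil: no validation needed before reading

	// Reading from a nil map is safe and reports ok == false.
	c, ok := existClients["grpc://node1:8080"]
	fmt.Println(c, ok) // prints: 0 false

	// Only writing would panic:
	// existClients["grpc://node1:8080"] = 1 // panic: assignment to entry in nil map
}
```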
@ -1920,0 +1943,4 @@
func (p *Pool) Update(ctx context.Context, prm []NodeParam) error {
for _, pool := range p.innerPools {
pool.lock.Lock()
defer pool.lock.Unlock()
Hm, I thought some linter would complain about `defer` in a loop.

Not that it is necessarily bad, but what prevents us from using a more fine-grained locking scheme?
Fixed
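The concern is that `defer` inside the loop keeps every inner pool locked until `Update` returns. A sketch of the coarse variant next to a more fine-grained one (types simplified, not the PR's code):

```go
package main

import "sync"

type innerPool struct {
	lock    sync.RWMutex
	clients []string // stand-in for the real client slice
}

// Coarse: all pools stay locked until the function returns.
func updateCoarse(pools []*innerPool) {
	for _, ip := range pools {
		ip.lock.Lock()
		defer ip.lock.Unlock() // released only when updateCoarse returns
	}
	// ... work that touches every pool ...
}

// Finer: each pool is locked only while it is actually being modified.
func updateFine(pools []*innerPool) {
	for _, ip := range pools {
		func() {
			ip.lock.Lock()
			defer ip.lock.Unlock()
			// ... work on this pool only ...
		}()
	}
}

func main() {
	pools := []*innerPool{{}, {}}
	updateCoarse(pools)
	updateFine(pools)
}
```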
@ -1920,0 +1969,4 @@
}
for _, client := range existClients {
client.client.close()
What about error handling?
fixed
Now only the first error is returned. I suggest trying to close all the connections that are no longer needed and returning a joined error.
fixed
@ -1873,1 +1892,4 @@
for j, addr := range params.addresses {
if client, ok := existClients[NodeParam{address: addr, priority: params.priority}]; ok {
clients[j] = client.client
_ = p.cache.Put(formCacheKey(addr, p.key, false), client.session)
Why do you put the existing client into the cache again?
Fixed. Added `cache.Purge()`.
And why did you purge the existing clients?
The `update` method might bring in a completely new list of nodes, causing the cache to contain some garbage for a while.

Do we really need this? It seems to bring unnecessary complexity. I suppose we are OK with having some old invalid cache values for a while (it seems there is no code that can use them and fail because of them).
fixed
@ -1913,4 +1933,1 @@
}
p.maxObjectSize = ni.MaxObjectSize()
go p.startRebalance(ctx)
We have to stop the previous rebalance routine before running a new one. Otherwise, goroutines will start to leak.
Added a `cancel` function.

@ -1920,0 +1940,4 @@
}
// Update is a method that lets you refresh the list of nodes without recreating the pool.
func (p *Pool) Update(ctx context.Context, prm []NodeParam) error {
Probably we have to mention that the context must be long-lived, otherwise rebalance will stop.
fixed
@ -1920,0 +1941,4 @@
// Update is a method that lets you refresh the list of nodes without recreating the pool.
func (p *Pool) Update(ctx context.Context, prm []NodeParam) error {
for _, pool := range p.innerPools {
We have to protect the `p.innerPools` field, because in `p.dial` we update it.

@ -1920,0 +1961,4 @@
return err
}
p.rebalanceParams.nodesParams = nodesParams
Should we protect this write with a mutex (also for the read on line 1951)?

You're right. Added a mutex.
d37d313460 to 4bfb6d87a3
4bfb6d87a3 to a23a3ef51b
@ -1905,4 +1937,1 @@
ctx, cancel := context.WithCancel(ctx)
p.cancel = cancel
p.closedCh = make(chan struct{})
p.innerPools = inner
We have to protect this write operation, because in `p.Update` we read this field.

Actually, we don't protect `p.innerPools` across the code at all currently.

The same goes for `p.rebalanceParams.nodesParams` in `updateInnerNodesHealth`.
Fixed. I've added `atomic.Pointer` to `Pool`, so we no longer need a mutex.
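A minimal sketch of the pattern this converges on: the exported `Pool` holds an `atomic.Pointer` to an inner `pool` value that is treated as read-only once published, regular operations read it with `Load()`, and an update publishes a rebuilt instance with a single pointer store (field names and the `Update` body here are simplified illustrations, not the merged code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pool is the inner, effectively immutable state: once published it is only read.
type pool struct {
	addresses []string
}

// Pool is the exported wrapper; all access goes through the atomic pointer.
type Pool struct {
	pool atomic.Pointer[pool]
}

// Addresses is an example read path: lock-free, always a consistent snapshot.
func (p *Pool) Addresses() []string {
	return p.pool.Load().addresses
}

// Update builds a fresh inner state and publishes it atomically;
// readers keep using the old snapshot until the store happens.
func (p *Pool) Update(addresses []string) {
	next := &pool{addresses: append([]string(nil), addresses...)}
	p.pool.Store(next)
}

func main() {
	var p Pool
	p.Update([]string{"grpc://node1:8080"})
	p.Update([]string{"grpc://node1:8080", "grpc://node2:8080"})
	fmt.Println(p.Addresses())
}
```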
@ -1866,2 +1869,4 @@
// See also InitParameters.SetClientRebalanceInterval.
func (p *Pool) Dial(ctx context.Context) error {
ctx, cancel := context.WithCancel(ctx)
p.cancel = cancel
Hm, I don't see any reason to create the context with cancellation and pass it out via `p.cancel = cancel`: there are no functions started within a goroutine here.

You're right. Fixed. Now, `p.cancel` is used to stop the `rebalance`.

a23a3ef51b to 858ef53b13
@ -1793,0 +1984,4 @@
func (p *Pool) Update(ctx context.Context, prm []NodeParam) error {
p.pool.Load().stopRebalance()
newPool := *p.pool.Load()
Here we make a copy of `pool`, so `pool.cancelLock` is also copied by value, which leads to a potential deadlock: we can copy a pool that contains a locked mutex (locked in `stopRebalance` when `Update` is used from multiple threads), but after copying by value, an unlock of the original mutex doesn't affect the copy.

Run this test multiple times to get a deadlock.
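This is the classic copy-a-locked-mutex hazard (`go vet`'s copylocks check flags the assignment). A generic illustration of the problem, not the test referenced above; running it ends in the runtime's "all goroutines are asleep" deadlock panic:

```go
package main

import "sync"

type pool struct {
	cancelLock sync.Mutex
}

func main() {
	p := &pool{}
	p.cancelLock.Lock()

	cp := *p              // copies the struct, including the locked mutex state
	p.cancelLock.Unlock() // unlocks the original only; cp.cancelLock stays locked

	cp.cancelLock.Lock() // blocks forever: nothing will ever unlock the copy
}
```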
Fixed. `p.cancel` is now a pointer. This solves the issue:

@ -1793,0 +1986,4 @@
newPool := *p.pool.Load()
if err := newPool.Update(ctx, prm); err != nil {
return err
In case of an error, the old rebalance routine will be stopped, but the new one will not be started. It seems this is not what we want.
Fixed
@ -1874,0 +2087,4 @@
if client, ok := existClients[NodeParam{address: addr, priority: params.priority}]; ok {
clients[j] = client
delete(existClients, NodeParam{address: addr, priority: params.priority})
continue
In case only existing clients are used, the `atLeastOneHealhty` flag will be false and we get an error even though all clients are healthy.

Fixed
@ -1921,0 +2163,4 @@
return err
}
err = nil
`err` here is already `nil`.
Fixed
@ -1921,0 +2164,4 @@
}
err = nil
for _, client := range existClients {
Can we add a comment that `existClients` here contains only clients that are no longer used?
Fixed
858ef53b13 to da6b36bf36
da6b36bf36 to 29b133505d
@ -1793,0 +1988,4 @@
p.pool.Load().stopRebalance()
defer p.pool.Load().startRebalance(ctx)
newPool := *p.pool.Load()
Implicit logic tied with dereferencing can be a source of bugs.
I suggest the following scheme:
Basically, `newPool.Update` on the line below can have the following signature: `func (*pool) Update(...) (*pool, bool, error)`. If it returns true, we replace the pointer.

This allows us to remove those `foreach` mutexes, because the old instance stays valid.

This allows us to have zero overhead for non-update threads when nothing has changed.

After this it will become more obvious whether we need atomics here (it is beautifully written, but yet another set of wrappers rubs me the wrong way).
Do you mean copying the `pool` structure or the `innerPools` field?

Actually, I don't like changing the pool itself. Probably we can make `innerPool` a more complex entity (a struct) that encapsulates rebalance and other things. Then we can create a new struct (as you suggested) and replace it, or somehow change it inside.

I mean, if there are no changes, there is no need to do anything, right? Otherwise, as you described, create a new struct and close the removed clients.
I made the logic involving dereferencing more explicit and added a check for the absence of changes.
By the way, can we remove the use of an atomic for the `pool` itself? We won't invoke `Pool.Update` often; more often we will just invoke other methods on the pool (which use `p.pool.Load()`). So we have something like read-heavy operations, and it would be nice to read the Pool pointer from multiple threads without any synchronization.

Removing the atomic for the pool itself won't greatly improve performance (4 picoseconds), since our operations are network-related. However, it increases the risk of errors.
benchmark results:
I fail to see how it increases the risk of errors; atomics are not a silver bullet, they just have different tradeoffs.

However, it seems even with mutexes we would need to take some `RLock` in all methods to protect access to the current pool instance; in terms of LOC written it is similar to a wrapper (this PR), so I do not mind having atomics here.

@ -1921,0 +2140,4 @@
}
func (p *pool) Update(ctx context.Context, prm []NodeParam) error {
foreach(p.innerPools, func(pool *innerPool) { pool.lock.Lock() })
It doesn't address my comment, we still take all the mutexes.
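A shape sketch of the scheme being asked for: the unexported update builds a brand-new `pool` from the current one, reports whether anything actually changed, and the wrapper replaces the pointer only in that case, so non-update callers never take a mutex. It reuses the `Pool`/`pool` wrapper from the earlier sketch; `sameNodes` is a hypothetical helper (a comparison like the sorting sketch further below), and the bodies are illustrative only.

```go
// update returns a new instance, a flag telling whether the node list changed,
// and an error; the receiver itself is never mutated.
func (p *pool) update(ctx context.Context, prm []NodeParam) (*pool, bool, error) {
	if p.sameNodes(prm) { // hypothetical comparison helper
		return p, false, nil // nothing changed: keep the old instance
	}
	next := &pool{ /* dial new nodes, reuse still-valid clients, ... */ }
	return next, true, nil
}

func (p *Pool) Update(ctx context.Context, prm []NodeParam) error {
	old := p.pool.Load()
	next, changed, err := old.update(ctx, prm)
	if err != nil {
		return err
	}
	if !changed {
		return nil // zero overhead for readers, old instance stays published
	}
	old.stopRebalance()
	p.pool.Store(next)
	next.startRebalance(ctx)
	return nil
}
```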
29b133505d to 38102795fe
38102795fe to bf46fa7e74
bf46fa7e74 to 11554102c5
@ -1793,0 +1987,4 @@
// Use a long-lived context to avoid early rebalance stop.
// Can interrupt an operation being performed on a node that was removed.
func (p *Pool) Update(ctx context.Context, prm []NodeParam) error {
p.pool.Load().stopRebalance()
This doesn't prevent multiple concurrent `Update` executions. And what you `Load()` can change between the lines.

What about assigning the result of `Load()` to a local variable?

fixed
@ -1868,0 +2084,4 @@
return nil
}
func (p *pool) dial(ctx context.Context, existClients map[NodeParam]client) error {
More like `existingClients`?

fixed
@ -1921,0 +2141,4 @@
}
for i, v := range a {
if !reflect.DeepEqual(v, b[i]) {
Please, let's not use reflect when it can be avoided; `nodesParam` has only 3 fields.

Also, inside `nodesParam` we should probably allow a different order of the `addresses` field if the `weights` are the same, because they all have the same priority and the connection is random anyway. Need some validation here @dkirillov. Anyway, this could be done in a separate issue; it seems your code reuses connections anyway.

One way to ensure this is to perform sorting based on `(weight, address)`.
Yes. Also, it's related to `priority`, because we use `sort.Slice` (rather than `sort.SliceStable`) here.
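One possible reflect-free comparison along the lines suggested: pair each address with its weight, sort by `(weight, address)` within every priority group, then compare field by field. Field names mirror the discussion; this is a sketch, not the PR's final code.

```go
package main

import (
	"fmt"
	"sort"
)

type nodesParam struct {
	priority  int
	addresses []string
	weights   []float64
}

type node struct {
	addr   string
	weight float64
}

// sortedNodes pairs addresses with weights and orders them by (weight, address),
// so two lists describing the same nodes compare equal regardless of input order.
func sortedNodes(p *nodesParam) []node {
	ns := make([]node, len(p.addresses))
	for i, a := range p.addresses {
		ns[i] = node{addr: a, weight: p.weights[i]}
	}
	sort.Slice(ns, func(i, j int) bool {
		if ns[i].weight != ns[j].weight {
			return ns[i].weight < ns[j].weight
		}
		return ns[i].addr < ns[j].addr
	})
	return ns
}

func equalNodeParams(a, b []*nodesParam) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i].priority != b[i].priority || len(a[i].addresses) != len(b[i].addresses) {
			return false
		}
		na, nb := sortedNodes(a[i]), sortedNodes(b[i])
		for j := range na {
			if na[j] != nb[j] {
				return false
			}
		}
	}
	return true
}

func main() {
	a := []*nodesParam{{priority: 1, addresses: []string{"x", "y"}, weights: []float64{1, 2}}}
	b := []*nodesParam{{priority: 1, addresses: []string{"y", "x"}, weights: []float64{2, 1}}}
	fmt.Println(equalNodeParams(a, b)) // true: same nodes, different order
}
```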
@ -1921,0 +2155,4 @@
existClients := make(map[NodeParam]client)
for i, pool := range newPool.rebalanceParams.nodesParams {
for j := range pool.weights {
existClients[NodeParam{address: pool.addresses[j], priority: pool.priority}] = newPool.innerPools[i].clients[j]
The key type in this map should be an `address`: if the client has just changed weight or priority, the connection should not be closed.
@ -1921,0 +2175,4 @@
return nil, false, err
}
//Removing outdated clients
No need for comment here, quite obvious from the code.
I don't think this is quite obvious. We need to look into the `dial` method to find out that the `existClients` map is being changed.

The comment says "remove", but the cycle below just closes clients.

To me, the comment doesn't mention the `existClients` processing in `dial`.
fixed
@ -1921,0 +2178,4 @@
//Removing outdated clients
for _, client := range existClients {
if clientErr := client.close(); clientErr != nil {
err = clientErr
May we use multierr for multiple errors here?
Also, it would be nice to use some dedicated error here, to avoid confusion for the caller, and to describe it in the function comment:

Added `errors.Join` to handle multiple errors.
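A sketch of the shape this settles on: close every outdated client, collect the failures with `errors.Join`, and wrap them in a dedicated error so the caller can recognize this case. The sentinel name and the `client` interface are illustrative, not identifiers from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// client is a stand-in for the pool's internal client type.
type client interface {
	close() error
}

// ErrCloseOutdatedClients is a hypothetical dedicated error; a real
// implementation would document it in the Update function comment.
var ErrCloseOutdatedClients = errors.New("close outdated clients")

// closeOutdated closes every client left in existingClients (i.e. clients
// that are no longer part of the new node list) and joins all failures.
func closeOutdated(existingClients map[string]client) error {
	var errs []error
	for addr, c := range existingClients {
		if err := c.close(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", addr, err))
		}
	}
	if len(errs) > 0 {
		return fmt.Errorf("%w: %w", ErrCloseOutdatedClients, errors.Join(errs...))
	}
	return nil
}

func main() {
	_ = closeOutdated(nil) // returns nil: nothing to close
}
```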
@ -2004,0 +2271,4 @@
p.cancelLock.Lock()
//stop rebalance
if p.cancel != nil {
Could you explain why we have `*context.CancelFunc` and not `context.CancelFunc`?

Previously, `*context.CancelFunc` protected against losing `context.CancelFunc()` during concurrent execution of `Update`. Now, `Update` includes a mutex.
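For context, a sketch of the rebalance start/stop pair this thread converges on: the cancel function is guarded by `cancelLock`, and stopping waits on `closedCh` so the old goroutine is gone before a new one starts (simplified; not the merged code):

```go
package main

import (
	"context"
	"sync"
	"time"
)

type pool struct {
	cancelLock sync.Mutex
	cancel     context.CancelFunc
	closedCh   chan struct{}
}

// startRebalance launches the routine and remembers how to stop it.
func (p *pool) startRebalance(ctx context.Context) {
	p.cancelLock.Lock()
	defer p.cancelLock.Unlock()

	rebalanceCtx, cancel := context.WithCancel(ctx)
	p.closedCh = make(chan struct{})
	p.cancel = cancel
	go p.rebalance(rebalanceCtx, p.closedCh)
}

// stopRebalance cancels the routine and waits for it to exit,
// so restarting it cannot leak goroutines.
func (p *pool) stopRebalance() {
	p.cancelLock.Lock()
	defer p.cancelLock.Unlock()

	if p.cancel != nil { // nil until the first start
		p.cancel()
		<-p.closedCh
	}
}

func (p *pool) rebalance(ctx context.Context, done chan struct{}) {
	defer close(done)
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// refresh node health here (omitted)
		}
	}
}

func main() {
	p := &pool{}
	p.startRebalance(context.Background())
	p.stopRebalance()
}
```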
11554102c5 to 55e4f7505d
@ -1789,10 +1789,230 @@ type resCreateSession struct {
// (e.g. sdkClient.IsErrContainerNotFound, sdkClient.IsErrObjectNotFound).
//
// See pool package overview to get some examples.
empty line
@ -1793,0 +1997,4 @@
pool := p.pool.Load()
pool.stopRebalance()
defer pool.startRebalance(ctx)
We store `newPool`, but we call `startRebalance` on the old instance, is it expected?

Yes, we need to stop the rebalance in the old version of the pool.
@ -1921,0 +2193,4 @@
return nil, false, err
}
//After newPool.dial(ctx, existingClients), existingClients will contain only outdated clients.
Let's use a space after `//`; this is how all comments in this repo are written.

Let's use a space after `//`; this is how all comments in this repo are written.

fixed
@ -2004,0 +2301,4 @@
p.cancel = cancel
go p.rebalance(rebalanceCtx)
p.cancelLock.Unlock()
Can it be in defer?
fixed
@ -2023,0 +2328,4 @@
p.cancelLock.Lock()
defer p.cancelLock.Unlock()
if p.cancel != nil {
Can `p.cancel` ever be nil?

fixed
55e4f7505d to 9529bb75a9
9529bb75a9 to d0537954c7
d0537954c7 to fcc8e869f4
fcc8e869f4 to e6d56f6be4
@ -1817,0 +1875,4 @@
func (p *Pool) Dial(ctx context.Context) error {
pool := p.pool.Load()
err := pool.Dial(ctx)
pool.startRebalance(ctx)
If `err != nil`, we are leaking a goroutine here (an `err` on dial should not require any explicit `Close`).

Already fixed
e6d56f6be4 to fa85c792cd
@ -1817,0 +1876,4 @@
pool := p.pool.Load()
err := pool.Dial(ctx)
if pool.cancel != nil {
Why not just check `if err != nil`? Then `Dial` can ensure that it doesn't start any routines unless nil is returned (this was the previous behaviour, I believe).

It seems this comment is also not relevant anymore. We initiate `rebalance` if `Dial` does not return an error.

9b05e56e52/pool/pool.go (L1877)
@ -1817,0 +1876,4 @@
pool := p.pool.Load()
err := pool.Dial(ctx)
if pool.cancel != nil {
Do we really need to check this (especially without thread-safe acquisition)? We can add a comment that calling this method multiple times leads to undefined behavior and that client code must not operate the pool this way.
fixed
@ -1817,0 +1879,4 @@
if pool.cancel != nil {
pool.cancel()
}
pool.startRebalance(ctx)
Why do we always start rebalance?
It seems we should return an error immediately after `pool.Dial` (if an error occurred).

fixed
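Putting the last few remarks together, the wrapper's `Dial` ends up with roughly this shape (reusing the `Pool`/`pool` layout from the earlier sketches; illustrative only, not the merged code):

```go
func (p *Pool) Dial(ctx context.Context) error {
	pool := p.pool.Load()
	if err := pool.Dial(ctx); err != nil {
		// Nothing has been started yet, so a failed Dial needs no explicit Close.
		return err
	}
	// The rebalance routine is started only after a successful Dial.
	pool.startRebalance(ctx)
	return nil
}
```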
@ -1817,0 +2023,4 @@
defer p.lockUpdate.Unlock()
pool := p.pool.Load()
pool.stopRebalance()
Do we really need to stop rebalancing even if the new pool equals the old one?
fixed
@ -1945,0 +2192,4 @@
return true
}
// It is necessary to prevent parallel operations on the *pool instance.
The comment should have the following format: 'Update ...'
fixed
@ -1945,0 +2193,4 @@
}
// It is necessary to prevent parallel operations on the *pool instance.
func (p *pool) Update(ctx context.Context, prm []NodeParam) (*pool, bool, error) {
Why is this method exported?
fixed
fa85c792cd to 3984aeafb2
3984aeafb2 to fdc7f95a33
fdc7f95a33 to 6d87fb5716
@ -1817,0 +1876,4 @@
pool := p.pool.Load()
err := pool.Dial(ctx)
if err != nil {
Can we avoid putting an empty line above?
Probably we can write `if err := pool.Dial(ctx); err != nil {`.
fixed
@ -1817,0 +1881,4 @@
}
if pool.cancel != nil {
pool.cancelLock.Lock()
Actually, we should guard `if pool.cancel != nil {` too.

Yes, this check is necessary. On the first invocation of `Dial`, `pool.cancel` will be `nil`. If `Dial` is called again, we can properly stop the rebalance.

I meant that if we guard `pool.cancel()` with a mutex, we also must guard `pool.cancel != nil` with a mutex.

Oh, you're correct. I've corrected it.
7b118ba668 to b88567767c
b88567767c to 043586501f
043586501f to 9b05e56e52
@fyrchik @dkirillov Can you resolve those of your comments that are already fixed? I want to join the review party, but it is a bit hard to track what is still being discussed and what is not at this point.

@alexvanin It was harder than I thought, because each time I click "resolve conversation" the page reloads and scrolls to the top, what a wonderful UX!
@alexvanin @dkirillov This PR is waiting for your approval
We still haven't resolved this comment: #204 (comment)
@ -1817,0 +1817,4 @@
pool atomic.Pointer[pool]
// Ensures that pool.cancel will not be lost due to concurrent calls to Update().
lockUpdate sync.Mutex
If we have this mutex that protects `Pool` from concurrent `Update` invocations, do we really need the additional atomic that we use for the same purpose?

I understand that this allows pool operations not to be blocked during the update procedure, but it looks too complicated to have two different sync primitives that intersect in protecting the pool.
Fixed. Used `atomic.Swap` instead of `atomic.Store`.

9b05e56e52 to 977261bb7b
977261bb7b to 5cb18bf6c3
5cb18bf6c3 to 9b15935a7b
9b15935a7b to ce038b04df
Please, rebase.