[#305] tree/pool: Add flag to use net map to prioritize tree services #305

Merged
alexvanin merged 1 commit from mbiryukova/frostfs-sdk-go:feature/tree_pool_priority into master 2024-12-18 06:51:00 +00:00
Member

The new version of tree/pool selects a tree service connection for each request based on the current net map and the container placement policy.

Signed-off-by: Marina Biryukova m.biryukova@yadro.com

mbiryukova self-assigned this 2024-12-05 11:10:23 +00:00
mbiryukova added 1 commit 2024-12-05 11:10:23 +00:00
[#xxx] tree/pool: Use net map for prioritize tree services
Some checks failed
DCO / DCO (pull_request) Failing after 1m19s
Tests and linters / Tests (pull_request) Successful in 2m2s
Tests and linters / Lint (pull_request) Successful in 2m51s
4d14c53489
New version of tree/pool selects tree service connection to make request based on the current net map and container placement policy

Signed-off-by: Marina Biryukova <m.biryukova@yadro.com>
mbiryukova force-pushed feature/tree_pool_priority from 4d14c53489 to e25995e12a 2024-12-05 11:11:45 +00:00 Compare
mbiryukova changed title from [#xxx] tree/pool: Use net map for prioritize tree services to [#305] tree/pool: Use net map to prioritize tree services 2024-12-05 11:12:04 +00:00
mbiryukova requested review from storage-core-committers 2024-12-05 11:15:40 +00:00
mbiryukova requested review from storage-core-developers 2024-12-05 11:15:41 +00:00
mbiryukova requested review from storage-sdk-committers 2024-12-05 11:15:42 +00:00
mbiryukova requested review from storage-sdk-developers 2024-12-05 11:15:44 +00:00
mbiryukova requested review from storage-services-committers 2024-12-05 11:15:45 +00:00
mbiryukova requested review from storage-services-developers 2024-12-05 11:15:46 +00:00
acid-ant requested changes 2024-12-05 11:39:26 +00:00
go.mod Outdated
@ -3,6 +3,7 @@ module git.frostfs.info/TrueCloudLab/frostfs-sdk-go
go 1.22
require (
git.frostfs.info/TrueCloudLab/frostfs-api-go/v2 v2.16.0
Member

We already merged `sdk-go` and `api-go` here: https://git.frostfs.info/TrueCloudLab/frostfs-sdk-go/pulls/281. This new dependency looks redundant.
mbiryukova force-pushed feature/tree_pool_priority from e25995e12a to a272af911e 2024-12-06 04:29:07 +00:00 Compare
dkirillov reviewed 2024-12-06 14:49:04 +00:00
@ -0,0 +1,110 @@
package network
Member

If this file is just a copy of the one from storage (https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/7e542906ef9085a792e5ec5f7185bfdd9acecedb/pkg/network/address.go), should we place it in a more common package rather than `pool/tree/network`? For example, `pkg/network` or just `network`.
dkirillov marked this conversation as resolved
@ -725,4 +731,2 @@
}
// startRebalance runs loop to monitor tree client healthy status.
func (p *Pool) startRebalance(ctx context.Context) {
Member

Why do we drop rebalancing?
dkirillov marked this conversation as resolved
@ -76,0 +81,4 @@
}
type PlacementPolicySource interface {
PlacementPolicy(cnrID cid.ID) (netmap.PlacementPolicy, bool)
Member

Why isn't this method symmetrical to `NetMapSnapshot`? I mean, why do we try to get the placement policy in the tree pool (see `Pool.getContainerPlacementPolicy`) if `PlacementPolicySource` couldn't provide it to us?
Author
Member

In most cases the placement policy will be provided from the s3-gw cache, but in other cases this information should be requested from storage. Maybe we should do the same for the netmap, but I'm not sure.
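
For readers following the thread, a minimal sketch of the asymmetry under discussion, reconstructed from the hunks above (import paths assumed from frostfs-sdk-go; the merged signatures may differ):

```go
import (
	"context"

	cid "git.frostfs.info/TrueCloudLab/frostfs-sdk-go/container/id"
	"git.frostfs.info/TrueCloudLab/frostfs-sdk-go/netmap"
)

// NetMapSource always fetches a snapshot and never reports a cache miss.
type NetMapSource interface {
	NetMapSnapshot(ctx context.Context) (netmap.NetMap, error)
}

// PlacementPolicySource reports a cache miss via the bool (e.g. the s3-gw
// cache); on false the pool falls back to requesting the policy from storage.
type PlacementPolicySource interface {
	PlacementPolicy(cnrID cid.ID) (netmap.PlacementPolicy, bool)
}
```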
dkirillov marked this conversation as resolved
@ -844,1 +774,3 @@
if cl, err = p.innerPools[indexI].clients[indexJ].serviceClient(); err == nil {
treeCl, ok := p.clientMap[cnrNode.Hash()]
if !ok {
treeCl, err = p.getNewTreeClient(ctx, cnrNode)
Member

Shouldn't we respect `p.innerPools` here?
Author
Member

`p.innerPools` no longer contains tree service clients, if I understand you correctly.
dkirillov marked this conversation as resolved
mbiryukova force-pushed feature/tree_pool_priority from a272af911e to 00c96617c2 2024-12-12 09:02:20 +00:00 Compare
mbiryukova changed title from [#305] tree/pool: Use net map to prioritize tree services to [#305] tree/pool: Add flag to use net map to prioritize tree services 2024-12-12 10:32:35 +00:00
mbiryukova force-pushed feature/tree_pool_priority from 00c96617c2 to c50f6022ba 2024-12-12 10:40:29 +00:00 Compare
dkirillov reviewed 2024-12-13 14:11:12 +00:00
@ -75,1 +77,4 @@
maxRequestAttempts int
netMapSource NetMapSource
placementPolicySource PlacementPolicySource
v2 bool
Member

Can we use a more meaningful name for this param?
dkirillov marked this conversation as resolved
@ -76,0 +84,4 @@
NetMapSnapshot(ctx context.Context) (netmap.NetMap, error)
}
type PlacementPolicySource interface {
Member

question/suggestion: Is it worth unioning these two interfaces into one? Maybe then we don't need the `v2` field at all.
Author
Member

I made one interface and dropped the `v2` field.
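
Later hunks show the field `netMapInfoSource NetMapInfoSource`, so the union presumably looks something like this (a sketch; the exact method set is an assumption):

```go
// NetMapInfoSource unifies the two sources above, making the pool's single
// dependency explicit and removing the need for a separate v2 flag.
type NetMapInfoSource interface {
	NetMapSnapshot(ctx context.Context) (netmap.NetMap, error)
	PlacementPolicy(ctx context.Context, cnrID cid.ID) (netmap.PlacementPolicy, error)
}
```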
dkirillov marked this conversation as resolved
@ -253,3 +288,2 @@
func (p *Pool) Dial(ctx context.Context) error {
inner := make([]*innerPool, len(p.rebalanceParams.nodesGroup))
var atLeastOneHealthy bool
if !p.v2 {
Member

nitpick: It would be easier to read if we wrote

```golang
if p.v2 {
    return nil
}
// ...
```

or extracted the dial for "v1" into a separate function.
dkirillov marked this conversation as resolved
@ -334,6 +372,21 @@ func (x *InitParameters) SetMaxRequestAttempts(maxAttempts int) {
x.maxRequestAttempts = maxAttempts
}
// UseV2 sets flag for using net map to prioritize tree services.
Member

We should mention that if this param is set, then `AddNode` won't have any effect.
dkirillov marked this conversation as resolved
@ -337,0 +383,4 @@
}
// SetPlacementPolicySource sets implementation of interface to get container placement policy.
func (x *InitParameters) SetPlacementPolicySource(placementPolicySource PlacementPolicySource) {
Member

Since we have the `UseV2` parameter, we should mention that these two setters only have an effect if the `UseV2` param is set.
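
A sketch of the doc comments being requested at this point in the review (the `UseV2` setter shape is assumed; this API was later reworked when the interfaces were unified):

```go
// UseV2 sets the flag for using the net map to prioritize tree services.
// When this flag is set, AddNode has no effect.
func (x *InitParameters) UseV2() {
	x.v2 = true
}

// SetPlacementPolicySource sets the implementation of the interface used to
// get the container placement policy. It only has an effect if UseV2 is set.
func (x *InitParameters) SetPlacementPolicySource(placementPolicySource PlacementPolicySource) {
	x.placementPolicySource = placementPolicySource
}
```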
dkirillov marked this conversation as resolved
@ -866,10 +930,120 @@ LOOP:
return finErr
}
func (p *Pool) requestWithRetryContainerNodes(ctx context.Context, cid cid.ID, fn func(client *rpcclient.Client) error) error {
Member

nitpick: Can we use a variable name that doesn't collide with package names? `cid` -> `cnrID`
dkirillov marked this conversation as resolved
@ -869,0 +966,4 @@
if attempts == 0 {
break LOOP
}
attempts--
Member

I'm not sure we should update this counter before we actually make a request to this client. The reason is that we don't remember the last healthy client, and every request processes `cnrNodes` in the same order. So with 4 nodes (2 of them unhealthy) and 2 max attempts, we will get errors for 1 or 2 epochs (until the bad nodes disappear from the netmap). The previous tree pool version didn't have such a delay.
Owner

I agree with this statement. However, I don't think we are going to have such a small number of retries, but it is worth taking into account anyway.
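
A schematic of the concern (not the actual pool code): because `cnrNodes` always comes back in the same placement order, unhealthy nodes at the front of the list consume the attempt budget on every request.

```go
// With maxRequestAttempts = 2 and the two front nodes unhealthy, every
// request burns both attempts on the same dead nodes until they drop
// out of the net map.
attempts := p.maxRequestAttempts
for _, cnrNode := range cnrNodes { // same order on every request
	if attempts == 0 {
		break
	}
	attempts-- // spent even if the node turns out to be unreachable
	// ... obtain a client for cnrNode and run fn(cl) ...
}
```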
@ -869,0 +968,4 @@
}
attempts--
treeCl, ok := p.clientMap[cnrNode.Hash()]
Member

We must access `p.clientMap` under a mutex because we sometimes update it.
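
A sketch of the guarded accessors (the getter/setter names appear in later hunks; the exact merged layout may differ):

```go
// Both paths that touch p.clientMap go through the pool's mutex.
func (p *Pool) getClientFromMap(hash uint64) (client, bool) {
	p.mutex.RLock()
	defer p.mutex.RUnlock()
	cl, ok := p.clientMap[hash]
	return cl, ok
}

func (p *Pool) setClientToMap(hash uint64, cl client) {
	p.mutex.Lock()
	defer p.mutex.Unlock()
	p.clientMap[hash] = cl
}
```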
dkirillov marked this conversation as resolved
@ -871,2 +1038,4 @@
}
func shouldRedial(err error) bool {
if err == nil || errors.Is(err, ErrNodeAccessDenied) || errors.Is(err, ErrNodeNotFound) || errors.Is(err, errNodeEmptyResult) {
Member

Probably we should be more careful with the list of errors on which we redial. At the very least we shouldn't count the `context canceled` error (similar to the `handleError` function in the object pool).
Owner

`shouldRedial` in general doesn't look very robust, but I don't have anything to suggest here, unfortunately.
dkirillov marked this conversation as resolved
mbiryukova force-pushed feature/tree_pool_priority from c50f6022ba to aa9d8ae72a 2024-12-16 08:56:59 +00:00 Compare
alexvanin requested review from acid-ant 2024-12-16 11:25:53 +00:00
alexvanin approved these changes 2024-12-16 11:42:19 +00:00
Dismissed
alexvanin left a comment
Owner

Looks very compact for such a big change, nice job! LGTM.
@ -0,0 +1,110 @@
package network
Owner

I suggest mentioning in the header that this code is taken from the frostfs-node `pkg/network` package. Maybe one day it could be imported back. By the way, let's import some [tests](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/network/address_test.go) too.
Owner

Refs #35 (if we can immediately reuse this PR from node, then `Close #35`)
Owner

I would also like to use this across all the clients/pools, though.
Author
Member

There are other files in the frostfs-node package; should they be added too?
Owner

If they are not needed by this PR, we may leave them for #35
dkirillov approved these changes 2024-12-16 12:13:51 +00:00
Dismissed
fyrchik requested changes 2024-12-16 12:22:32 +00:00
Dismissed
go.mod Outdated
@ -14,6 +14,7 @@ require (
github.com/klauspost/reedsolomon v1.12.1
github.com/mailru/easyjson v0.7.7
github.com/mr-tron/base58 v1.2.0
github.com/multiformats/go-multiaddr v0.12.1
Owner

The latest version is v0.14.0; why have you stopped at this one?
Author
Member

Because it's used in frostfs-node
Member

> Because it's used in frostfs-node

If you are concerned about building `frostfs-node`, then it's fine - Minimal Version Selection is able to handle this
Owner

Let's use the latest one, no problem.
fyrchik marked this conversation as resolved
@ -99,1 +107,3 @@
startIndicesMtx sync.RWMutex
netMapInfoSource NetMapInfoSource
mutex sync.RWMutex
Owner

Please add a comment about what this mutex protects.
fyrchik marked this conversation as resolved
@ -251,6 +269,10 @@ func NewPool(options InitParameters) (*Pool, error) {
//
// See also InitParameters.SetClientRebalanceInterval.
func (p *Pool) Dial(ctx context.Context) error {
if p.netMapInfoSource != nil {
Owner

It would've been surprising to me if I weren't familiar with the pool internals. What prevents us from not calling `Dial` if we provide `netmapInfoSource`? The doc comment should be changed at least.
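
The shape being discussed, for context (the early `return nil` is an assumption based on the hunk):

```go
// Dial pre-dials the statically configured nodes. With a netMapInfoSource
// the pool discovers nodes from the net map per request, so there is
// nothing to pre-dial.
func (p *Pool) Dial(ctx context.Context) error {
	if p.netMapInfoSource != nil {
		return nil
	}
	// ... dial p.rebalanceParams.nodesGroup as in the old path ...
	return nil
}
```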
fyrchik marked this conversation as resolved
@ -869,0 +997,4 @@
return cl, ok
}
func (p *Pool) setClientToMap(hash uint64, cl client) {
Owner

How about `addClientToMap` or `setClientInMap`? `set client to map` has a different meaning.
fyrchik marked this conversation as resolved
@ -871,2 +1044,4 @@
}
func shouldRedial(ctx context.Context, err error) bool {
if err == nil || errors.Is(err, ErrNodeAccessDenied) || errors.Is(err, ErrNodeNotFound) || errors.Is(err, errNodeEmptyResult) || errors.Is(ctx.Err(), context.Canceled) {
Owner

`errors.Is(ctx.Err(), context.Canceled)` is unexpected to me here, because it implies `err != nil` even though `err` itself is not checked. Also, `context.DeadlineExceeded` is somewhat similar; do we need to add it here?
Author
Member

This check was added by analogy with the [object pool](https://git.frostfs.info/TrueCloudLab/frostfs-sdk-go/src/commit/dcd4ea334e068afbdb103f49a8e59afd16238c32/pool/pool.go#L1353).
Member

It seems any error from `ctx.Err()` makes a dialing call with that `ctx` non-redialable, as the context is already done. Could `ctx.Err() != nil` be enough?
Owner

If it is copypaste, then ok.
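
A sketch of the suggested simplification (`ctx.Err() != nil` instead of matching `context.Canceled` only); an assumption, not necessarily the merged code:

```go
// shouldRedial reports whether the connection to the node should be
// re-established next time. A done context makes redialing pointless,
// whatever the error was.
func shouldRedial(ctx context.Context, err error) bool {
	if err == nil || errors.Is(err, ErrNodeAccessDenied) || errors.Is(err, ErrNodeNotFound) ||
		errors.Is(err, errNodeEmptyResult) || ctx.Err() != nil {
		return false
	}
	return true
}
```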
fyrchik marked this conversation as resolved
@ -362,0 +506,4 @@
return res
}
func reorderKeys(nodeKeys []*keys.PrivateKey, cnrNodes [][]netmap.NodeInfo) []*keys.PrivateKey {
Owner

I see this function being used in tests once, but I don't understand why it is here. We get the placement policy, then the container nodes, then work. In real code this function won't exist, so why is it needed for tests?
Author
Member

Because after we get the placement vector, nodes can be reordered. It's used to reorder node keys to set errors in tests.
Owner

You have `nodeKeys []*keys.PrivateKey`, but it seems the only way you use it is to sift it through the HRW hash and filter client nodes. Now the question: can we omit this array completely? It would be `makeClientMap([]netmap.NodeInfo)`, `checkClientUsage(t, p, ...netmap.NodeInfo)`, etc. I see no need for _private_ keys currently.
fyrchik marked this conversation as resolved
aarifullin reviewed 2024-12-17 10:24:21 +00:00
@ -869,0 +969,4 @@
if cl, err = treeCl.serviceClient(); err == nil {
err = fn(cl)
}
if shouldRedial(ctx, err) {
Member

Let me highlight some points about this method.

Are you sure we need to check `ctx` within this invocation? It looks like you are trying to check the error of the client call outside of the client call. I mean, `err` will be assigned `context.Canceled` or `context.DeadlineExceeded` anyway after `err = fn(cl)`. I believe we should check `ctx.Err() != nil` before `getNewTreeClient` to avoid an invocation that will definitely fail because the context is done. So, currently, I don't see any reason to check `ctx` within `shouldRedial`.

Also, the name `shouldRedial` is a bit confusing. WDYT: could `isRedialableError` be better for this purpose?
Author
Member

Why will `err` be assigned `context.Canceled` or `context.DeadlineExceeded` anyway after `err = fn(cl)`?
Member

I suppose the context is going to be propagated through the functor:

```go
err := p.requestWithRetry(ctx, prm.CID, func(client *rpcclient.Client) (inErr error) {
	resp, inErr = rpcapi.Add(client, request, rpcclient.WithContext(ctx))
	return handleError("failed to add node", inErr)
})
```

But, anyway, I agree we should check the error `ctx.Err()`, just not within `shouldRedial`, and before `getNewTreeClient`.
Member

Okay, let me clarify what I really meant:

1. `err` can be assigned context cancellation errors as the context is propagated through the functor, but that may happen on stream opening, not on `Send`/`Recv` messages.
2. From my POV, `shouldRedial` shouldn't check the context error. Instead, this can be checked separately, before the instantiation of a tree client.
3. `shouldRedial` is better renamed to `isRedialableError` :)
Author
Member

I think redialable is not quite appropriate. The purpose of this function is to check whether the error is expected or whether the connection to the node should be established anew next time.
Member

Okay, it's up to you :)
aarifullin marked this conversation as resolved
mbiryukova force-pushed feature/tree_pool_priority from aa9d8ae72a to ce16b90b5f 2024-12-17 13:38:20 +00:00 Compare
mbiryukova dismissed alexvanin's review 2024-12-17 13:38:21 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

mbiryukova dismissed dkirillov's review 2024-12-17 13:38:21 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

alexvanin reviewed 2024-12-18 06:20:05 +00:00
@ -0,0 +1,112 @@
// NOTE: code is taken from https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/network/address.go
Owner

It's better to link the exact revision instead of a branch:

```
https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/7e3f2fca47bac063ced310ef4fd8042e32dd0c80/pkg/network/address.go
```
alexvanin approved these changes 2024-12-18 06:22:54 +00:00
Dismissed
alexvanin requested review from fyrchik 2024-12-18 06:24:00 +00:00
mbiryukova force-pushed feature/tree_pool_priority from ce16b90b5f to 42a0fc8c13 2024-12-18 06:24:50 +00:00 Compare
mbiryukova dismissed alexvanin's review 2024-12-18 06:24:50 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

fyrchik approved these changes 2024-12-18 06:25:10 +00:00
alexvanin approved these changes 2024-12-18 06:50:28 +00:00
alexvanin merged commit 42a0fc8c13 into master 2024-12-18 06:51:00 +00:00
alexvanin deleted branch feature/tree_pool_priority 2024-12-18 06:51:30 +00:00