[#305] tree/pool: Add flag to use net map to prioritize tree services #305
The new version of tree/pool selects a tree service connection for each request based on the current net map and the container placement policy.
Signed-off-by: Marina Biryukova m.biryukova@yadro.com
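For orientation, here is a rough sketch of the selection path described above. The NetMapSource and PlacementPolicySource methods are quoted from the review hunks below; the helper itself, its name, and the Encode/ContainerNodes plumbing are assumptions rather than the PR's actual code.

package tree

import (
	"context"
	"crypto/sha256"
	"errors"

	cid "git.frostfs.info/TrueCloudLab/frostfs-sdk-go/container/id"
	"git.frostfs.info/TrueCloudLab/frostfs-sdk-go/netmap"
)

// containerNodes (hypothetical helper) resolves the HRW-ordered container
// node vectors from the placement policy and the current net map.
func containerNodes(ctx context.Context, nm NetMapSource, pp PlacementPolicySource, cnrID cid.ID) ([][]netmap.NodeInfo, error) {
	policy, ok := pp.PlacementPolicy(cnrID)
	if !ok {
		return nil, errors.New("placement policy not available")
	}
	snapshot, err := nm.NetMapSnapshot(ctx)
	if err != nil {
		return nil, err
	}
	pivot := make([]byte, sha256.Size)
	cnrID.Encode(pivot)
	// Requests are then retried over these vectors in order, reusing
	// cached tree clients (keyed by node hash) or dialing new ones.
	return snapshot.ContainerNodes(policy, pivot)
}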
4d14c53489 to e25995e12a
Title changed from "[#xxx] tree/pool: Use net map for prioritize tree services" to "[#305] tree/pool: Use net map to prioritize tree services"
@ -3,6 +3,7 @@ module git.frostfs.info/TrueCloudLab/frostfs-sdk-go
go 1.22
require (
git.frostfs.info/TrueCloudLab/frostfs-api-go/v2 v2.16.0
We already merged sdk-go and api-go here: #281. Looks like this new dependency is redundant.
e25995e12a to a272af911e
@ -0,0 +1,110 @@
package network
If this file is just a copy of the one from storage (7e542906ef/pkg/network/address.go), should we place it in a more common package rather than pool/tree/network? For example, pkg/network or just network.
@ -725,4 +731,2 @@
}
// startRebalance runs loop to monitor tree client healthy status.
func (p *Pool) startRebalance(ctx context.Context) {
Why do we drop rebalancing?
@ -76,0 +81,4 @@
}
type PlacementPolicySource interface {
PlacementPolicy(cnrID cid.ID) (netmap.PlacementPolicy, bool)
Why isn't this method symmetrical to NetMapSnapshot? I mean, why do we try to get the placement policy in the tree pool (see Pool.getContainerPlacementPolicy) if PlacementPolicySource couldn't provide it to us?

In most cases the placement policy will be provided from the s3-gw cache, but in other cases this information should be requested from storage. Maybe we should do the same for the netmap, but I'm not sure.
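A minimal sketch of the fallback described in that reply (getContainerPlacementPolicy is named in the comment above; the getContainer helper and the exact flow are assumptions):

// Try the configured source first (e.g. the s3-gw cache); if it cannot
// provide the policy, fall back to fetching the container from storage.
func (p *Pool) getContainerPlacementPolicy(ctx context.Context, cnrID cid.ID) (netmap.PlacementPolicy, error) {
	if policy, ok := p.placementPolicySource.PlacementPolicy(cnrID); ok {
		return policy, nil
	}
	cnr, err := p.getContainer(ctx, cnrID) // hypothetical storage request
	if err != nil {
		return netmap.PlacementPolicy{}, err
	}
	return cnr.PlacementPolicy(), nil
}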
@ -844,1 +774,3 @@
if cl, err = p.innerPools[indexI].clients[indexJ].serviceClient(); err == nil {
treeCl, ok := p.clientMap[cnrNode.Hash()]
if !ok {
treeCl, err = p.getNewTreeClient(ctx, cnrNode)
Shouldn't we respect this p.innerPools?

p.innerPools now doesn't consist of tree service clients, if I understand you correctly.

a272af911e to 00c96617c2
Title changed from "[#305] tree/pool: Use net map to prioritize tree services" to "[#305] tree/pool: Add flag to use net map to prioritize tree services"
00c96617c2 to c50f6022ba
@ -75,1 +77,4 @@
maxRequestAttempts int
netMapSource NetMapSource
placementPolicySource PlacementPolicySource
v2 bool
Can we use a more meaningful name for this param?
@ -76,0 +84,4 @@
NetMapSnapshot(ctx context.Context) (netmap.NetMap, error)
}
type PlacementPolicySource interface {
question/suggestion: Is it worth uniting these two interfaces into one? Maybe then we don't need the v2 field at all.

I made one interface and dropped the v2 field.
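A sketch of what the unified interface could look like (the name NetMapInfoSource shows up in later hunks of this review; the method signatures are taken from the two interfaces above):

// NetMapInfoSource unites both sources, so the pool can switch on
// netMapInfoSource != nil instead of a separate v2 flag.
type NetMapInfoSource interface {
	// NetMapSnapshot returns the current network map.
	NetMapSnapshot(ctx context.Context) (netmap.NetMap, error)
	// PlacementPolicy returns the container placement policy,
	// e.g. from the s3-gw cache.
	PlacementPolicy(cnrID cid.ID) (netmap.PlacementPolicy, bool)
}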
@ -253,3 +288,2 @@
func (p *Pool) Dial(ctx context.Context) error {
inner := make([]*innerPool, len(p.rebalanceParams.nodesGroup))
var atLeastOneHealthy bool
if !p.v2 {
nitpick: It would be easier to read if we restructured this condition, or extracted the dial logic for "v1" into a separate function.
@ -334,6 +372,21 @@ func (x *InitParameters) SetMaxRequestAttempts(maxAttempts int) {
x.maxRequestAttempts = maxAttempts
}
// UseV2 sets flag for using net map to prioritize tree services.
We should mention that if this param is set, then AddNode won't have any effect.
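A possible doc-comment wording (hypothetical; the exact phrasing is up to the author):

// UseV2 sets flag for using net map to prioritize tree services.
// Note: when this flag is set, nodes added via AddNode have no effect;
// connections are selected from the current net map instead.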
@ -337,0 +383,4 @@
}
// SetPlacementPolicySource sets implementation of interface to get container placement policy.
func (x *InitParameters) SetPlacementPolicySource(placementPolicySource PlacementPolicySource) {
While we have the UseV2 parameter, we should mention that these two setters will only have an effect if the UseV2 param is set.

@ -866,10 +930,120 @@ LOOP:
return finErr
}
func (p *Pool) requestWithRetryContainerNodes(ctx context.Context, cid cid.ID, fn func(client *rpcclient.Client) error) error {
nitpick: Can we use a variable name that doesn't collide with package names? cid -> cnrID
@ -869,0 +966,4 @@
if attempts == 0 {
break LOOP
}
attempts--
I'm not sure we should update this counter until we actually make a request to this client. The reason is that we don't remember the last healthy client, and every request processes cnrNodes in the same order. So if we have 4 nodes (2 of them unhealthy) and 2 max attempts, we will get errors for 1 or 2 epochs (until the bad nodes disappear from the netmap). The previous tree pool version doesn't have such a delay.

I agree with this statement. However, I don't think we are going to have such a small number of retries, but it is worth taking into account anyway.
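A sketch of the suggested fix (assumed, not the PR's final code): consume an attempt only right before a real request, so nodes that fail to dial don't eat the retry budget. containerNodes is a hypothetical helper; the other names come from the review hunks.

func (p *Pool) requestWithRetryContainerNodes(ctx context.Context, cnrID cid.ID, fn func(client *rpcclient.Client) error) error {
	cnrNodes, err := p.containerNodes(ctx, cnrID) // hypothetical helper
	if err != nil {
		return err
	}
	attempts := p.maxRequestAttempts
LOOP:
	for _, vector := range cnrNodes {
		for _, cnrNode := range vector {
			treeCl, ok := p.getClientFromMap(cnrNode.Hash())
			if !ok {
				var dialErr error
				if treeCl, dialErr = p.getNewTreeClient(ctx, cnrNode); dialErr != nil {
					err = dialErr
					continue // a failed dial does not consume an attempt
				}
				p.addClientToMap(cnrNode.Hash(), treeCl)
			}
			if attempts == 0 {
				break LOOP
			}
			attempts-- // decrement only when we actually call the client
			var cl *rpcclient.Client
			if cl, err = treeCl.serviceClient(); err == nil {
				err = fn(cl)
			}
			if err == nil {
				return nil
			}
		}
	}
	return err
}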
@ -869,0 +968,4 @@
}
attempts--
treeCl, ok := p.clientMap[cnrNode.Hash()]
We must access p.clientMap under a mutex because we update it sometimes.
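Something along these lines (a sketch; getClientFromMap appears later in this review, and addClientToMap is the name suggested below):

func (p *Pool) getClientFromMap(hash uint64) (client, bool) {
	p.mutex.RLock()
	defer p.mutex.RUnlock()
	cl, ok := p.clientMap[hash]
	return cl, ok
}

func (p *Pool) addClientToMap(hash uint64, cl client) {
	p.mutex.Lock()
	p.clientMap[hash] = cl
	p.mutex.Unlock()
}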
@ -871,2 +1038,4 @@
}
func shouldRedial(err error) bool {
if err == nil || errors.Is(err, ErrNodeAccessDenied) || errors.Is(err, ErrNodeNotFound) || errors.Is(err, errNodeEmptyResult) {
Probably we should be more careful with the list of errors on which we redial. At least we shouldn't count the context canceled error (similar to the handleError function in the object pool).

shouldRedial in general doesn't look very robust, but I don't have anything to suggest here, unfortunately.

c50f6022ba to aa9d8ae72a
Looks very compact for such a big change, nice job! LGTM.
@ -0,0 +1,110 @@
package network
I suggest mentioning in the header that this code is taken from the frostfs-node pkg/network package. Maybe one day it could be imported back. By the way, let's import some tests too.

Refs #35 (if we can immediately reuse this PR from node, then Close #35). I would also like to use this across all the clients/pools, though.

There are other files in the frostfs-node package, should they be added too?

If they are not needed by this PR, we may leave them for #35.
@ -14,6 +14,7 @@ require (
github.com/klauspost/reedsolomon v1.12.1
github.com/mailru/easyjson v0.7.7
github.com/mr-tron/base58 v1.2.0
github.com/multiformats/go-multiaddr v0.12.1
The latest version is v0.14.0, why have you stopped at this one?

Because it's used in frostfs-node.

If you are concerned about building frostfs-node, then it's fine: Minimal Version Selection is able to handle this. Let's use the latest one, no problem.
@ -99,1 +107,3 @@
startIndicesMtx sync.RWMutex
netMapInfoSource NetMapInfoSource
mutex sync.RWMutex
Please add a comment about what is protected by this mutex.
@ -251,6 +269,10 @@ func NewPool(options InitParameters) (*Pool, error) {
//
// See also InitParameters.SetClientRebalanceInterval.
func (p *Pool) Dial(ctx context.Context) error {
if p.netMapInfoSource != nil {
It would've been surprising to me if I weren't familiar with the pool internals. What prevents us from not calling Dial at all if we provide netMapInfoSource? The doc-comment should be changed at least.
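One possible shape for Dial and its doc-comment after this change (a sketch; dialNodeGroups is a hypothetical name for the extracted legacy path):

// Dial establishes connections to the tree servers from the node group.
// When a NetMapInfoSource is provided, tree services are selected and
// dialed lazily per request from the current net map, so Dial returns
// immediately.
func (p *Pool) Dial(ctx context.Context) error {
	if p.netMapInfoSource != nil {
		return nil // clients are created on demand per container node
	}
	return p.dialNodeGroups(ctx) // hypothetical extracted legacy path
}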
@ -869,0 +997,4 @@
return cl, ok
}
func (p *Pool) setClientToMap(hash uint64, cl client) {
How about addClientToMap or setClientInMap? "set client to map" has a different meaning.

@ -871,2 +1044,4 @@
}
func shouldRedial(ctx context.Context, err error) bool {
if err == nil || errors.Is(err, ErrNodeAccessDenied) || errors.Is(err, ErrNodeNotFound) || errors.Is(err, errNodeEmptyResult) || errors.Is(ctx.Err(), context.Canceled) {
errors.Is(ctx.Err(), context.Canceled) is unexpected to me here, because it has err != nil even though that err is not checked. Also, context.DeadlineExceeded is somewhat similar, do we need to add it here?

This check was added by analogy with the object pool.

It seems any error from ctx.Err() makes any dialing call with ctx not redialable, as ctx is already Done(). Could ctx.Err() != nil be enough?

If it is copypaste, then ok.
@ -362,0 +506,4 @@
return res
}
func reorderKeys(nodeKeys []*keys.PrivateKey, cnrNodes [][]netmap.NodeInfo) []*keys.PrivateKey {
I see this function being used in tests once, but I don't understand why it is here. We get the placement policy, then container nodes, then work. In real code this function won't exist, why is it needed for tests?

Because after we get the placement vector, nodes can be reordered. It's used to reorder node keys to set errors in tests.

You have nodeKeys []*keys.PrivateKey, but it seems the only way you use it is to sift it through the HRW hash and filter client nodes. Now the question: can we omit this array completely? It would be makeClientMap([]netmap.NodeInfo), checkClientUsage(t, p, ...netmap.NodeInfo), etc. I see no need for private keys currently.
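A sketch of the key-free test helper the reviewer suggests (hypothetical; client and treeClientMock stand in for the pool's internal types):

// makeClientMap builds the pool's client map keyed by node hash,
// using netmap.NodeInfo directly instead of private keys.
func makeClientMap(nodes []netmap.NodeInfo) map[uint64]client {
	m := make(map[uint64]client, len(nodes))
	for _, n := range nodes {
		m[n.Hash()] = &treeClientMock{} // hypothetical mock client
	}
	return m
}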
@ -869,0 +969,4 @@
if cl, err = treeCl.serviceClient(); err == nil {
err = fn(cl)
}
if shouldRedial(ctx, err) {
Let me highlight some points about this method.

Are you sure we need to check ctx within this invocation? It looks like you are trying to check the error from the client call outside of the client call. I mean, err will be assigned context.Canceled or context.DeadlineExceeded anyway after err = fn(cl). I believe we should check ctx.Err() != nil before getNewTreeClient to avoid an invocation that will definitely fail because the context is done. So, currently, I don't see any reason to check ctx within shouldRedial.

Also, the name shouldRedial is a bit complicated. WDYT, could isRedialableError be better for this purpose?

Why will err be assigned context.Canceled or context.DeadlineExceeded anyway after err = fn(cl)?

I suppose the context is going to be propagated through the functor. But, anyway, I agree we should check the error ctx.Err(), just not within shouldRedial, and before getNewTreeClient.

Okay, let me clarify what I really meant:
1. err can be assigned context cancellation errors, as the context is propagated through the functor, but that may happen on stream opening, not on Send/Recv messages.
2. shouldRedial shouldn't check the context error. Instead, this can be checked separately, before the instantiation of a tree client.
3. shouldRedial is better renamed to isRedialableError :)

I think "redialable" is not quite appropriate. The purpose of this function is to check whether the error is expected, or whether the connection to the node should be established again next time.

Okay, it's up to you :)
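A sketch of the approach the thread converges on (assumed; shouldRedial keeps classifying errors, while the context is checked separately before dialing):

// Check the context once, before creating a new tree client, so we don't
// dial a connection that is doomed to fail.
if err := ctx.Err(); err != nil {
	return err
}
treeCl, err := p.getNewTreeClient(ctx, cnrNode)
if err == nil {
	p.addClientToMap(cnrNode.Hash(), treeCl)
} else if shouldRedial(err) {
	// Unexpected error: drop any cached client so the connection is
	// re-established on the next request.
	p.deleteClientFromMap(cnrNode.Hash()) // hypothetical helper
}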
aa9d8ae72a to ce16b90b5f
New commits pushed, approval review dismissed automatically according to repository settings
@ -0,0 +1,112 @@
// NOTE: code is taken from https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/network/address.go
It's better to link exact revision instead of a branch.
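For example, pinning the revision referenced earlier in this review:

// NOTE: code is taken from https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/7e542906ef/pkg/network/address.go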
ce16b90b5f to 42a0fc8c13
New commits pushed, approval review dismissed automatically according to repository settings