Async evacuate #329
Reference: TrueCloudLab/frostfs-node#329
No description provided.
Closes #109
8d25593d44 to 9822a7835b
9822a7835b to 9f0b4d7125
9f0b4d7125 to 24cf2a2fff
24cf2a2fff to eb1937eb1e
eb1937eb1e to 4964fa434b
4964fa434b to 44a3bec8ee
44a3bec8ee to e13ec5c662
WIP: Async evacuate to Async evacuate
e13ec5c662 to 6d3f2a4670
@ -0,0 +21,4 @@
prm.WithShardIDList(s.getShardIDList(req.GetBody().GetShard_ID()))
prm.WithIgnoreErrors(req.GetBody().GetIgnoreErrors())
prm.WithFaultHandler(s.replicate)
prm.WithAsync(true)
For me it's obvious that `WithAsync` sets the `async` flag, so there's no need to pass a boolean argument - just `WithAsync()`. WDYT?

It is usually more convenient, because you can write `WithAsync(async)` instead of `if async { WithAsync() }`.
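A tiny sketch of the trade-off (illustrative names, not the actual frostfs-node API):

```go
package main

import "fmt"

// Prm mimics a parameter struct configured through With* setters.
type Prm struct {
	async bool
}

// WithAsync takes the flag as an argument, so call sites can forward a
// variable unconditionally instead of branching.
func (p *Prm) WithAsync(async bool) { p.async = async }

func main() {
	async := true // e.g. parsed from a request
	var prm Prm

	prm.WithAsync(async) // argument style: one unconditional call
	// The toggle style would need a branch at every call site:
	// if async { prm.WithAsync() }

	fmt.Println(prm.async) // true
}
```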
@ -65,2 +117,3 @@
// The shard being moved must be in read-only mode.
func (e *StorageEngine) Evacuate(ctx context.Context, prm EvacuateShardPrm) (EvacuateShardRes, error) {
func (e *StorageEngine) Evacuate(ctx context.Context, prm EvacuateShardPrm) (*EvacuateShardRes, error) {
select {
Could you explain, please, why you have put these `select`s at the beginning of the methods? This `select` looks inconvenient to me. This check should be performed by the code that runs something asynchronously (like the errgroup below).

The RPC call has a deadline, so `<-ctx.Done()` can be true.

No doubt. But why don't we rely on the context cancellation at the `errgroup.Go` invocation level?

errgroup can use a detached context (`context.Background()`) in case of async execution, so this check will be the only one for `ctx.Done`.
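A minimal sketch of the pattern from this thread (hypothetical shape, not the real engine code): the caller's context is checked once at the entry point, because async execution then runs on a detached context that the RPC deadline can no longer cancel.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func evacuate(ctx context.Context, async bool) error {
	select {
	case <-ctx.Done():
		// The RPC deadline may already be exceeded when we get here.
		return ctx.Err()
	default:
	}

	work := func(ctx context.Context) {
		select {
		case <-ctx.Done():
			fmt.Println("cancelled")
		case <-time.After(50 * time.Millisecond):
			fmt.Println("evacuation finished")
		}
	}

	if async {
		// Detached context: the evacuation outlives the RPC call, so the
		// entry check above is the only place ctx.Done() is consulted.
		go work(context.Background())
		return nil
	}
	work(ctx)
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_ = evacuate(ctx, true)
	time.Sleep(100 * time.Millisecond) // let the detached work finish
}
```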
@ -80,3 +83,3 @@
t.Parallel()
const objPerShard = 3
var objPerShard uint32 = 3
But why?
fixed
@ -262,0 +316,4 @@
prm.shardID = ids[1:2]
prm.handler = func(ctx context.Context, a oid.Address, o *objectSDK.Object) error {
running.Store(true)
for !blocker.Load() {
We set it only one time, what about closing the channel instead?
fixed
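The suggested fix is the usual Go one-shot signal: close a channel instead of spinning on an atomic flag. A minimal sketch (illustrative, not the test code itself):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// One-shot signal via channel close, instead of busy-waiting
	// on an atomic bool: for !blocker.Load() {}.
	blocker := make(chan struct{}) // struct{} carries no data; we only close it

	go func() {
		fmt.Println("handler running, waiting to be unblocked")
		<-blocker // blocks until the channel is closed; no busy-wait
		fmt.Println("unblocked")
	}()

	time.Sleep(10 * time.Millisecond)
	close(blocker) // signals every waiter exactly once
	time.Sleep(10 * time.Millisecond)
}
```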
6d3f2a4670 to 1bb04aa03a
1bb04aa03a to 785d58e76d
@ -0,0 +6,4 @@
To start the evacuation, it is necessary that the shard is in read-only mode (read more [here](./shard-modes.md)).
During evacuation, the data is first transferred to other shards of the same node; if that is not possible, the data is transferred to other nodes.
Is it true that if one wants to migrate all the data out of a storage node, they need to add all shards with a `shardAllFlag` toggled? If so, can we emphasize it in the documentation?

Added to the `Commands` section.

@ -0,0 +135,4 @@
const reportIntervalSeconds = 5
var resp *control.GetEvacuateShardStatusResponse
reportResponse := atomic.NewPointer(resp)
poolingCompleted := make(chan interface{})
Is it possible to use `any` here and below?

fixed to `struct{}`
@ -86,0 +166,4 @@
}
err = eg.Wait()
return res, err
Is it possible to squash it into one line?
fixed.
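Presumably the squashed form returns the errgroup result directly (an assumption; the merged diff is not shown here):

```go
return res, eg.Wait()
```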
@ -0,0 +36,4 @@
}
}
var duration *control.GetEvacuateShardStatusResponse_Body_Duration
if state.StartedAt() != nil {
Is it possible to use one `if` statement here and above?

It is possible. But the first `if` is for the `startedAt` value, and the second `if` is for the `duration` value. It's easier for me this way, I'm old.

@ -0,0 +135,4 @@
const reportIntervalSeconds = 5
var resp *control.GetEvacuateShardStatusResponse
reportResponse := atomic.NewPointer(resp)
poolingCompleted := make(chan interface{})
pooling or polling?
polling. fixed.
@ -0,0 +136,4 @@
var resp *control.GetEvacuateShardStatusResponse
reportResponse := atomic.NewPointer(resp)
poolingCompleted := make(chan interface{})
progressReportCompleted := make(chan interface{})
Also, why `interface{}` and not `struct{}`? It seems we only close it.

Ok, `struct{}`. Fixed.
@ -0,0 +138,4 @@
poolingCompleted := make(chan interface{})
progressReportCompleted := make(chan interface{})
go func() {
Why do we need a goroutine here? It seems we already sleep in the main loop.
The goroutine prints a report every N seconds; the main loop makes the request and sleeps.
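A minimal sketch of that division of labor, with names echoing the diff but otherwise illustrative:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	const reportInterval = time.Second

	var latest atomic.Value // last polled status, read by the reporter
	pollingCompleted := make(chan struct{})
	progressReportCompleted := make(chan struct{})

	// Reporter goroutine: prints the latest status every reportInterval,
	// independently of how often the main loop polls.
	go func() {
		defer close(progressReportCompleted)
		for {
			select {
			case <-pollingCompleted:
				return
			case <-time.After(reportInterval):
				if s, ok := latest.Load().(string); ok {
					fmt.Println("progress:", s)
				}
			}
		}
	}()

	// Main loop: makes the status request, stores the response, sleeps.
	for i := 0; i < 3; i++ {
		latest.Store(fmt.Sprintf("polled %d times", i+1)) // stand-in for an RPC call
		time.Sleep(700 * time.Millisecond)
	}
	close(pollingCompleted)
	<-progressReportCompleted
}
```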
@ -0,0 +95,4 @@
return s.errMessage
}
func (s *EvacuationState) NextTryAfterSeconds() int64 {
Do we need to specify the client polling interval here, in the engine?

To reduce the load, the server can increase this interval, but for now it's constant. Inspired by the OAuth2 device code flow: https://learn.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-device-code#device-authorization-response

Ok, dropped.
@ -0,0 +103,4 @@
if s == nil {
return nil
}
shardIDs := make([]string, len(s.shardIDs))
Isn't it read-only?
It is. But since the method is called DeepCopy, the copy must be deep.
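A minimal sketch of the point (the type and field names follow the diff; the rest is illustrative):

```go
package main

import "fmt"

// EvacuationState is a stand-in for the state type in the diff.
type EvacuationState struct {
	shardIDs []string
}

// DeepCopy copies the slice as well: even if callers treat shardIDs as
// read-only today, a method named DeepCopy should not share backing arrays.
func (s *EvacuationState) DeepCopy() *EvacuationState {
	if s == nil {
		return nil
	}
	shardIDs := make([]string, len(s.shardIDs))
	copy(shardIDs, s.shardIDs)
	return &EvacuationState{shardIDs: shardIDs}
}

func main() {
	orig := &EvacuationState{shardIDs: []string{"shard-1", "shard-2"}}
	cp := orig.DeepCopy()
	cp.shardIDs[0] = "mutated"
	fmt.Println(orig.shardIDs[0]) // still "shard-1": the copy is independent
}
```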
@ -0,0 +170,4 @@
l.guard.RLock()
defer l.guard.RUnlock()
return l.state.DeepCopy()
Why do we need a deep copy here? Pointers to atomics are OK to copy.
There were two ways to ensure consistency: mutex inside the structure or deep copy. I chose the second one. This is my engineering decision.
Could you elaborate, what consistency issues do we have if we do not do a deep copy?
`state` can be changed by the evacuation goroutine, so we could get, for example, a completed state without an error.

@ -30,2 +31,4 @@
rpc EvacuateShard (EvacuateShardRequest) returns (EvacuateShardResponse);
// StartEvacuateShard starts moving all data from one shard to the others.
rpc StartEvacuateShard (StartEvacuateShardRequest) returns (StartEvacuateShardResponse);
Why is it `evacuation` in the CLI and `evacuate` here?

fixed
785d58e76d to 6544054ae0
6544054ae0 to 3c9399c5c8
3c9399c5c8 to 7925476a54
7925476a54 to 96ee9c1475
96ee9c1475 to 84ce746fcc
84ce746fcc to 3aff900769
control shards evacuate #109
3aff900769 to 363105c8d3