Async evacuate #329

Merged
fyrchik merged 4 commits from dstepanov-yadro/frostfs-node:feat/async-evacuate into master 2023-05-19 08:43:54 +00:00

Closes #109

Closes #109
dstepanov-yadro force-pushed feat/async-evacuate from 8d25593d44 to 9822a7835b 2023-05-05 14:36:21 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 9822a7835b to 9f0b4d7125 2023-05-05 14:56:15 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 9f0b4d7125 to 24cf2a2fff 2023-05-05 15:11:11 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 24cf2a2fff to eb1937eb1e 2023-05-10 06:43:16 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from eb1937eb1e to 4964fa434b 2023-05-10 11:04:45 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 4964fa434b to 44a3bec8ee 2023-05-10 11:09:49 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 44a3bec8ee to e13ec5c662 2023-05-10 13:20:45 +00:00 Compare
dstepanov-yadro requested review from storage-core-committers 2023-05-10 13:24:27 +00:00
dstepanov-yadro requested review from storage-core-developers 2023-05-10 13:24:27 +00:00
dstepanov-yadro changed title from WIP: Async evacuate to Async evacuate 2023-05-10 13:24:35 +00:00
dstepanov-yadro force-pushed feat/async-evacuate from e13ec5c662 to 6d3f2a4670 2023-05-15 09:17:12 +00:00 Compare
aarifullin reviewed 2023-05-15 09:37:15 +00:00
@ -0,0 +21,4 @@
prm.WithShardIDList(s.getShardIDList(req.GetBody().GetShard_ID()))
prm.WithIgnoreErrors(req.GetBody().GetIgnoreErrors())
prm.WithFaultHandler(s.replicate)
prm.WithAsync(true)
Collaborator

For me it's obvios that WithAsync sets async flag and there's no need to pass the boolean argument - just WithAsync() - WDYT?

For me it's obvios that `WithAsync` sets `async` flag and there's no need to pass the boolean argument - just `WithAsync()` - WDYT?

It is usually more convenient because:

  1. Easier to pass down the stream (WithAsync(async) instead of if async { WithAsync }.
  2. Easier to override (e.g. we have defaults in tests, but don't want Async for one specific test).
It is usually more convenient because: 1. Easier to pass down the stream (`WithAsync(async)` instead of `if async { WithAsync }`. 2. Easier to override (e.g. we have defaults in tests, but don't want Async for one specific test).
aarifullin reviewed 2023-05-15 09:45:12 +00:00
@ -65,2 +117,3 @@
// The shard being moved must be in read-only mode.
func (e *StorageEngine) Evacuate(ctx context.Context, prm EvacuateShardPrm) (EvacuateShardRes, error) {
func (e *StorageEngine) Evacuate(ctx context.Context, prm EvacuateShardPrm) (*EvacuateShardRes, error) {
select {
Collaborator

Could you explain, please, why have you put these select-s at the beginning of the methods?
For me this select looks inconvinient. This check should be performed by the code that runs something asynchroniously (like the errgroup below)

Could you explain, please, why have you put these `select`-s at the beginning of the methods? For me this `select` looks inconvinient. This check should be performed by the code that runs something asynchroniously (like the errgroup below)
Poster
Collaborator

RPC call has deadline, so <-ctx.Done() can be true.

RPC call has deadline, so `<-ctx.Done()` can be true.
Collaborator

can be true

No doubt.

But why don't we rely on the context cancellation on errgroup.Go invocation level?

> can be true No doubt. But why don't we rely on the context cancellation on `errgroup.Go` invocation level?
Poster
Collaborator

errGroup can use detached context (context.Background()) in case of async execution, so this check will be the only one for ctx.Done

errGroup can use detached context (context.Background()) in case of async execution, so this check will be the only one for ctx.Done
fyrchik reviewed 2023-05-15 16:24:10 +00:00
@ -80,3 +83,3 @@
t.Parallel()
const objPerShard = 3
var objPerShard uint32 = 3

But why?

But why?
Poster
Collaborator

fixed

fixed
fyrchik marked this conversation as resolved
@ -262,0 +316,4 @@
prm.shardID = ids[1:2]
prm.handler = func(ctx context.Context, a oid.Address, o *objectSDK.Object) error {
running.Store(true)
for !blocker.Load() {

We set it only one time, what about closing the channel instead?

We set it only one time, what about closing the channel instead?
Poster
Collaborator

fixed

fixed
fyrchik marked this conversation as resolved
dstepanov-yadro force-pushed feat/async-evacuate from 6d3f2a4670 to 1bb04aa03a 2023-05-16 07:20:05 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 1bb04aa03a to 785d58e76d 2023-05-16 07:25:20 +00:00 Compare
realloc reviewed 2023-05-16 12:51:15 +00:00
@ -0,0 +6,4 @@
To start the evacuation, it is necessary that the shard is in read-only mode (read more [here](./shard-modes.md)).
First of all, by the evacuation the data is transferred to other shards of the same node; if it is not possible, then the data is transferred to other nodes.

Is it true, that if the one wants to migrate all the data out of a storage node, she needs to add all shards with a shardAllFlag toggled? If so, can we emphasize it in documentation?

Is it true, that if the one wants to migrate all the data out of a storage node, she needs to add all shards with a `shardAllFlag` toggled? If so, can we emphasize it in documentation?
Poster
Collaborator

Added to Commands section.

Added to `Commands` section.
acid-ant approved these changes 2023-05-16 13:13:53 +00:00
@ -0,0 +135,4 @@
const reportIntervalSeconds = 5
var resp *control.GetEvacuateShardStatusResponse
reportResponse := atomic.NewPointer(resp)
poolingCompleted := make(chan interface{})
Collaborator

Is it possible to use any here and below?

Is it possible to use `any` here and below?
Poster
Collaborator

fixed to struct{}

fixed to struct{}
@ -86,0 +166,4 @@
}
err = eg.Wait()
return res, err
Collaborator

Is it possible to squash it in one line?

Is it possible to squash it in one line?
Poster
Collaborator

fixed.

fixed.
@ -0,0 +36,4 @@
}
}
var duration *control.GetEvacuateShardStatusResponse_Body_Duration
if state.StartedAt() != nil {
Collaborator

Is it possible to use one if statement here and above?

Is it possible to use one `if` statement here and above?
Poster
Collaborator

It is possible. But first if is for startedAt value, and second if for duration value. It's easier for me this way, i'm old.

It is possible. But first `if` is for `startedAt` value, and second `if` for `duration` value. It's easier for me this way, i'm old.
fyrchik reviewed 2023-05-16 13:18:23 +00:00
@ -0,0 +135,4 @@
const reportIntervalSeconds = 5
var resp *control.GetEvacuateShardStatusResponse
reportResponse := atomic.NewPointer(resp)
poolingCompleted := make(chan interface{})

pooling or polling?

pooling or polling?
Poster
Collaborator

polling. fixed.

polling. fixed.
fyrchik marked this conversation as resolved
@ -0,0 +136,4 @@
var resp *control.GetEvacuateShardStatusResponse
reportResponse := atomic.NewPointer(resp)
poolingCompleted := make(chan interface{})
progressReportCompleted := make(chan interface{})

Also, why interface{} and not struct{}? It seems we only close it.

Also, why `interface{}` and not `struct{}`? It seems we only close it.
Poster
Collaborator

Ok, struct{}. Fixed.

Ok, struct{}. Fixed.
fyrchik marked this conversation as resolved
@ -0,0 +138,4 @@
poolingCompleted := make(chan interface{})
progressReportCompleted := make(chan interface{})
go func() {

Why do we need a goroutine here? It seems we already sleep in the main loop.

Why do we need a goroutine here? It seems we already sleep in the main loop.
Poster
Collaborator

Goroutine prints report every N seconds. Main loop makes request and sleeps.

Goroutine prints report every N seconds. Main loop makes request and sleeps.
fyrchik marked this conversation as resolved
@ -0,0 +95,4 @@
return s.errMessage
}
func (s *EvacuationState) NextTryAfterSeconds() int64 {

Do we need to specify client polling interval here in the engine?

Do we need to specify _client_ polling interval here in the engine?
Poster
Collaborator

To reduce the load, the server can increase this interval. But now it's constant.
Inspired by OAuth2 device code flow: https://learn.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-device-code#device-authorization-response

To reduce the load, the server can increase this interval. But now it's constant. Inspired by OAuth2 device code flow: https://learn.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-device-code#device-authorization-response
Poster
Collaborator

Ok, dropped

Ok, dropped
fyrchik marked this conversation as resolved
@ -0,0 +103,4 @@
if s == nil {
return nil
}
shardIDs := make([]string, len(s.shardIDs))

Isn't it read-only?

Isn't it read-only?
Poster
Collaborator

It is. But since the method is called DeepCopy, then the copy must be deep.

It is. But since the method is called DeepCopy, then the copy must be deep.
fyrchik marked this conversation as resolved
@ -0,0 +170,4 @@
l.guard.RLock()
defer l.guard.RUnlock()
return l.state.DeepCopy()

Why do we need deepcopy here? Pointers to atomics are ok to copy.

Why do we need deepcopy here? Pointers to atomics are ok to copy.
Poster
Collaborator

There were two ways to ensure consistency: mutex inside the structure or deep copy. I chose the second one. This is my engineering decision.

There were two ways to ensure consistency: mutex inside the structure or deep copy. I chose the second one. This is my engineering decision.

Could you elaborate, what consistency issues do we have if we do not do a deep copy?

Could you elaborate, what consistency issues do we have if we do not do a deep copy?
Poster
Collaborator
func (l *evacuationLimiter) Complete(err error) {
	l.guard.Lock()
	defer l.guard.Unlock()

	errMsq := ""
	if err != nil {
		errMsq = err.Error()
	}
	l.state.processState = EvacuateProcessStateCompleted
	l.state.errMessage = errMsq
	l.state.finishedAt = time.Now().UTC()

	l.eg = nil
}

func (l *evacuationLimiter) GetState() *EvacuationState {
	l.guard.RLock()
	defer l.guard.RUnlock()

	return l.state.DeepCopy()
}

state can change by evacuation goroutine, so we can get completed state without error for example.

``` func (l *evacuationLimiter) Complete(err error) { l.guard.Lock() defer l.guard.Unlock() errMsq := "" if err != nil { errMsq = err.Error() } l.state.processState = EvacuateProcessStateCompleted l.state.errMessage = errMsq l.state.finishedAt = time.Now().UTC() l.eg = nil } func (l *evacuationLimiter) GetState() *EvacuationState { l.guard.RLock() defer l.guard.RUnlock() return l.state.DeepCopy() } ``` `state` can change by evacuation goroutine, so we can get completed state without error for example.
@ -30,2 +31,4 @@
rpc EvacuateShard (EvacuateShardRequest) returns (EvacuateShardResponse);
// StartEvacuateShard starts moving all data from one shard to the others.
rpc StartEvacuateShard (StartEvacuateShardRequest) returns (StartEvacuateShardResponse);

Why is it evacuation in CLI and evacuate here?

Why is it `evacuation` in CLI and `evacuate` here?
Poster
Collaborator

fixed

fixed
fyrchik marked this conversation as resolved
dstepanov-yadro force-pushed feat/async-evacuate from 785d58e76d to 6544054ae0 2023-05-16 13:40:20 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 6544054ae0 to 3c9399c5c8 2023-05-16 13:56:06 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 3c9399c5c8 to 7925476a54 2023-05-16 14:16:53 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 7925476a54 to 96ee9c1475 2023-05-16 14:18:11 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 96ee9c1475 to 84ce746fcc 2023-05-16 14:23:59 +00:00 Compare
aarifullin approved these changes 2023-05-16 14:57:22 +00:00
dstepanov-yadro force-pushed feat/async-evacuate from 84ce746fcc to 3aff900769 2023-05-17 08:20:49 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 3aff900769 to 363105c8d3 2023-05-18 08:38:03 +00:00 Compare
fyrchik approved these changes 2023-05-19 08:43:45 +00:00
fyrchik merged commit 483fac03d6 into master 2023-05-19 08:43:54 +00:00
Sign in to join this conversation.
No reviewers
No Milestone
No Assignees
5 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: TrueCloudLab/frostfs-node#329
There is no content yet.