Async evacuate #329

Merged
fyrchik merged 4 commits from dstepanov-yadro/frostfs-node:feat/async-evacuate into master 2023-05-19 08:43:54 +00:00

Closes #109

Closes #109
dstepanov-yadro force-pushed feat/async-evacuate from 8d25593d44 to 9822a7835b 2023-05-05 14:36:21 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 9822a7835b to 9f0b4d7125 2023-05-05 14:56:15 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 9f0b4d7125 to 24cf2a2fff 2023-05-05 15:11:11 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 24cf2a2fff to eb1937eb1e 2023-05-10 06:43:16 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from eb1937eb1e to 4964fa434b 2023-05-10 11:04:45 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 4964fa434b to 44a3bec8ee 2023-05-10 11:09:49 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 44a3bec8ee to e13ec5c662 2023-05-10 13:20:45 +00:00 Compare
dstepanov-yadro requested review from storage-core-committers 2023-05-10 13:24:27 +00:00
dstepanov-yadro requested review from storage-core-developers 2023-05-10 13:24:27 +00:00
dstepanov-yadro changed title from WIP: Async evacuate to Async evacuate 2023-05-10 13:24:35 +00:00
dstepanov-yadro force-pushed feat/async-evacuate from e13ec5c662 to 6d3f2a4670 2023-05-15 09:17:12 +00:00 Compare
aarifullin reviewed 2023-05-15 09:37:15 +00:00
@ -0,0 +21,4 @@
prm.WithShardIDList(s.getShardIDList(req.GetBody().GetShard_ID()))
prm.WithIgnoreErrors(req.GetBody().GetIgnoreErrors())
prm.WithFaultHandler(s.replicate)
prm.WithAsync(true)
Member

For me it's obvios that WithAsync sets async flag and there's no need to pass the boolean argument - just WithAsync() - WDYT?

For me it's obvios that `WithAsync` sets `async` flag and there's no need to pass the boolean argument - just `WithAsync()` - WDYT?
Owner

It is usually more convenient because:

  1. Easier to pass down the stream (WithAsync(async) instead of if async { WithAsync }.
  2. Easier to override (e.g. we have defaults in tests, but don't want Async for one specific test).
It is usually more convenient because: 1. Easier to pass down the stream (`WithAsync(async)` instead of `if async { WithAsync }`. 2. Easier to override (e.g. we have defaults in tests, but don't want Async for one specific test).
aarifullin reviewed 2023-05-15 09:45:12 +00:00
@ -65,2 +117,3 @@
// The shard being moved must be in read-only mode.
func (e *StorageEngine) Evacuate(ctx context.Context, prm EvacuateShardPrm) (EvacuateShardRes, error) {
func (e *StorageEngine) Evacuate(ctx context.Context, prm EvacuateShardPrm) (*EvacuateShardRes, error) {
select {
Member

Could you explain, please, why have you put these select-s at the beginning of the methods?
For me this select looks inconvinient. This check should be performed by the code that runs something asynchroniously (like the errgroup below)

Could you explain, please, why have you put these `select`-s at the beginning of the methods? For me this `select` looks inconvinient. This check should be performed by the code that runs something asynchroniously (like the errgroup below)
Author
Member

RPC call has deadline, so <-ctx.Done() can be true.

RPC call has deadline, so `<-ctx.Done()` can be true.
Member

can be true

No doubt.

But why don't we rely on the context cancellation on errgroup.Go invocation level?

> can be true No doubt. But why don't we rely on the context cancellation on `errgroup.Go` invocation level?
Author
Member

errGroup can use detached context (context.Background()) in case of async execution, so this check will be the only one for ctx.Done

errGroup can use detached context (context.Background()) in case of async execution, so this check will be the only one for ctx.Done
fyrchik reviewed 2023-05-15 16:24:10 +00:00
@ -80,3 +83,3 @@
t.Parallel()
const objPerShard = 3
var objPerShard uint32 = 3
Owner

But why?

But why?
Author
Member

fixed

fixed
fyrchik marked this conversation as resolved
@ -262,0 +316,4 @@
prm.shardID = ids[1:2]
prm.handler = func(ctx context.Context, a oid.Address, o *objectSDK.Object) error {
running.Store(true)
for !blocker.Load() {
Owner

We set it only one time, what about closing the channel instead?

We set it only one time, what about closing the channel instead?
Author
Member

fixed

fixed
fyrchik marked this conversation as resolved
dstepanov-yadro force-pushed feat/async-evacuate from 6d3f2a4670 to 1bb04aa03a 2023-05-16 07:20:05 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 1bb04aa03a to 785d58e76d 2023-05-16 07:25:20 +00:00 Compare
realloc reviewed 2023-05-16 12:51:15 +00:00
@ -0,0 +6,4 @@
To start the evacuation, it is necessary that the shard is in read-only mode (read more [here](./shard-modes.md)).
First of all, by the evacuation the data is transferred to other shards of the same node; if it is not possible, then the data is transferred to other nodes.
Owner

Is it true, that if the one wants to migrate all the data out of a storage node, she needs to add all shards with a shardAllFlag toggled? If so, can we emphasize it in documentation?

Is it true, that if the one wants to migrate all the data out of a storage node, she needs to add all shards with a `shardAllFlag` toggled? If so, can we emphasize it in documentation?
Author
Member

Added to Commands section.

Added to `Commands` section.
acid-ant approved these changes 2023-05-16 13:13:53 +00:00
@ -0,0 +135,4 @@
const reportIntervalSeconds = 5
var resp *control.GetEvacuateShardStatusResponse
reportResponse := atomic.NewPointer(resp)
poolingCompleted := make(chan interface{})
Member

Is it possible to use any here and below?

Is it possible to use `any` here and below?
Author
Member

fixed to struct{}

fixed to struct{}
@ -86,0 +166,4 @@
}
err = eg.Wait()
return res, err
Member

Is it possible to squash it in one line?

Is it possible to squash it in one line?
Author
Member

fixed.

fixed.
@ -0,0 +36,4 @@
}
}
var duration *control.GetEvacuateShardStatusResponse_Body_Duration
if state.StartedAt() != nil {
Member

Is it possible to use one if statement here and above?

Is it possible to use one `if` statement here and above?
Author
Member

It is possible. But first if is for startedAt value, and second if for duration value. It's easier for me this way, i'm old.

It is possible. But first `if` is for `startedAt` value, and second `if` for `duration` value. It's easier for me this way, i'm old.
fyrchik reviewed 2023-05-16 13:18:23 +00:00
@ -0,0 +135,4 @@
const reportIntervalSeconds = 5
var resp *control.GetEvacuateShardStatusResponse
reportResponse := atomic.NewPointer(resp)
poolingCompleted := make(chan interface{})
Owner

pooling or polling?

pooling or polling?
Author
Member

polling. fixed.

polling. fixed.
fyrchik marked this conversation as resolved
@ -0,0 +136,4 @@
var resp *control.GetEvacuateShardStatusResponse
reportResponse := atomic.NewPointer(resp)
poolingCompleted := make(chan interface{})
progressReportCompleted := make(chan interface{})
Owner

Also, why interface{} and not struct{}? It seems we only close it.

Also, why `interface{}` and not `struct{}`? It seems we only close it.
Author
Member

Ok, struct{}. Fixed.

Ok, struct{}. Fixed.
fyrchik marked this conversation as resolved
@ -0,0 +138,4 @@
poolingCompleted := make(chan interface{})
progressReportCompleted := make(chan interface{})
go func() {
Owner

Why do we need a goroutine here? It seems we already sleep in the main loop.

Why do we need a goroutine here? It seems we already sleep in the main loop.
Author
Member

Goroutine prints report every N seconds. Main loop makes request and sleeps.

Goroutine prints report every N seconds. Main loop makes request and sleeps.
fyrchik marked this conversation as resolved
@ -0,0 +95,4 @@
return s.errMessage
}
func (s *EvacuationState) NextTryAfterSeconds() int64 {
Owner

Do we need to specify client polling interval here in the engine?

Do we need to specify _client_ polling interval here in the engine?
Author
Member

To reduce the load, the server can increase this interval. But now it's constant.
Inspired by OAuth2 device code flow: https://learn.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-device-code#device-authorization-response

To reduce the load, the server can increase this interval. But now it's constant. Inspired by OAuth2 device code flow: https://learn.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-device-code#device-authorization-response
Author
Member

Ok, dropped

Ok, dropped
fyrchik marked this conversation as resolved
@ -0,0 +103,4 @@
if s == nil {
return nil
}
shardIDs := make([]string, len(s.shardIDs))
Owner

Isn't it read-only?

Isn't it read-only?
Author
Member

It is. But since the method is called DeepCopy, then the copy must be deep.

It is. But since the method is called DeepCopy, then the copy must be deep.
fyrchik marked this conversation as resolved
@ -0,0 +170,4 @@
l.guard.RLock()
defer l.guard.RUnlock()
return l.state.DeepCopy()
Owner

Why do we need deepcopy here? Pointers to atomics are ok to copy.

Why do we need deepcopy here? Pointers to atomics are ok to copy.
Author
Member

There were two ways to ensure consistency: mutex inside the structure or deep copy. I chose the second one. This is my engineering decision.

There were two ways to ensure consistency: mutex inside the structure or deep copy. I chose the second one. This is my engineering decision.
Owner

Could you elaborate, what consistency issues do we have if we do not do a deep copy?

Could you elaborate, what consistency issues do we have if we do not do a deep copy?
Author
Member
func (l *evacuationLimiter) Complete(err error) {
	l.guard.Lock()
	defer l.guard.Unlock()

	errMsq := ""
	if err != nil {
		errMsq = err.Error()
	}
	l.state.processState = EvacuateProcessStateCompleted
	l.state.errMessage = errMsq
	l.state.finishedAt = time.Now().UTC()

	l.eg = nil
}

func (l *evacuationLimiter) GetState() *EvacuationState {
	l.guard.RLock()
	defer l.guard.RUnlock()

	return l.state.DeepCopy()
}

state can change by evacuation goroutine, so we can get completed state without error for example.

``` func (l *evacuationLimiter) Complete(err error) { l.guard.Lock() defer l.guard.Unlock() errMsq := "" if err != nil { errMsq = err.Error() } l.state.processState = EvacuateProcessStateCompleted l.state.errMessage = errMsq l.state.finishedAt = time.Now().UTC() l.eg = nil } func (l *evacuationLimiter) GetState() *EvacuationState { l.guard.RLock() defer l.guard.RUnlock() return l.state.DeepCopy() } ``` `state` can change by evacuation goroutine, so we can get completed state without error for example.
@ -30,2 +31,4 @@
rpc EvacuateShard (EvacuateShardRequest) returns (EvacuateShardResponse);
// StartEvacuateShard starts moving all data from one shard to the others.
rpc StartEvacuateShard (StartEvacuateShardRequest) returns (StartEvacuateShardResponse);
Owner

Why is it evacuation in CLI and evacuate here?

Why is it `evacuation` in CLI and `evacuate` here?
Author
Member

fixed

fixed
fyrchik marked this conversation as resolved
dstepanov-yadro force-pushed feat/async-evacuate from 785d58e76d to 6544054ae0 2023-05-16 13:40:20 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 6544054ae0 to 3c9399c5c8 2023-05-16 13:56:06 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 3c9399c5c8 to 7925476a54 2023-05-16 14:16:53 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 7925476a54 to 96ee9c1475 2023-05-16 14:18:11 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 96ee9c1475 to 84ce746fcc 2023-05-16 14:23:59 +00:00 Compare
aarifullin approved these changes 2023-05-16 14:57:22 +00:00
dstepanov-yadro force-pushed feat/async-evacuate from 84ce746fcc to 3aff900769 2023-05-17 08:20:49 +00:00 Compare
dstepanov-yadro force-pushed feat/async-evacuate from 3aff900769 to 363105c8d3 2023-05-18 08:38:03 +00:00 Compare
fyrchik approved these changes 2023-05-19 08:43:45 +00:00
fyrchik merged commit 483fac03d6 into master 2023-05-19 08:43:54 +00:00
fyrchik referenced this pull request from a commit 2023-05-19 08:43:54 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
5 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: TrueCloudLab/frostfs-node#329
No description provided.