engine: Evacuate object from shards concurrently #1356

Merged
acid-ant merged 1 commit from acid-ant/frostfs-node:feat/evac-parallel into master 2024-09-24 12:02:17 +00:00
Member

Introduced two API and config parameters for evacuation - container-worker-count and object-worker-count.
All shards for evacuation will be processed concurrently.

Operation duration before changes for object size 81920 b:

$ frostfs-cli control shards evacuation start --endpoint s01.frostfs.devenv:8081 -c cfg.yml --id 9ZQneMMEc58CtT41DkVsJQ --await
...
Shard evacuation has been completed.
Shard IDs: 9ZQneMMEc58CtT41DkVsJQ. Evacuated 3351 objects out of 3351, failed to evacuate: 0, skipped: 0; evacuated 0 trees out of 0, failed to evacuate: 0. Started at: 2024-09-22T15:38:27Z UTC. Duration: 00:00:44.

Operation duration after changes:

$ frostfs-cli control shards evacuation start --endpoint s01.frostfs.devenv:8081 -c cfg.yml --id BkFt6hctFEChpSU1fT87Fi --await --container-worker-count 5 --object-worker-count 5
...
Shard evacuation has been completed.
Shard IDs: BkFt6hctFEChpSU1fT87Fi. Evacuated 3588 objects out of 3588, failed to evacuate: 0, skipped: 0; evacuated 0 trees out of 0, failed to evacuate: 0. Started at: 2024-09-22T14:36:59Z UTC. Duration: 00:00:10.

Signed-off-by: Anton Nikiforov an.nikiforov@yadro.com

Introduced two API and config parameters for evacuation - `container-worker-count` and `object-worker-count`. All shards for evacuation will be processed concurrently. Operation duration before changes for object size 81920 b: ``` $ frostfs-cli control shards evacuation start --endpoint s01.frostfs.devenv:8081 -c cfg.yml --id 9ZQneMMEc58CtT41DkVsJQ --await ... Shard evacuation has been completed. Shard IDs: 9ZQneMMEc58CtT41DkVsJQ. Evacuated 3351 objects out of 3351, failed to evacuate: 0, skipped: 0; evacuated 0 trees out of 0, failed to evacuate: 0. Started at: 2024-09-22T15:38:27Z UTC. Duration: 00:00:44. ``` Operation duration after changes: ``` $ frostfs-cli control shards evacuation start --endpoint s01.frostfs.devenv:8081 -c cfg.yml --id BkFt6hctFEChpSU1fT87Fi --await --container-worker-count 5 --object-worker-count 5 ... Shard evacuation has been completed. Shard IDs: BkFt6hctFEChpSU1fT87Fi. Evacuated 3588 objects out of 3588, failed to evacuate: 0, skipped: 0; evacuated 0 trees out of 0, failed to evacuate: 0. Started at: 2024-09-22T14:36:59Z UTC. Duration: 00:00:10. ``` Signed-off-by: Anton Nikiforov <an.nikiforov@yadro.com>
acid-ant force-pushed feat/evac-parallel from df2c5a42c4 to 15fba7a29c 2024-09-06 09:01:30 +00:00 Compare
acid-ant requested review from storage-core-committers 2024-09-06 09:01:41 +00:00
acid-ant requested review from storage-core-developers 2024-09-06 09:01:45 +00:00
aarifullin reviewed 2024-09-06 09:32:13 +00:00
@ -0,0 +2,4 @@
import "context"
type routineLimiter struct {
Member

Just for information, this will be confilcting with this PR change anyway :)

Just for information, this will be confilcting with [this](https://git.frostfs.info/TrueCloudLab/frostfs-node/pulls/1337/files#diff-02de30585d6e50435e21eeef0b676d692d342d00) PR change anyway :)
Author
Member

Oh, thanks. I'll rebase on master once it will be merged.

Oh, thanks. I'll rebase on master once it will be merged.
Author
Member

Rebased.

Rebased.
aarifullin marked this conversation as resolved
dstepanov-yadro requested changes 2024-09-06 10:45:17 +00:00
Dismissed
@ -288,2 +287,4 @@
}
limiter := newRoutineLimiter(e.cfg.evacuationWorkers)
eg, egCtx := errgroup.WithContext(ctx)

Use eg.SetLimit() instead of limiter: it does the same thing as limiter, but looks shorter.

Use `eg.SetLimit()` instead of limiter: it does the same thing as limiter, but looks shorter.
Author
Member

Thanks, updated to use eg.SetLimit().

Thanks, updated to use `eg.SetLimit()`.
acid-ant force-pushed feat/evac-parallel from 15fba7a29c to aa21a71b61 2024-09-09 08:53:08 +00:00 Compare
acid-ant requested review from dstepanov-yadro 2024-09-09 08:54:14 +00:00
fyrchik reviewed 2024-09-10 07:08:41 +00:00
@ -91,0 +100,4 @@
// EngineEvacuationWorkersCount returns value of "evacuation_workers_count" config parmeter from "storage" section.
func EngineEvacuationWorkersCount(c *config.Config) uint32 {
if v := config.Uint32Safe(c.Sub(subsection), "evacuation_workers_count"); v > 0 {
Owner

worker_count (we already have 4 such parameters), for workers_count only rebuild, will fix eventually.

`worker_count` (we already have 4 such parameters), for `workers_count` only rebuild, will fix eventually.
@ -220,2 +224,4 @@
log: &logger.Logger{Logger: zap.L()},
shardPoolSize: 20,
evacuationWorkers: 5,
Owner

Please, adjust the docs/storage-node-configuration.md.
Also, it is a bit difficult to comprehend what is the difference between these 2 types of worker_count.
I don't see myself ever being able to make an informed decision about this number -- it is either parallel or not.

If there are other opinions, I would like to hear :)

Please, adjust the `docs/storage-node-configuration.md`. Also, it is a bit difficult to comprehend what is the difference between these 2 types of `worker_count`. I don't see myself ever being able to make an informed decision about this number -- it is either parallel or not. If there are other opinions, I would like to hear :)
dstepanov-yadro approved these changes 2024-09-10 12:02:57 +00:00
Dismissed
Owner

@acid-ant Please, take a look at the data-race.

@acid-ant Please, take a look at the data-race.
acid-ant force-pushed feat/evac-parallel from aa21a71b61 to 5daff28e69 2024-09-18 09:17:05 +00:00 Compare
acid-ant dismissed dstepanov-yadro's review 2024-09-18 09:17:05 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

Author
Member

Add new method ListConcurrently to metabase API which allow iterating over it concurrently: creates routines to iterate over containers and objects concurrently.

Add new method `ListConcurrently` to `metabase` API which allow iterating over it concurrently: creates routines to iterate over containers and objects concurrently.
acid-ant changed title from engine: Evacuate object from shards concurrently to WIP: engine: Evacuate object from shards concurrently 2024-09-18 10:32:43 +00:00
acid-ant force-pushed feat/evac-parallel from 5daff28e69 to 0613c54bee 2024-09-18 12:02:32 +00:00 Compare
Author
Member

Fix evacuation tests.

Fix evacuation tests.
acid-ant changed title from WIP: engine: Evacuate object from shards concurrently to engine: Evacuate object from shards concurrently 2024-09-18 12:03:13 +00:00
acid-ant force-pushed feat/evac-parallel from 0613c54bee to af6139b981 2024-09-18 12:28:58 +00:00 Compare
Author
Member

Minor changes.

Minor changes.
aarifullin reviewed 2024-09-18 13:07:40 +00:00
@ -294,2 +291,2 @@
return err
}
eg.Go(func() error {
if err = e.evacuateShard(egCtx, shardID, prm, res, shards, shardsToEvacuate); err != nil {
Member

Let's consider shard1 and shard2. Should shard1's evacuation stop if shard2's evacuation has been failed? If it shouldn't, then egCtx can't be used as it cancells other jobs too

Let's consider `shard1` and `shard2`. Should `shard1`'s evacuation stop if `shard2`'s evacuation has been failed? If it **shouldn't**, then `egCtx` can't be used as it cancells other jobs too
Author
Member

Good question. I think process should stop faster. To indicate that we have a problematic situation with disks. That will allow us to rerun evacuation and exclude defected disk.

Good question. I think process should stop faster. To indicate that we have a problematic situation with disks. That will allow us to rerun evacuation and exclude defected disk.
aarifullin approved these changes 2024-09-19 08:13:30 +00:00
Dismissed
aarifullin left a comment
Member

OK with me

OK with me
acid-ant changed title from engine: Evacuate object from shards concurrently to WIP: engine: Evacuate object from shards concurrently 2024-09-19 10:21:47 +00:00
acid-ant force-pushed feat/evac-parallel from af6139b981 to eada9bd28b 2024-09-20 17:14:24 +00:00 Compare
acid-ant dismissed aarifullin's review 2024-09-20 17:14:25 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

acid-ant force-pushed feat/evac-parallel from eada9bd28b to 0a13e7d946 2024-09-22 15:44:23 +00:00 Compare
acid-ant force-pushed feat/evac-parallel from 0a13e7d946 to fafee24650 2024-09-22 15:45:15 +00:00 Compare
acid-ant requested review from storage-core-committers 2024-09-22 15:52:32 +00:00
acid-ant requested review from storage-core-developers 2024-09-22 15:52:33 +00:00
acid-ant force-pushed feat/evac-parallel from fafee24650 to 51e64c3101 2024-09-22 16:04:27 +00:00 Compare
acid-ant changed title from WIP: engine: Evacuate object from shards concurrently to engine: Evacuate object from shards concurrently 2024-09-22 17:37:48 +00:00
acid-ant force-pushed feat/evac-parallel from 4b52a38812 to 620793008d 2024-09-22 19:19:52 +00:00 Compare
dstepanov-yadro requested changes 2024-09-23 07:27:23 +00:00
Dismissed
@ -21,6 +21,9 @@ const (
noProgressFlag = "no-progress"
scopeFlag = "scope"
containerWorkerCountFlag = "container-worker-count"

I have doubts that this parameter is necessary. Can you test with a value of 1 and a value of 10, for example, for 100 containers?

I have doubts that this parameter is necessary. Can you test with a value of 1 and a value of 10, for example, for 100 containers?
Author
Member

It will be helpful for case when we need to evacuate containers with policy REP 1 only.
In a bad case, we need ~100ms to get info about container from neo-go.

For 1_000 containers, we need around 1m to iterate over containers with container-worker-count = 1.
For 1_000 containers, we need around 10s to iterate over containers with container-worker-count = 10.

For 10_000 containers, we need around 16m to iterate over containers with container-worker-count = 1.
For 10_000 containers, we need around 1m40s to iterate over containers with container-worker-count = 10.

Measured with unit test here.

It will be helpful for case when we need to evacuate containers with policy `REP 1` only. In a bad case, we need ~100ms to get info about container from `neo-go`. For 1_000 containers, we need around 1m to iterate over containers with `container-worker-count` = 1. For 1_000 containers, we need around 10s to iterate over containers with `container-worker-count` = 10. For 10_000 containers, we need around 16m to iterate over containers with `container-worker-count` = 1. For 10_000 containers, we need around 1m40s to iterate over containers with `container-worker-count` = 10. Measured with unit test [here]([url](https://git.frostfs.info/TrueCloudLab/frostfs-node/commit/2218e42a10518d6be4d4375003a0a0dbe8e348d6)).
dstepanov-yadro marked this conversation as resolved
@ -16,2 +16,4 @@
// process object PUT operations in a storage engine.
ShardPoolSizeDefault = 20
// EvacuationShardWorkerCountDefault is a default value of the count of shards
// evacuees concurrently.

Fix typo

Fix typo
dstepanov-yadro marked this conversation as resolved
@ -176,0 +171,4 @@
|-------------------------------------|-----------------------------------|---------------|------------------------------------------------------------------------------------------------------------------|
| `shard_pool_size` | `int` | `20` | Pool size for shard workers. Limits the amount of concurrent `PUT` operations on each shard. |
| `shard_ro_error_threshold` | `int` | `0` | Maximum amount of storage errors to encounter before shard automatically moves to `Degraded` or `ReadOnly` mode. |
| `evacuation_container_worker_count` | `int` | `10` | The count of the concurrent container evacuation workers. |

Adding these parameters to the configuration looks redundant. Let's add the default value to the frostfs-cli, and on the server check that the value passed in the RPC call is greater than zero?

Adding these parameters to the configuration looks redundant. Let's add the default value to the `frostfs-cli`, and on the server check that the value passed in the RPC call is greater than zero?
Author
Member

I disagree. With this approach, you need to configure once how you want to utilize resources for evacuation, instead of remember parameters for evacuation.

I disagree. With this approach, you need to configure once how you want to utilize resources for evacuation, instead of remember parameters for evacuation.

Then remove the same parameters from the RPC call? It looks strange if the service administrator has configured these parameters in the configuration file, and then he also makes a disk evacuation call, but with different parameters.

Then remove the same parameters from the RPC call? It looks strange if the service administrator has configured these parameters in the configuration file, and then he also makes a disk evacuation call, but with different parameters.
Author
Member

I think it is more flexible to have two ways to configure evacuation parameters instead of being tied to only one.

I think it is more flexible to have two ways to configure evacuation parameters instead of being tied to only one.

But this leads to a complication of the code and understanding. The evacuation is called by the service engineer or the server administrator. The configuration is also changed by the service engineer or the server administrator. The only place where similar behavior occurs is when changing the shard mode: this can be done through a frostfs-cli call or through a configuration change. But this was necessary for the correct disk replacement procedure, and not dictated by concern for convenience. By the way, as far as I remember, we have been trying to avoid such duplication for a very long time.
Anyway, if you're sure it's necessary, I'll accept it.

But this leads to a complication of the code and understanding. The evacuation is called by the service engineer or the server administrator. The configuration is also changed by the service engineer or the server administrator. The only place where similar behavior occurs is when changing the shard mode: this can be done through a `frostfs-cli` call or through a configuration change. But this was necessary for the correct disk replacement procedure, and not dictated by concern for convenience. By the way, as far as I remember, we have been trying to avoid such duplication for a very long time. Anyway, if you're sure it's necessary, I'll accept it.
Author
Member

Exactly, I also remember such activity by reducing the amount of settings available for the service engineer and server administrator. This beautiful morning, I'm ready to remove redundant settings from the changes.

Exactly, I also remember such activity by reducing the amount of settings available for the service engineer and server administrator. This beautiful morning, I'm ready to remove redundant settings from the changes.
dstepanov-yadro marked this conversation as resolved
@ -287,12 +296,44 @@ func (e *StorageEngine) evacuateShards(ctx context.Context, shardIDs []string, p
return err
}
egShard, _ := errgroup.WithContext(ctx)

Previously, if the evacuation of one shard failed, the evacuation of the next shard was not started. Now this behavior has changed

It is necessary either to clearly justify why this behavior has changed, or to do as it was before (use ctx).

Previously, if the evacuation of one shard failed, the evacuation of the next shard was not started. Now this behavior has changed It is necessary either to clearly justify why this behavior has changed, or to do as it was before (use ctx).
Author
Member

I don't see the reason why we need to keep order for shards - the amount of objects and containers per shard is unpredictable. If we start to process all shard at the same time, that will help us to better utilize node resources. The corner case here is shard with one container with policy REP 1.

I don't see the reason why we need to keep order for shards - the amount of objects and containers per shard is unpredictable. If we start to process all shard at the same time, that will help us to better utilize node resources. The corner case here is shard with one container with policy `REP 1`.
Author
Member

As for operation cancelation. The processing of other shards will fail too, because once we will fail on object evacuation, then it will be impossible, to run new routine for object evacuation for another container, also it will be impossible to process next container and that will stop the shard evacuation.

As for operation cancelation. The processing of other shards will fail too, because once we will fail on object evacuation, then it will be impossible, to run new routine for object evacuation for another container, also it will be impossible to process next container and that will stop the shard evacuation.

Could you explain: it will be impossible, to run new routine for object evacuation for another container
Where is such check?

Could you explain: `it will be impossible, to run new routine for object evacuation for another container` Where is such check?
Author
Member

Shame on me, I've taken a look closer at errorgroup.Go() and now I see my fault. Updated a bit to use one context for several error groups. Also added a unit test to check cancelation by error.

Shame on me, I've taken a look closer at `errorgroup.Go()` and now I see my fault. Updated a bit to use one context for several `error groups`. Also added a unit test to check cancelation by error.
dstepanov-yadro marked this conversation as resolved
acid-ant force-pushed feat/evac-parallel from 620793008d to b3e9139629 2024-09-23 21:05:22 +00:00 Compare
acid-ant force-pushed feat/evac-parallel from b3e9139629 to cf588ad386 2024-09-24 07:02:33 +00:00 Compare
acid-ant force-pushed feat/evac-parallel from cf588ad386 to 900b3b7ca1 2024-09-24 07:04:10 +00:00 Compare
Author
Member

Removed redundant parameters from node config.

Removed redundant parameters from `node` config.
dstepanov-yadro requested changes 2024-09-24 07:09:16 +00:00
Dismissed
@ -262,0 +317,4 @@
continue
}
bkt := tx.Bucket(name)

To check that cursor's current item is bucket just compare value with nil. Also root cursor always iterates over buckets. Also bkt := tx.Bucket(name) make another one cursor seek inside, so this looks redundant.

To check that cursor's current item is bucket just compare value with nil. Also root cursor always iterates over buckets. Also `bkt := tx.Bucket(name)` make another one cursor seek inside, so this looks redundant.
Author
Member

Thanks, fixed. That was copy-paste from listWithCursor.

Thanks, fixed. That was copy-paste from `listWithCursor`.
dstepanov-yadro marked this conversation as resolved
@ -167,0 +187,4 @@
))
defer span.End()
if s.GetMode().NoMetabase() {

It is necessary to use a read lock on the mutex for whole operation, not just to get mode. Or I missing something?

It is necessary to use a read lock on the mutex for whole operation, not just to get mode. Or I missing something?
Author
Member

You are right, RLock here and in IterateOverObjectsInContainer is required. Fixed.

You are right, `RLock` here and in `IterateOverObjectsInContainer` is required. Fixed.
dstepanov-yadro marked this conversation as resolved
acid-ant force-pushed feat/evac-parallel from 900b3b7ca1 to 53b810bec4 2024-09-24 07:35:41 +00:00 Compare
dstepanov-yadro reviewed 2024-09-24 07:42:52 +00:00
@ -371,6 +378,8 @@ func initControlStartEvacuationShardCmd() {
flags.String(scopeFlag, scopeAll, fmt.Sprintf("Evacuation scope; possible values: %s, %s, %s", scopeTrees, scopeObjects, scopeAll))
flags.Bool(awaitFlag, false, "Block execution until evacuation is completed")
flags.Bool(noProgressFlag, false, fmt.Sprintf("Print progress if %s provided", awaitFlag))
flags.Uint32(containerWorkerCountFlag, 0, "Count of concurrent container evacuation workers")

maybe set default values here, but not on the server side?

maybe set default values here, but not on the server side?
Author
Member

No, that will lead to resource leak on a server side - when another client may set it zero values.

No, that will lead to resource leak on a server side - when another client may set it zero values.

On server side just check that those values are greater than zero.

On server side just check that those values are greater than zero.
dstepanov-yadro marked this conversation as resolved
@ -312,0 +356,4 @@
func (e *StorageEngine) createErrorGroupsForEvacuation(ctx context.Context, prm EvacuateShardPrm) (
context.Context, context.CancelCauseFunc, *errgroup.Group, *errgroup.Group, *errgroup.Group,
) {
operationCtx, cancel := context.WithCancelCause(ctx)

It is ok, but looks redundant.
This code will do almost the same:

egShard, ctx := errgroup.WithContext(ctx)
egContainer, ctx := errgroup.WithContext(ctx)
egObject, ctx := errgroup.WithContext(ctx)
It is ok, but looks redundant. This code will do almost the same: ``` egShard, ctx := errgroup.WithContext(ctx) egContainer, ctx := errgroup.WithContext(ctx) egObject, ctx := errgroup.WithContext(ctx) ```
Author
Member

eg.Wait() cancels context. Routines which process containers and shards will end earlier than routines for objects. That is why they can't depend on context of each other.

`eg.Wait()` cancels context. Routines which process containers and shards will end earlier than routines for objects. That is why they can't depend on context of each other.
dstepanov-yadro marked this conversation as resolved
aarifullin approved these changes 2024-09-24 08:08:58 +00:00
Dismissed
acid-ant force-pushed feat/evac-parallel from 53b810bec4 to cc967c62c7 2024-09-24 08:35:50 +00:00 Compare
dstepanov-yadro reviewed 2024-09-24 08:43:17 +00:00
@ -64,0 +69,4 @@
// IterateOverObjectsInContainerPrm contains parameters for IterateOverObjectsInContainer operation.
type IterateOverObjectsInContainerPrm struct {
// BucketName container's bucket name

The sentence must end with a period

The sentence must end with a period
dstepanov-yadro reviewed 2024-09-24 08:50:18 +00:00
@ -262,0 +313,4 @@
if cidRaw == nil {
continue
}
if prefix != primaryPrefix && prefix != lockersPrefix && prefix != tombstonePrefix {

Replace with

for _, prefix := range []int{primaryPrefix, lockersPrefix, tombstonePrefix} {
   c := tx.Cursor()
   for name, _ := c.Seek(prefix); name != nil && bytes.HasPrefix(name, prefix); name, _ = c.Next() {
   ...
   }
}

to reduce seeks

Replace with ``` for _, prefix := range []int{primaryPrefix, lockersPrefix, tombstonePrefix} { c := tx.Cursor() for name, _ := c.Seek(prefix); name != nil && bytes.HasPrefix(name, prefix); name, _ = c.Next() { ... } } ``` to reduce seeks
Author
Member

Don't see benefits from this, could you point me where I'm wrong. Even if we start from the particular prefix, we need to iterate over all which follows them, and with your approach we need to do that for all other prefixes. In the current implementation, we need to do that iteration only once.

Don't see benefits from this, could you point me where I'm wrong. Even if we start from the particular prefix, we need to iterate over all which follows them, and with your approach we need to do that for all other prefixes. In the current implementation, we need to do that iteration only once.

Now iteration performed over all of buckets with all of prefixes: small, root, owner, user attributes and others.
After fix iteration will be performed only on particular prefixes: name != nil && bytes.HasPrefix(name, prefix);

Now iteration performed over all of buckets with all of prefixes: small, root, owner, user attributes and others. After fix iteration will be performed only on particular prefixes: `name != nil && bytes.HasPrefix(name, prefix); `
Author
Member

Gotcha, fixed.

Gotcha, fixed.
dstepanov-yadro marked this conversation as resolved
dstepanov-yadro reviewed 2024-09-24 08:51:32 +00:00
@ -262,0 +367,4 @@
k, v := c.First()
var containerID cid.ID
cidRaw, prefix := parseContainerIDWithPrefix(&containerID, prm.BucketName)

Could be done before transaction start

Could be done before transaction start
Author
Member

Done.

Done.
dstepanov-yadro marked this conversation as resolved
dstepanov-yadro reviewed 2024-09-24 08:52:16 +00:00
@ -262,0 +360,4 @@
}
func (db *DB) iterateOverObjectsInContainer(ctx context.Context, tx *bbolt.Tx, prm IterateOverObjectsInContainerPrm) error {
bkt := tx.Bucket(prm.BucketName)

Add check against nil

Add check against nil
Author
Member

Added.

Added.
dstepanov-yadro marked this conversation as resolved
acid-ant force-pushed feat/evac-parallel from cc967c62c7 to c8eae4ace6 2024-09-24 09:48:57 +00:00 Compare
acid-ant dismissed aarifullin's review 2024-09-24 09:48:57 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

acid-ant force-pushed feat/evac-parallel from c8eae4ace6 to 34e6a309c6 2024-09-24 09:50:33 +00:00 Compare
aarifullin approved these changes 2024-09-24 09:52:46 +00:00
dstepanov-yadro approved these changes 2024-09-24 10:59:24 +00:00
acid-ant merged commit 34e6a309c6 into master 2024-09-24 12:02:17 +00:00
acid-ant deleted branch feat/evac-parallel 2024-09-24 12:02:18 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
4 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: TrueCloudLab/frostfs-node#1356
No description provided.