WIP: Process EC container for Get/GetRange/Head concurrently #1237

Closed
dstepanov-yadro wants to merge 6 commits from dstepanov-yadro/frostfs-node:fix/ec_get_failover into master

Now `get` requests for EC containers run concurrently, with a concurrency limit equal to the EC data count.
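A minimal sketch of what such a concurrency cap could look like, using `errgroup` with `SetLimit`; the identifiers (`dataCount`, `nodes`, `processNode`) are illustrative, not the actual PR code:

```go
// Illustrative only: process EC placement nodes with at most
// dataCount requests in flight at the same time.
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
)

func processECNodes(ctx context.Context, nodes []string, dataCount int) error {
	eg, egCtx := errgroup.WithContext(ctx)
	eg.SetLimit(dataCount) // concurrency limit equal to the EC data count

	for _, node := range nodes {
		node := node
		eg.Go(func() error {
			// Hypothetical per-node request (fetch an EC chunk or ECInfo).
			return processNode(egCtx, node)
		})
	}
	return eg.Wait()
}

func processNode(ctx context.Context, node string) error { return nil }
```

With `SetLimit`, `eg.Go` blocks once `dataCount` calls are already running, which is what bounds the parallelism.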
dstepanov-yadro force-pushed fix/ec_get_failover from 32afe410d3 to c9b1af0635 2024-07-10 07:38:30 +00:00 Compare
dstepanov-yadro changed title from WIP: Process EC container for Get/GetRange/Head concurrently to Process EC container for Get/GetRange/Head concurrently 2024-07-10 09:07:29 +00:00
dstepanov-yadro force-pushed fix/ec_get_failover from c9b1af0635 to 30c51fbd5d 2024-07-10 09:07:32 +00:00 Compare
dstepanov-yadro requested review from storage-core-committers 2024-07-10 09:07:40 +00:00
dstepanov-yadro requested review from storage-core-developers 2024-07-10 09:07:47 +00:00
dstepanov-yadro force-pushed fix/ec_get_failover from 31e94be0b6 to e33752e034 2024-07-10 09:18:00 +00:00 Compare
acid-ant reviewed 2024-07-10 09:28:44 +00:00
acid-ant reviewed 2024-07-10 09:34:58 +00:00
@ -94,0 +149,4 @@
}()
err := r.processECNodesRequests(ctx, nodes, dataCount)
cancel()
Member

You are canceling the context here and using it below for `r.writeCollectedObject(ctx)`, which looks wrong.
Author
Member

Right, fixed.
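For context, the problem being pointed out is the generic pattern below (a simplified sketch, not the actual diff): once `cancel()` is called, any later call that reuses the same context starts out already canceled.

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	_ = doRequests(ctx) // ctx is live here
	cancel()            // ctx becomes canceled here ...

	// ... so a later call that reuses the same ctx sees it as done.
	if err := writeResult(ctx); err != nil {
		fmt.Println(err) // prints "context canceled"
	}
}

func doRequests(ctx context.Context) error  { return nil }
func writeResult(ctx context.Context) error { return ctx.Err() }
```

The usual fix is to call `cancel()` only after the last use of that context, for example via `defer`.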
acid-ant marked this conversation as resolved
dstepanov-yadro force-pushed fix/ec_get_failover from e33752e034 to c4d310d669 2024-07-10 09:38:27 +00:00 Compare
aarifullin reviewed 2024-07-10 10:33:19 +00:00
@ -94,0 +170,4 @@
defer cancel()
var wg sync.WaitGroup
wg.Add(1)
go func() {
Member

[Optional]

What do you think: won't this approach cause goroutine spam if too many get-like requests are processed at the same time? Using `errgroup` would be safe, I guess.
Author
Member

I didn't catch the thought. `errgroup` creates the same goroutines.
Member

> `errgroup` creates the same goroutines.

Yes, it does, but with a fixed number of workers.

Nevermind, sorry. I incorrectly read this part:

```go
go func() {
	defer wg.Done()

	for {
		batch := traverser.Next()
		if len(batch) == 0 {
			r.log.Debug(logs.NoMoreNodesAbortPlacementIteration)
			close(nodes)
			return
		}
		for _, node := range batch {
			select {
			case <-ctx.Done():
				return
			case nodes <- node:
			}
		}
	}
}()
```

I mistakenly read this as goroutine creation inside the for-loop.
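For readers following the thread, the distinction under discussion is roughly the one below (illustrative names, not the PR code): a bare `go func()` per item spawns one goroutine per item, while `errgroup` with `SetLimit` caps how many run at once.

```go
package main

import "golang.org/x/sync/errgroup"

// unbounded starts one goroutine per item and does not wait for them;
// done per incoming request item, this is the "goroutine spam" concern.
func unbounded(items []int, handle func(int)) {
	for _, it := range items {
		it := it
		go handle(it)
	}
}

// bounded runs the same work through errgroup with a worker cap.
func bounded(items []int, handle func(int) error) error {
	var eg errgroup.Group
	eg.SetLimit(4) // at most 4 handlers run concurrently
	for _, it := range items {
		it := it
		eg.Go(func() error { return handle(it) })
	}
	return eg.Wait()
}

func main() {}
```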
aarifullin marked this conversation as resolved
fyrchik reviewed 2024-07-10 11:22:24 +00:00
@ -55,1 +65,4 @@
// In most cases, the value from the cache will be returned.
// Container appears in the cache when traverser generates.
cnr, err := r.containerSource.Get(r.address().Container())
Owner

Can we get it from the traverser directly?
Author
Member

Container is not part of the traverser, but the container is required to generate the traverser. Fixed: now the traverser generator returns the container.
@ -94,0 +129,4 @@
var errECInfo *objectSDK.ECInfoError
var errRemoved *apistatus.ObjectAlreadyRemoved
var errOutOfRange *apistatus.ObjectOutOfRange
var errSuccess *ecGetSuccessErr
Owner

What is the benefit of using a separate error type for successfully performed requests? Go has multiple return values; can we use them?
Author
Member

To cancel `errgroup`'s context and not run the other requests: if `eg.Go`'s argument returns an error, then errgroup cancels the context. Of course, it is possible to do the same thing without errgroup, but it would not be so simple. Also added a `ctx.Done` check.
Owner

It can be done with a special error, e.g. `errStopIteration` is what we use somewhere.
Author
Member

But `ecGetSuccessErr` is a special error used only to stop EC handling and pass the result object.
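The pattern being described can be sketched as follows (illustrative names, not the PR code): the first worker that assembles the object returns a dedicated "success" error from `eg.Go`, which makes `errgroup` cancel its context and stop the remaining requests; the caller then recovers the object with `errors.As`.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// ecGetSuccess is the "success as error" carrier for the assembled object.
type ecGetSuccess struct{ obj string }

func (e *ecGetSuccess) Error() string { return "ec get success" }

func fetch(ctx context.Context, nodes []string) (string, error) {
	eg, egCtx := errgroup.WithContext(ctx)
	for _, n := range nodes {
		n := n
		eg.Go(func() error {
			select {
			case <-egCtx.Done(): // another worker already succeeded
				return nil
			default:
			}
			// Returning a non-nil error cancels egCtx for the other workers.
			return &ecGetSuccess{obj: "object from " + n}
		})
	}

	err := eg.Wait()
	var success *ecGetSuccess
	if errors.As(err, &success) {
		return success.obj, nil
	}
	return "", err
}

func main() {
	obj, err := fetch(context.Background(), []string{"node1", "node2"})
	fmt.Println(obj, err)
}
```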
@ -94,0 +132,4 @@
var errSuccess *ecGetSuccessErr
switch {
case err == nil: // nil returns if all nodes failed or incomplete EC info received
Owner

`s/returns/is returned/`
Author
Member

done
fyrchik marked this conversation as resolved
@ -94,0 +209,4 @@
r.log.Debug(logs.GetCouldNotConstructRemoteNodeClient, zap.String("node_key", hex.EncodeToString(info.PublicKey())))
return err
}
obj, err := r.getRemote(egCtx, rs, info)
Owner

So for each node we call `getRemote` with the same parameters. Where in the code do we combine the chunks into the final object?
Author
Member

`func (a *assemblerec) retrieveParts(mainCtx context.Context, headOnly bool) []*objectSDK.Object`

https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/3bf6e6dde60f15f4f90caf44eda429e8f6269cc1/pkg/services/object/get/assemblerec.go#L122
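As background for the combining step referenced above, a rough sketch of EC reconstruction using `github.com/klauspost/reedsolomon` (the assembler goes through frostfs's own SDK wrappers; this snippet only illustrates the idea of rebuilding the payload from the collected parts):

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/reedsolomon"
)

func main() {
	const dataCount, parityCount = 2, 1
	payload := []byte("hello erasure coding")

	enc, _ := reedsolomon.New(dataCount, parityCount)
	shards, _ := enc.Split(payload) // dataCount+parityCount chunks
	_ = enc.Encode(shards)          // fill in the parity chunk

	shards[0] = nil // simulate a node that went down

	// Combine: reconstruct the missing chunk and join the payload back.
	_ = enc.Reconstruct(shards)
	var buf bytes.Buffer
	_ = enc.Join(&buf, shards, len(payload))
	fmt.Println(buf.String())
}
```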
Owner

I have realised where my confusion comes from: we have 2 steps:

1. Get ECInfo from enough nodes
2. Pull EC chunks

The "enough nodes" set is defined at step 1, even though nodes may go down at step 2. In this case we will fail to fetch the object? The situation _will_ happen during failover tests.
Author
Member

Right, if a node fails between step 1 and step 2, then such a request will fail.
@ -94,0 +230,4 @@
ecInfoGuard.Lock()
defer ecInfoGuard.Unlock()
r.infoEC = util.MergeECInfo(errECInfo.ECInfo(), r.infoEC)
if r.isRaw() {
Owner

Could you explain this line a bit? Why do we NOT return an error if the number of chunks is not equal to the expected one?
Author
Member

In case of a `HEAD` request with the `raw` flag, it is expected to get a raw EC info error with all chunks.
dstepanov-yadro force-pushed fix/ec_get_failover from c4d310d669 to d731be8011 2024-07-10 11:39:49 +00:00 Compare
dstepanov-yadro force-pushed fix/ec_get_failover from d731be8011 to 7f7d04882a 2024-07-10 11:44:56 +00:00 Compare
elebedeva reviewed 2024-07-10 12:02:01 +00:00
@ -54,2 +64,4 @@
}
// In most cases, the value from the cache will be returned.
// Container appears in the cache when traverser generates.
Member

Looks like a grammar mistake: `traverser generates` --> `traverser is generated`.
Author
Member

Fixed
elebedeva marked this conversation as resolved
dstepanov-yadro force-pushed fix/ec_get_failover from 99790c09a0 to e89f8bb6c7 2024-07-10 12:17:28 +00:00 Compare
dstepanov-yadro force-pushed fix/ec_get_failover from e89f8bb6c7 to eebf7b4bfb 2024-07-10 12:21:09 +00:00 Compare
aarifullin approved these changes 2024-07-10 12:55:30 +00:00
acid-ant reviewed 2024-07-10 13:21:39 +00:00
@ -65,19 +65,10 @@ func (r *request) processNode(ctx context.Context, info client.NodeInfo) bool {
mergeSplitInfo(r.splitInfo(), errSplitInfo.SplitInfo())
r.err = objectSDK.NewSplitInfoError(r.infoSplit)
case errors.As(err, &errECInfo):
Member

Looks like part of the code in the `default` section is no longer needed:

```
		if r.status == statusEC {
			// we need to continue getting another chunks from another nodes
...
```
Author
Member

done
acid-ant marked this conversation as resolved
dstepanov-yadro force-pushed fix/ec_get_failover from eebf7b4bfb to 5d36e847b0 2024-07-10 13:24:59 +00:00 Compare
acid-ant approved these changes 2024-07-10 14:15:36 +00:00
aarifullin approved these changes 2024-07-12 08:03:51 +00:00
dstepanov-yadro force-pushed fix/ec_get_failover from 5d36e847b0 to 906570789c 2024-07-12 08:44:28 +00:00 Compare
dstepanov-yadro force-pushed fix/ec_get_failover from 906570789c to 50b9dea0fb 2024-07-12 09:40:10 +00:00 Compare
achuprov approved these changes 2024-07-12 09:47:38 +00:00
aarifullin approved these changes 2024-07-12 12:52:35 +00:00
dstepanov-yadro changed title from Process EC container for Get/GetRange/Head concurrently to WIP: Process EC container for Get/GetRange/Head concurrently 2024-07-12 16:41:54 +00:00
Author
Member

Incorrect implementation, closing.
dstepanov-yadro closed this pull request 2024-07-15 05:22:40 +00:00
All checks were successful
Vulncheck / Vulncheck (pull_request) Successful in 4m16s
DCO action / DCO (pull_request) Successful in 4m3s
Build / Build Components (1.21) (pull_request) Successful in 5m25s
Build / Build Components (1.22) (pull_request) Successful in 5m33s
Tests and linters / gopls check (pull_request) Successful in 6m40s
Tests and linters / Staticcheck (pull_request) Successful in 7m32s
Tests and linters / Lint (pull_request) Successful in 8m20s
Pre-commit hooks / Pre-commit (pull_request) Successful in 11m15s
Tests and linters / Tests with -race (pull_request) Successful in 11m35s
Tests and linters / Tests (1.21) (pull_request) Successful in 11m50s
Tests and linters / Tests (1.22) (pull_request) Successful in 12m10s

Pull request closed

Reference: TrueCloudLab/frostfs-node#1237