WIP: Process EC container for Get/GetRange/Head concurrently #1237
Reference: TrueCloudLab/frostfs-node#1237
No description provided.
Now get requests for EC containers run concurrently, with a concurrency limit equal to the EC data count.
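A minimal sketch of that idea, assuming hypothetical node and fetchChunk names (errgroup is mentioned later in the review as the mechanism actually used): one task per candidate node, with at most dataCount chunk requests in flight.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// node and fetchChunk are placeholders for the real node info and RPC call.
type node struct{ id int }

func fetchChunk(ctx context.Context, n node) error {
	fmt.Println("fetching chunk from node", n.id) // real code would issue a GET here
	return nil
}

func getEC(ctx context.Context, nodes []node, dataCount int) error {
	eg, egCtx := errgroup.WithContext(ctx)
	eg.SetLimit(dataCount) // at most dataCount requests in flight at once
	for _, n := range nodes {
		n := n
		eg.Go(func() error { return fetchChunk(egCtx, n) })
	}
	return eg.Wait()
}

func main() {
	_ = getEC(context.Background(), []node{{1}, {2}, {3}, {4}}, 2)
}
```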
32afe410d3 to c9b1af0635
"WIP: Process EC container for Get/GetRange/Head concurrently" to "Process EC container for Get/GetRange/Head concurrently"
c9b1af0635 to 30c51fbd5d
31e94be0b6 to e33752e034
@ -94,0 +149,4 @@
}()
err := r.processECNodesRequests(ctx, nodes, dataCount)
cancel()
You are canceling the context here and using it below for r.writeCollectedObject(ctx), looks wrong.
Right, fixed.
e33752e034 to c4d310d669
@ -94,0 +170,4 @@
defer cancel()
var wg sync.WaitGroup
wg.Add(1)
go func() {
[Optional]
What do you think: won't this approach cause goroutine spam if too many get-like requests are processed at the same time? Using errgroup would be safe, I guess.
I didn't catch the thought. errgroup creates the same goroutines.
Yes, it does, but with a fixed number of workers.
Never mind, sorry. I incorrectly read this part: I mistook this goroutine for goroutine generation within a for-loop.
@ -55,1 +65,4 @@
// In most cases, the value from the cache will be returned.
// Container appears in the cache when traverser generates.
cnr, err := r.containerSource.Get(r.address().Container())
Can we get it from the traverser directly?
The container is not part of the traverser, but the container is required to generate the traverser.
Fixed: now the traverser generator returns the container.
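A hypothetical sketch of the change described above (traverserGenerator, containerSource and the types below are illustrative, not the actual frostfs-node API): the generator has to load the container anyway to build the traverser, so it can return both and spare the caller a second lookup.

```go
package object

// Illustrative types only; the real ones live in frostfs-node's object service.
type (
	Container struct{ /* placement policy, attributes, ... */ }
	Traverser struct{ /* iteration state over placement vectors */ }
)

type containerSource interface {
	Get(cnrID string) (*Container, error)
}

type traverserGenerator struct {
	containers containerSource
}

// GenerateTraverser builds the traverser and also returns the container it
// had to load on the way, so the caller avoids a second cache lookup.
func (g *traverserGenerator) GenerateTraverser(cnrID string) (*Traverser, *Container, error) {
	cnr, err := g.containers.Get(cnrID)
	if err != nil {
		return nil, nil, err
	}
	// placeholder for building the real traverser from the container's policy
	return &Traverser{}, cnr, nil
}
```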
@ -94,0 +129,4 @@
var errECInfo *objectSDK.ECInfoError
var errRemoved *apistatus.ObjectAlreadyRemoved
var errOutOfRange *apistatus.ObjectOutOfRange
var errSuccess *ecGetSuccessErr
What is the benefit of using a separate error type for successfully performed requests? Go has multiple value returns, can we use them?
To cancel errgroup's context and not run the other requests: if eg.Go's argument returns an error, then errgroup cancels the context. Of course it is possible to do the same thing without errgroup, but it would not be as simple. Also added a ctx.Done check.
It can be done with a special error, e.g. errStopIteration is what we use somewhere.
But ecGetSuccessErr is a special error used only to stop EC handling and pass the result object.
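A compact sketch of that pattern outside of frostfs specifics (successErr and fetchFromAny are illustrative names): the "success" error exists only so that errgroup cancels its context and stops scheduling the remaining node requests, and the caller unwraps it back into a result.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// successErr plays the role of ecGetSuccessErr: not a real failure, just a
// signal that the object is ready and remaining requests can be skipped.
type successErr struct{ obj string }

func (e *successErr) Error() string { return "success" }

func fetchFromAny(ctx context.Context, nodes []string) (string, error) {
	eg, egCtx := errgroup.WithContext(ctx)
	for _, n := range nodes {
		n := n
		eg.Go(func() error {
			select {
			case <-egCtx.Done(): // another worker already finished, skip this node
				return egCtx.Err()
			default:
			}
			// pretend this node returned the full object
			return &successErr{obj: "object from " + n}
		})
	}
	err := eg.Wait()
	var s *successErr
	if errors.As(err, &s) {
		return s.obj, nil // the "error" actually carries the result
	}
	return "", err
}

func main() {
	obj, err := fetchFromAny(context.Background(), []string{"node-1", "node-2", "node-3"})
	fmt.Println(obj, err)
}
```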
@ -94,0 +132,4 @@
var errSuccess *ecGetSuccessErr
switch {
case err == nil: // nil returns if all nodes failed or incomplete EC info received
s/returns/is returned/
done
@ -94,0 +209,4 @@
r.log.Debug(logs.GetCouldNotConstructRemoteNodeClient, zap.String("node_key", hex.EncodeToString(info.PublicKey())))
return err
}
obj, err := r.getRemote(egCtx, rs, info)
So for each node we call getRemote with the same parameters. Where is the place in the code where we combine the chunks into the final object?
func (a *assemblerec) retrieveParts(mainCtx context.Context, headOnly bool) []*objectSDK.Object {
I have realised where my confusion comes from: we have 2 steps.
The "enough nodes" set is defined at step 1, even though nodes may go down at step 2. In this case will we fail to fetch the object? The situation will happen during failover tests.
Right, if a node fails between step 1 and step 2, then such a request will fail.
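A schematic of the two steps discussed here, with hypothetical names and the chunk fetches simplified to a sequential loop: step 1 plans which nodes to ask based on the collected EC info, step 2 actually fetches the chunks, and a node that was alive at step 1 but gone at step 2 fails the whole assembly.

```go
package object

import (
	"context"
	"fmt"
)

// chunkLocation records, for one EC chunk, which node reported holding it
// (step 1 output, built from the merged EC info).
type chunkLocation struct {
	chunkIndex int
	nodeAddr   string
}

// planRetrieval is step 1: pick enough chunk holders from what the nodes reported.
func planRetrieval(locations []chunkLocation, dataCount int) ([]chunkLocation, error) {
	if len(locations) < dataCount {
		return nil, fmt.Errorf("not enough chunks: %d < %d", len(locations), dataCount)
	}
	return locations[:dataCount], nil
}

// retrieveParts is step 2: fetch the planned chunks. A node that went down
// after planning surfaces as an error here and fails the whole assembly,
// which is the window the failover tests hit.
func retrieveParts(ctx context.Context, plan []chunkLocation,
	fetch func(context.Context, chunkLocation) ([]byte, error)) ([][]byte, error) {
	parts := make([][]byte, 0, len(plan))
	for _, loc := range plan {
		p, err := fetch(ctx, loc)
		if err != nil {
			return nil, fmt.Errorf("chunk %d from %s: %w", loc.chunkIndex, loc.nodeAddr, err)
		}
		parts = append(parts, p)
	}
	return parts, nil
}
```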
@ -94,0 +230,4 @@
ecInfoGuard.Lock()
defer ecInfoGuard.Unlock()
r.infoEC = util.MergeECInfo(errECInfo.ECInfo(), r.infoEC)
if r.isRaw() {
Could you explain this line a bit? Why do we NOT return an error if the number of chunks is not equal to the expected one?
In case of a HEAD request with the raw flag, it is expected to get a raw EC info error with all chunks.
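A rough illustration of the merging behaviour (hypothetical types, not the util.MergeECInfo signature): each node only knows about its own chunks, so the handler keeps folding per-node EC info into one view, and for a raw HEAD request that merged chunk map is itself the expected answer rather than an error.

```go
package object

// ecInfo maps a chunk index to the node that reported holding that chunk.
type ecInfo map[int]string

// mergeECInfo folds one node's report into the accumulated view; chunks
// learned from earlier reports are never erased by later ones.
func mergeECInfo(fresh, acc ecInfo) ecInfo {
	if acc == nil {
		acc = make(ecInfo, len(fresh))
	}
	for idx, node := range fresh {
		if _, ok := acc[idx]; !ok {
			acc[idx] = node
		}
	}
	return acc
}
```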
c4d310d669 to d731be8011
d731be8011 to 7f7d04882a
@ -54,2 +64,4 @@
}
// In most cases, the value from the cache will be returned.
// Container appears in the cache when traverser generates.
Looks like a grammar mistake: "traverser generates" --> "traverser is generated".
Fixed
99790c09a0 to e89f8bb6c7
e89f8bb6c7 to eebf7b4bfb
@ -65,19 +65,10 @@ func (r *request) processNode(ctx context.Context, info client.NodeInfo) bool {
mergeSplitInfo(r.splitInfo(), errSplitInfo.SplitInfo())
r.err = objectSDK.NewSplitInfoError(r.infoSplit)
case errors.As(err, &errECInfo):
Looks like part of the code in the default section is no longer needed.
done
eebf7b4bfb to 5d36e847b0
5d36e847b0 to 906570789c
906570789c to 50b9dea0fb
"Process EC container for Get/GetRange/Head concurrently" to "WIP: Process EC container for Get/GetRange/Head concurrently"
Incorrect implementation, close.
Pull request closed