Blobovniczas initialization won't return an explicit error #1624

Closed
opened 2025-01-30 12:50:26 +00:00 by aarifullin · 6 comments
Member

Expected Behavior

initializeDBs returns an error with a real failure reason instead of context canceled

Current Behavior

When BlobStor initializes Blobovniczas, it gets context canceled instead of an explicit error. This happens because errGroup cancels the context when it gets error from some invocation and propagates it down.
So, we can't find out the real problem during after failed initialization.

янв 30 10:52:13 d1backup2 frostfs-node[3603203]: 2025-01-30T10:52:13.168Z        error        engine/control.go:106        could not initialize shard, closing and skipping        {"id": "EqiVdEiux21FoankwWc5N8", "error": "could not initialize *blobstor.BlobStor: failure on blobovnicza initialization stage: context canceled"}
янв 30 10:52:13 d1backup2 frostfs-node[3603203]: 2025-01-30T10:52:13.167Z        error        engine/control.go:106        could not initialize shard, closing and skipping        {"id": "HHPDUzs6dDJHBaWreakfbA", "error": "could not initialize *blobstor.BlobStor: failure on blobovnicza initialization stage: context canceled"}
янв 30 10:52:13 d1backup2 frostfs-node[3603203]: 2025-01-30T10:52:13.167Z        error        engine/control.go:106        could not initialize shard, closing and skipping        {"id": "5Z9L7z3VUgvJx5TUumDj6i", "error": "could not initialize *blobstor.BlobStor: failure on blobovnicza initialization stage: context canceled"}
янв 30 10:52:13 d1backup2 frostfs-node[3603203]: 2025-01-30T10:52:13.167Z        error        engine/control.go:106        could not initialize shard, closing and skipping        {"id": "W7DA9fuY4E6YkyJqhaQgUv", "error": "could not initialize *blobstor.BlobStor: failure on blobovnicza initialization stage: context canceled"}
янв 30 10:52:13 d1backup2 frostfs-node[3603203]: 2025-01-30T10:52:13.167Z        error        engine/control.go:106        could not initialize shard, closing and skipping        {"id": "TJg2bhUYFAShnzB9kTUm81", "error": "could not initialize *blobstor.BlobStor: failure on blobovnicza initialization stage: context canceled"}

Possible Solution

Send err to a channel in iterateIncompletedRebuildDBPaths

    errChan := make(chan error, 1)
    
    err = b.iterateExistingDBPaths(egCtx, func(p string) (bool, error) {
        visited[p] = struct{}{}
        eg.Go(func() error {
            shBlz := b.getBlobovniczaWithoutCaching(p)
            blz, err := shBlz.Open()
            if err != nil {
                select {
                case errChan <- err:
                default:
                }
                return err
            }
            defer shBlz.Close()

            moveInfo, err := blz.ListMoveInfo(egCtx)
            if err != nil {
                select {
                case errChan <- err:
                default:
                }
                return err
            }
            for _, move := range moveInfo {
                b.deleteProtectedObjects.Add(move.Address)
            }

            b.log.Debug(logs.BlobovniczatreeBlobovniczaSuccessfullyInitializedClosing, zap.String("id", p))
            return nil
        })
        return false, nil
    })

    if err := eg.Wait(); err != nil {
        select {
        case firstErr := <-errChan:
            return firstErr
        default:
            return err
        }
    }

Regression

No

<!-- Provide a general summary of the issue in the Title above --> ## Expected Behavior `initializeDBs` returns an error with a real failure reason instead of `context canceled` ## Current Behavior When `BlobStor` initializes `Blobovniczas`, it gets `context canceled` instead of an explicit error. This happens because `errGroup` cancels the context when it gets error from some invocation and propagates it down. So, we can't find out the real problem during after failed initialization. ``` янв 30 10:52:13 d1backup2 frostfs-node[3603203]: 2025-01-30T10:52:13.168Z error engine/control.go:106 could not initialize shard, closing and skipping {"id": "EqiVdEiux21FoankwWc5N8", "error": "could not initialize *blobstor.BlobStor: failure on blobovnicza initialization stage: context canceled"} янв 30 10:52:13 d1backup2 frostfs-node[3603203]: 2025-01-30T10:52:13.167Z error engine/control.go:106 could not initialize shard, closing and skipping {"id": "HHPDUzs6dDJHBaWreakfbA", "error": "could not initialize *blobstor.BlobStor: failure on blobovnicza initialization stage: context canceled"} янв 30 10:52:13 d1backup2 frostfs-node[3603203]: 2025-01-30T10:52:13.167Z error engine/control.go:106 could not initialize shard, closing and skipping {"id": "5Z9L7z3VUgvJx5TUumDj6i", "error": "could not initialize *blobstor.BlobStor: failure on blobovnicza initialization stage: context canceled"} янв 30 10:52:13 d1backup2 frostfs-node[3603203]: 2025-01-30T10:52:13.167Z error engine/control.go:106 could not initialize shard, closing and skipping {"id": "W7DA9fuY4E6YkyJqhaQgUv", "error": "could not initialize *blobstor.BlobStor: failure on blobovnicza initialization stage: context canceled"} янв 30 10:52:13 d1backup2 frostfs-node[3603203]: 2025-01-30T10:52:13.167Z error engine/control.go:106 could not initialize shard, closing and skipping {"id": "TJg2bhUYFAShnzB9kTUm81", "error": "could not initialize *blobstor.BlobStor: failure on blobovnicza initialization stage: context canceled"} ``` ## Possible Solution Send err to a channel in [iterateIncompletedRebuildDBPaths](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/branch/master/pkg/local_object_storage/blobstor/blobovniczatree/control.go#L45-L67) ```go errChan := make(chan error, 1) err = b.iterateExistingDBPaths(egCtx, func(p string) (bool, error) { visited[p] = struct{}{} eg.Go(func() error { shBlz := b.getBlobovniczaWithoutCaching(p) blz, err := shBlz.Open() if err != nil { select { case errChan <- err: default: } return err } defer shBlz.Close() moveInfo, err := blz.ListMoveInfo(egCtx) if err != nil { select { case errChan <- err: default: } return err } for _, move := range moveInfo { b.deleteProtectedObjects.Add(move.Address) } b.log.Debug(logs.BlobovniczatreeBlobovniczaSuccessfullyInitializedClosing, zap.String("id", p)) return nil }) return false, nil }) if err := eg.Wait(); err != nil { select { case firstErr := <-errChan: return firstErr default: return err } } ``` ## Regression No
aarifullin added the
bug
triage
labels 2025-01-30 12:50:26 +00:00
aarifullin added the
good first issue
label 2025-01-30 12:55:53 +00:00
Owner

I believe eg.Wait() should return exactly the first error it has got, no?

I believe `eg.Wait()` should return exactly the first error it has got, no?
Member

Can context.Cause help?

Can [`context.Cause`](https://pkg.go.dev/context#Cause) help?
Author
Member

I believe eg.Wait() should return exactly the first error it has got, no?

We'll check soon

> I believe `eg.Wait()` should return exactly the first error it has got, no? We'll check soon

I think the problem is that b.iterateExistingDBPaths(egCtx, func(p string) (bool, error) uses errgroup's context. So when errgroup fails with an error, egCtx is canceled, but iterateExistingDBPaths fails before eg.Wait.

I think the problem is that `b.iterateExistingDBPaths(egCtx, func(p string) (bool, error)` uses errgroup's context. So when errgroup fails with an error, `egCtx` is canceled, but `iterateExistingDBPaths` fails before `eg.Wait`.
Author
Member

I think the problem is that b.iterateExistingDBPaths(egCtx, func(p string) (bool, error) uses errgroup's context. So when errgroup fails with an error, egCtx is canceled, but iterateExistingDBPaths fails before eg.Wait.

Yeah. iterateExistingDBPaths looks bizarre a bit. It may not launch eg.Go at all if some internal error occurs and _ = eg.Wait() is being ignored. So, that's why I suggested to introduce the error channel

> I think the problem is that `b.iterateExistingDBPaths(egCtx, func(p string) (bool, error)` uses errgroup's context. So when errgroup fails with an error, `egCtx` is canceled, but `iterateExistingDBPaths` fails before `eg.Wait`. Yeah. `iterateExistingDBPaths` looks bizarre a bit. It may not launch `eg.Go` at all if some internal error occurs and `_ = eg.Wait()` is being ignored. So, that's why I suggested to introduce the error channel

Or just move iterateExistingDBPaths to eg.Go

Or just move `iterateExistingDBPaths` to `eg.Go`
Sign in to join this conversation.
No milestone
No project
No assignees
4 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: TrueCloudLab/frostfs-node#1624
No description provided.