Storage engine may not properly handle context-related errors #1709

Closed
opened 2025-04-07 10:15:49 +00:00 by a-savchuk · 0 comments
Member

When putting an object, the storage engine iterates over all the shards. If shard.Put returns an error, the engine decides based on that error whether to stop or to continue with the next shard. See the related code snippet below:

_, err = sh.Put(ctx, putPrm)
if err != nil {
    if errors.Is(err, shard.ErrReadOnlyMode) || errors.Is(err, blobstor.ErrNoPlaceFound) ||
        errors.Is(err, common.ErrReadOnly) || errors.Is(err, common.ErrNoSpace) {
        e.log.Warn(ctx, logs.EngineCouldNotPutObjectToShard,
            zap.Stringer("shard_id", sh.ID()),
            zap.Error(err))
        return
    }
    if client.IsErrObjectAlreadyRemoved(err) {
        e.log.Warn(ctx, logs.EngineCouldNotPutObjectToShard,
            zap.Stringer("shard_id", sh.ID()),
            zap.Error(err))
        res.status = putToShardRemoved
        res.err = err
        return
    }
    e.reportShardError(ctx, sh, "could not put object to shard", err, zap.Stringer("address", addr))
    return
}

After the shard OPS limiter was introduced, I think the following case is possible:

  • the operation exceeds its deadline while trying to put an object to some shard
  • after that, the engine keeps iterating over the shards and gets the same error again and again because of the limiter, i.e. the limiter pushes a request and then immediately drops it

    func (q *MClock) RequestArrival(ctx context.Context, tag string) (ReleaseFunc, error) {
        req, release, err := q.pushRequest(tag)
        if err != nil {
            return nil, err
        }
        select {
        case <-ctx.Done():
            q.dropRequest(req)
            return nil, ctx.Err()
        case <-req.scheduled:
            return release, nil
        case <-req.canceled:
            return nil, ErrMClockSchedulerClosed
        }
    }

Can this approach affect the performance of the entire operation?

Also, the engine now returns "can't put an object to any shard" on context cancellation and on deadline exceeded. Can it return context-related errors? If so, how could this affect services when they traverse nodes?

Referenced code:

  • engine put: https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/c4f941a5f5217024ed8ba3cebda22d3518bd18f7/pkg/local_object_storage/engine/put.go#L154-L174
  • limiter: https://git.frostfs.info/TrueCloudLab/frostfs-qos/src/commit/b5ed0b6eff475ecaa61e1a33b5346f449806ee37/scheduling/mclock.go#L122-L136
a-savchuk added the discussion, frostfs-node, triage, perfomance labels 2025-04-07 10:15:49 +00:00
dstepanov-yadro self-assigned this 2025-04-18 12:53:20 +00:00
Reference: TrueCloudLab/frostfs-node#1709