Storage engine may not properly handle context-related errors #1709

Closed

opened 2025-04-07 10:15:49 +00:00 by a-savchuk · 0 comments

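When putting an object, the storage engine iterates over all the shards. If `shard.Put` returns an error, the engine decides, based on that error, whether it should stop or continue. See the related code snippet below (https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/c4f941a5f5217024ed8ba3cebda22d3518bd18f7/pkg/local_object_storage/engine/put.go#L154-L174):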

 	_, err = sh.Put(ctx, putPrm)
 	if err != nil {
 		if errors.Is(err, shard.ErrReadOnlyMode) || errors.Is(err, blobstor.ErrNoPlaceFound) ||
 			errors.Is(err, common.ErrReadOnly) || errors.Is(err, common.ErrNoSpace) {
 			e.log.Warn(ctx, logs.EngineCouldNotPutObjectToShard,
 				zap.Stringer("shard_id", sh.ID()),
 				zap.Error(err))
 			return
 		}
 		if client.IsErrObjectAlreadyRemoved(err) {
 			e.log.Warn(ctx, logs.EngineCouldNotPutObjectToShard,
 				zap.Stringer("shard_id", sh.ID()),
 				zap.Error(err))
 			res.status = putToShardRemoved
 			res.err = err
 			return
 		}
 		e.reportShardError(ctx, sh, "could not put object to shard", err, zap.Stringer("address", addr))
 		return
 	}

Now that the shard OPS limiter has been introduced, I think the following case is possible:

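- the operation exceeds its deadline while trying to put an object to some shard
- after that, the engine keeps iterating over the shards and gets the same error again and again because of the limiter, i.e. the limiter pushes a request and then immediately drops it (https://git.frostfs.info/TrueCloudLab/frostfs-qos/src/commit/b5ed0b6eff475ecaa61e1a33b5346f449806ee37/scheduling/mclock.go#L122-L136):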

 func (q *MClock) RequestArrival(ctx context.Context, tag string) (ReleaseFunc, error) {
 	req, release, err := q.pushRequest(tag)
 	if err != nil {
 		return nil, err
 	}
 	select {
 	case <-ctx.Done():
 		q.dropRequest(req)
 		return nil, ctx.Err()
 	case <-req.scheduled:
 		return release, nil
 	case <-req.canceled:
 		return nil, ErrMClockSchedulerClosed
 	}
 }

Can this behavior affect the performance of the entire operation?

Also, the engine now returns "can't put an object to any shard" on context cancellation and on deadline exceeded. Could it return context-related errors instead? If so, how would that affect services when they traverse nodes?
