Storage engine may not properly handle context-related errors #1709
Reference: TrueCloudLab/frostfs-node#1709
When trying to put an object, the storage engine iterates over all the shards. If shard.Put returns an error, the engine decides, based on that error, whether it should stop or not. See the related code snippet below:
_, err = sh.Put(ctx, putPrm)
if err != nil {
	if errors.Is(err, shard.ErrReadOnlyMode) || errors.Is(err, blobstor.ErrNoPlaceFound) ||
		errors.Is(err, common.ErrReadOnly) || errors.Is(err, common.ErrNoSpace) {
		e.log.Warn(ctx, logs.EngineCouldNotPutObjectToShard,
			zap.Stringer("shard_id", sh.ID()),
			zap.Error(err))
		return
	}
	if client.IsErrObjectAlreadyRemoved(err) {
		e.log.Warn(ctx, logs.EngineCouldNotPutObjectToShard,
			zap.Stringer("shard_id", sh.ID()),
			zap.Error(err))
		res.status = putToShardRemoved
		res.err = err
		return
	}
	e.reportShardError(ctx, sh, "could not put object to shard", err, zap.Stringer("address", addr))
	return
}
After the shard OPS limiter was introduced, I think the following case is possible:
func (q *MClock) RequestArrival(ctx context.Context, tag string) (ReleaseFunc, error) {
	req, release, err := q.pushRequest(tag)
	if err != nil {
		return nil, err
	}
	select {
	case <-ctx.Done():
		q.dropRequest(req)
		return nil, ctx.Err()
	case <-req.scheduled:
		return release, nil
	case <-req.canceled:
		return nil, ErrMClockSchedulerClosed
	}
}
Can this approach affect the performance of the entire operation?
Also, the engine now returns "can't put an object to any shard" on context cancellation and on deadline exceeded. Can it return context-related errors? If so, how would this affect services when they traverse nodes?