engine, policer: Properly process error already removed
#813
Reference: TrueCloudLab/frostfs-node#813
Close #799
Here is a scenario:
REP 2 ...
1. policer on node 2 chooses node 3 to replicate the object to.
2. policer on node 2 tries to put the object to node 3. Node 3 chooses the first shard, gets the error already removed, and puts the object on the next shard.
We need to skip nodes which have already removed the object in the policer.
Also, skip the put on a shard when the error object already removed is encountered - that was the initial issue. Unfortunately, it is impossible to cover this line via a unit test.
Signed-off-by: Anton Nikiforov an.nikiforov@yadro.com
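To make the described behavior concrete, here is a minimal, self-contained sketch of the skip logic. The real code uses client.IsErrObjectAlreadyRemoved from the SDK; the stand-in error, node names, and headObject helper below are illustrative assumptions, not the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyRemoved stands in for the SDK's "object already removed" status;
// the real check in frostfs-node is client.IsErrObjectAlreadyRemoved(err).
var errAlreadyRemoved = errors.New("object already removed")

// headObject is a hypothetical stand-in for the remote request the policer
// issues to a candidate node while checking replica placement.
func headObject(node string) error {
	if node == "node-3" {
		// Node 3 already holds the tombstone for this object.
		return errAlreadyRemoved
	}
	return nil
}

func main() {
	shortage := 1
	for _, node := range []string{"node-3", "node-4"} {
		err := headObject(node)
		switch {
		case err == nil:
			// The node already holds a replica: nothing to replicate.
			shortage--
		case errors.Is(err, errAlreadyRemoved):
			// The object was removed on that node: replicating it further
			// only resurrects garbage, so skip the remaining nodes.
			fmt.Println("object already removed, skipping remaining nodes")
			return
		default:
			// Any other error: the node stays a replication candidate.
			fmt.Printf("node %s: %v\n", node, err)
		}
	}
	fmt.Println("remaining shortage:", shortage)
}
```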
Title changed from [#799] policer: Skip nodes which already removed object to policer: Skip nodes which already removed object.
Force-pushed from 1861bdd00a to 641070e159.
@ -145,3 +135,1 @@
zap.String("error", err.Error()),
)
} else {
if err == nil || client.IsErrObjectAlreadyRemoved(err) {
If the object is already removed on another node, we can just skip it on the current node, because it will be skipped on the next policer run.
Updated a bit: the error is now propagated to the upper layer.
Am I right in describing the scenario?
Because node 3 was offline at the moment of the object removal, it doesn't have the tombstone, and the replication finishes successfully.
In this case we will still replicate the tombstone later, so the problem is only one of speed?
Force-pushed from 641070e159 to e55ef24f5f.
Title changed from policer: Skip nodes which already removed object to WIP: policer: Skip nodes which already removed object.
Force-pushed from 43112edc08 to dfad3ce06a.
Title changed from WIP: policer: Skip nodes which already removed object to engine, policer: Properly process error already removed.
@ -99,3 +100,1 @@
func (e *StorageEngine) putToShard(ctx context.Context, sh hashedShard, ind int, pool util.WorkerPool, addr oid.Address, obj *objectSDK.Object) (bool, bool) {
var putSuccess, alreadyExists bool
// Third return value is true iff object already removed. In this case forth value also not nil.
Can we introduce putToShardRes now? The return values are named here, but it is easy to misorder them at the call site.
Yeah, I thought about this too. If others do not mind, I'll add it.
@TrueCloudLab/storage-core-developers @TrueCloudLab/storage-core-committers
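For illustration, such a putToShardRes could look roughly like the sketch below. Only putToShardRemoved is taken from a later revision of this diff; the other status names are assumptions.

```go
// putToShardStatus enumerates the possible outcomes of putting an object to
// a single shard. putToShardRemoved appears in a later revision of this PR;
// the remaining names are illustrative assumptions.
type putToShardStatus byte

const (
	putToShardUnknown putToShardStatus = iota
	putToShardSuccess
	putToShardExists
	putToShardRemoved
)

// putToShardRes bundles the status with the underlying error, so the caller
// cannot misorder a pair of anonymous booleans.
type putToShardRes struct {
	status putToShardStatus
	err    error
}
```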
@ -150,6 +152,15 @@ func (e *StorageEngine) putToShard(ctx context.Context, sh hashedShard, ind int,
zap.String("trace_id", tracingPkg.GetTraceID(ctx)))
return
}
if client.IsErrObjectAlreadyRemoved(err) {
What about a test? (2 shards, put -> delete -> put, the first shard should return AlreadyRemoved, the second should return NotFound)
As a sanity check, it should fail on master.
To reproduce this bug, we need three concurrent writes - two for the object and one for the tombstone. With your proposal, both master and the current branch will allow put -> delete -> put, and the object will always end up on the first shard. It is hard to emulate concurrent usage of the engine.
In the scope of this task we are not avoiding the race in general; we just react properly to one of the possible errors which may lead to duplication of the object across shards, because we check the existence of the object before iterating over the shards.
The same may happen when the disk is full or blinking - we do not cover these errors and just go to the next shard.
Also, that may happen when policers from two nodes start putting the object to a third one, because a disk was removed and the object is no longer on the node.
Force-pushed from dfad3ce06a to ef74f8da89.
Force-pushed from ef74f8da89 to 11a63eb931.
Force-pushed from 11a63eb931 to 28a53162af.
Force-pushed from 28a53162af to f63b23a382.
@ -91,0 +102,4 @@
return errPutShard
case putToShardRemoved:
return shRes.err
default:
Having return nil in the default branch is dangerous IMO, because if we add an error status and forget to add a case here, this is a data-loss scenario (OK to the client, but nothing is actually saved).
Agree, I've updated this switch.
@ -91,0 +103,4 @@
case putToShardRemoved:
return shRes.err
default:
return nil
How about returning nil only for an explicit enumeration of statuses? It is easy to miss a new status here and return OK without the object actually being put.
old comment, didn't see
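A sketch of the safer shape being suggested, reusing the hypothetical statuses above: only explicitly known statuses map to a result, and anything unexpected is treated as a put failure rather than a silent success.

```go
// Hypothetical caller of putToShard; shRes, errPutShard and the statuses
// follow the names used in this review.
switch shRes.status {
case putToShardSuccess:
	return nil
case putToShardExists, putToShardRemoved:
	return shRes.err
default:
	// No silent success for a status we did not anticipate: returning OK
	// to the client while nothing was stored would be a data-loss scenario.
	return errPutShard
}
```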
@ -150,12 +165,21 @@ func (e *StorageEngine) putToShard(ctx context.Context, sh hashedShard, ind int,
zap.String("trace_id", tracingPkg.GetTraceID(ctx)))
return
}
if client.IsErrObjectAlreadyRemoved(err) {
If we remove an object with frostfs-cli control drop-objects, what kind of error will we receive?
frostfs-cli control drop-objects calls shard.Inhume(), so we will get AlreadyRemoved here.
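For context, the branch under discussion could look roughly like this inside putToShard; the result variable name is an assumption, while the error check and the putToShardRemoved status come from the diff above.

```go
// Sketch only: shRes is an assumed name for the per-shard result.
if client.IsErrObjectAlreadyRemoved(err) {
	// The object has a tombstone on this shard: report the removed status
	// instead of silently falling through to the next shard, so the engine
	// does not resurrect the object elsewhere.
	shRes.status = putToShardRemoved
	shRes.err = err
	return
}
```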
@ -148,2 +138,4 @@
shortage--
checkedNodes.submitReplicaHolder(nodes[i])
} else {
if client.IsErrObjectAlreadyRemoved(err) {
Do we need the changes in the policer even after we have changed the logic in shard.put?
I think yes. The idea here is to fail faster and not process the other nodes in the queue.
Force-pushed from f63b23a382 to aee3e4b14d.
Force-pushed from aee3e4b14d to 062152a9d7.
@ -99,3 +115,1 @@
func (e *StorageEngine) putToShard(ctx context.Context, sh hashedShard, ind int, pool util.WorkerPool, addr oid.Address, obj *objectSDK.Object) (bool, bool) {
var putSuccess, alreadyExists bool
// Third return value is true iff object already removed. In this case forth value also not nil.
The comment is no longer accurate.
Updated, thanks.
Force-pushed from 062152a9d7 to 3a3f79c12d.
Force-pushed from 3a3f79c12d to 50e916a41d.
Need to think about the scenario below:
Looks like we need to continue processing nodes in the policer, because inside the policer we aren't able to check by whom the object was removed.
Force-pushed from 50e916a41d to ac24c3f1ac.
Updated the commit in the policer - now it contains only a refactoring of the processNodes method.
Force-pushed from ac24c3f1ac to a61f70c33b.
Force-pushed from a61f70c33b to 3686122b0d.
Force-pushed from 3686122b0d to 716e831f55.
@ -132,21 +132,21 @@ func (p *Policer) processNodes(ctx context.Context, requirements *placementRequi
cancel()
if client.IsErrObjectNotFound(err) {
What do we achieve with this refactoring?
We decided to keep this refactoring because err will be nil more often, so it is not necessary to check the type of the error first. In the normal situation, the node holds the replicas.
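A rough sketch of the reordering being discussed: the common happy path (err == nil) is handled first, and the error-type inspection only runs on the rarer failure path. submitReplicaCandidate is an illustrative helper name; the rest follows the diff.

```go
// Sketch of the reordered checks inside processNodes.
if err == nil {
	// Most HEAD requests succeed, so the happy path is handled first.
	shortage--
	checkedNodes.submitReplicaHolder(nodes[i])
	continue
}
if client.IsErrObjectNotFound(err) {
	// The node does not hold the object yet: keep it as a candidate
	// (helper name is illustrative).
	checkedNodes.submitReplicaCandidate(nodes[i])
	continue
}
// Any other error is logged, and the node is retried on a later run.
```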
Force-pushed from 716e831f55 to e73af83b6e.
Force-pushed from e73af83b6e to 4ce6ad711b.
Force-pushed from 4ce6ad711b to 6d39528d13.
Force-pushed from 6d39528d13 to db247442dd.