Object loss: (not so) redundant local copy removed on all machines #92

Closed
opened 2023-03-08 23:17:36 +00:00 by snegurochka · 1 comment

Original issue: https://github.com/nspcc-dev/neofs-node/issues/2267

## Expected Behavior

An object storage system should store objects: you put them in and then get them back at any time until you delete them.

Current Behavior

So we have a four-node network with Hw57cmN31gCrqyEyKL5km31TFYETzQa3qk8DNECk6a4H container using this policy:

```
REP 2 IN X
CBF 2
SELECT 2 FROM F AS X
FILTER Deployed EQ NSPCC AS F
```
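
In this policy, `REP 2 IN X` requires two replicas of every object on nodes chosen by selector `X`, which picks two nodes from filter `F` (nodes whose `Deployed` attribute equals `NSPCC`); `CBF 2` is the container backup factor. As a minimal sketch (not part of the report), such a policy can be parsed with the neofs-sdk-go `netmap` package, assuming its `DecodeString` and `NumberOfReplicas` methods; the frostfs fork of the SDK is expected to expose the same API:

```go
package main

import (
	"fmt"
	"log"

	"github.com/nspcc-dev/neofs-sdk-go/netmap"
)

func main() {
	const policyText = `REP 2 IN X
CBF 2
SELECT 2 FROM F AS X
FILTER Deployed EQ NSPCC AS F`

	// Parse the textual placement policy into its structured form.
	var pp netmap.PlacementPolicy
	if err := pp.DecodeString(policyText); err != nil {
		log.Fatalf("invalid placement policy: %v", err)
	}

	// One replica descriptor is expected here: "REP 2 IN X", i.e. every
	// object must end up with 2 copies among the nodes picked by selector X.
	fmt.Println("replica descriptors:", pp.NumberOfReplicas())
}
```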

Object `AV3Z7kpn8hxnaWWB2QRZUF9x8YhnyuB5ntC5fX6PRhz1` was uploaded into this container quite a long time ago. It was stored on nodes 3 and 4 (there were some movements before the incident, but they are not relevant) until this happened:

```
Feb 15 17:25:49 node3 neofs-node[9352]: 2023-02-15T17:25:49.912Z        error        replicator/process.go:62        could not replicate object        {"component": "Object Replicator", "node": "029538f2e2de2beff5a380e672732461ccac0f1c6c255fc2c364dfc0356f7f5bd3", "object": "Hw57cmN31gCrqyEyKL5km31TFYETzQa3qk8DNECk6a4H/AV3Z7kpn8hxnaWWB2QRZUF9x8YhnyuB5ntC5fX6PRhz1", "error": "(*putsvc.RemoteSender) could not send object: (*putsvc.remoteTarget) could not put object to [/dns4/st1.t5.fs.neo.org/tcp/8080]: write object via client: status: code = 2048 message = access to object operation denied"}
Feb 15 17:25:50 node3 neofs-node[9352]: 2023-02-15T17:25:50.139Z        error        replicator/process.go:62        could not replicate object        {"component": "Object Replicator", "node": "024cda0e7f60284295101465fc47e8a324da994b1cb39ba56825550025c9e9598b", "object": "Hw57cmN31gCrqyEyKL5km31TFYETzQa3qk8DNECk6a4H/AV3Z7kpn8hxnaWWB2QRZUF9x8YhnyuB5ntC5fX6PRhz1", "error": "(*putsvc.RemoteSender) could not send object: (*putsvc.remoteTarget) could not put object to [/dns4/st2.t5.fs.neo.org/tcp/8080]: write object via client: status: code = 2048 message = access to object operation denied"}
Feb 15 17:25:50 node3 neofs-node[9352]: 2023-02-15T17:25:50.163Z        info        policer/check.go:129        redundant local object copy detected        {"component": "Object Policer", "object": "Hw57cmN31gCrqyEyKL5km31TFYETzQa3qk8DNECk6a4H/AV3Z7kpn8hxnaWWB2QRZUF9x8YhnyuB5ntC5fX6PRhz1"}
Feb 15 17:25:50 node3 neofs-node[9352]: 2023-02-15T17:25:50.164Z        info        log/log.go:13        local object storage operation        {"shard_id": "Bcd5yGWeVfJgj2iKLdPiQC", "address": "Hw57cmN31gCrqyEyKL5km31TFYETzQa3qk8DNECk6a4H/AV3Z7kpn8hxnaWWB2QRZUF9x8YhnyuB5ntC5fX6PRhz1", "op": "db DELETE"}
```

and

```
Feb 15 17:25:51 node4 neofs-node[9802]: 2023-02-15T17:25:51.424Z        error        replicator/process.go:62        could not replicate object        {"component": "Object Replicator", "node": "029538f2e2de2beff5a380e672732461ccac0f1c6c255fc2c364dfc0356f7f5bd3", "object": "Hw57cmN31gCrqyEyKL5km31TFYETzQa3qk8DNECk6a4H/AV3Z7kpn8hxnaWWB2QRZUF9x8YhnyuB5ntC5fX6PRhz1", "error": "(*putsvc.RemoteSender) could not send object: (*putsvc.remoteTarget) could not put object to [/dns4/st1.t5.fs.neo.org/tcp/8080]: write object via client: status: code = 2048 message = access to object operation denied"}
Feb 15 17:25:51 node4 neofs-node[9802]: 2023-02-15T17:25:51.603Z        error        replicator/process.go:62        could not replicate object        {"component": "Object Replicator", "node": "024cda0e7f60284295101465fc47e8a324da994b1cb39ba56825550025c9e9598b", "object": "Hw57cmN31gCrqyEyKL5km31TFYETzQa3qk8DNECk6a4H/AV3Z7kpn8hxnaWWB2QRZUF9x8YhnyuB5ntC5fX6PRhz1", "error": "(*putsvc.RemoteSender) could not send object: (*putsvc.remoteTarget) could not put object to [/dns4/st2.t5.fs.neo.org/tcp/8080]: write object via client: status: code = 2048 message = access to object operation denied"}
Feb 15 17:25:51 node4 neofs-node[9802]: 2023-02-15T17:25:51.604Z        info        policer/check.go:129        redundant local object copy detected        {"component": "Object Policer", "object": "Hw57cmN31gCrqyEyKL5km31TFYETzQa3qk8DNECk6a4H/AV3Z7kpn8hxnaWWB2QRZUF9x8YhnyuB5ntC5fX6PRhz1"}
Feb 15 17:25:51 node4 neofs-node[9802]: 2023-02-15T17:25:51.605Z        info        log/log.go:13        local object storage operation        {"shard_id": "2xcLzuvvNzKg8jELk1mn4q", "address": "Hw57cmN31gCrqyEyKL5km31TFYETzQa3qk8DNECk6a4H/AV3Z7kpn8hxnaWWB2QRZUF9x8YhnyuB5ntC5fX6PRhz1", "op": "db DELETE"}
Feb 15 17:26:35 node4 neofs-node[9802]: 2023-02-15T17:26:35.502Z        info        log/log.go:13        local object storage operation        {"shard_id": "2xcLzuvvNzKg8jELk1mn4q", "address": "Hw57cmN31gCrqyEyKL5km31TFYETzQa3qk8DNECk6a4H/AV3Z7kpn8hxnaWWB2QRZUF9x8YhnyuB5ntC5fX6PRhz1", "op": "metabase DELETE"}
```

Nodes 3 and 4 (which hold the object) decide to move it to nodes 1 and 2 at around the same time. Both fail to do so (the exact reason is not important; replication can fail for any number of reasons), yet both then delete their local copies. The object is gone. Forever.
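
Judging by the order of the log lines, each holder decided on redundancy from placement membership alone: once nodes 3 and 4 were no longer among the object's designated holders, their copies became "redundant", and the failed replication PUTs right above did not block the removal. A rough reconstruction of that decision (hypothetical names, not the actual policer code):

```go
// Hypothetical reconstruction of the problematic check; the names are
// illustrative and do not mirror the real frostfs-node policer.
func localCopyIsRedundant(localNodeInPlacement bool, confirmedRemoteCopies int) bool {
	// Observed behavior: only placement membership matters. The failed
	// replications ("access to object operation denied" above) mean
	// confirmedRemoteCopies is effectively 0, yet the copy is still
	// declared redundant. Run concurrently on nodes 3 and 4, both then
	// delete their copies and the object is lost.
	_ = confirmedRemoteCopies // ignored by the buggy logic
	return !localNodeInPlacement
}
```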

## Possible Solution

It looks like something is wrong in the logic that is supposed to ensure the proper number of copies exists before the local one is deleted.
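
One possible shape of the fix, sketched below with purely hypothetical helpers (`headObjectFunc`, `designatedHolders`, `requiredCopies` are illustrative, not the actual frostfs-node API): treat the local copy as redundant only after confirming that the required number of replicas already exists on the designated nodes, and never count a node whose replication has just failed.

```go
package policer

import "context"

// headObjectFunc reports whether the given node currently holds the object;
// node and objectAddress are opaque identifiers in this sketch.
type headObjectFunc func(ctx context.Context, node, objectAddress string) error

// canDropLocalCopy is a hypothetical, safer redundancy check: the local copy
// may only be removed once enough replicas are *confirmed* to exist on the
// nodes designated by the placement policy.
func canDropLocalCopy(ctx context.Context, objectAddress string,
	designatedHolders []string, requiredCopies int, head headObjectFunc) bool {

	confirmed := 0
	for _, node := range designatedHolders {
		// Count a replica only if the designated node provably holds the
		// object right now; a replication attempt that has just failed
		// contributes nothing.
		if head(ctx, node, objectAddress) == nil {
			confirmed++
		}
	}
	return confirmed >= requiredCopies
}
```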

## Context

Yeah, it's T5 testnet.

## Your Environment

Node version 0.34.0.

fyrchik added this to the v0.37.0 milestone 2023-05-18 08:48:01 +00:00
fyrchik added the bug label 2023-05-18 08:48:05 +00:00

Need to verify that all changes from the neofs-node have been picked.

ale64bit was assigned by fyrchik 2023-06-26 10:06:41 +00:00