Fix EC put when some node is off #1233

Merged
fyrchik merged 2 commits from dstepanov-yadro/frostfs-node:fix/ec_put_node_off into master 2024-07-09 07:54:31 +00:00

Before:
In case of EC2.1 and 4 nodes frostfs-node save3 3 parts only to index specific nodes. If some node is down or off, then put fails with error.

After:
If some node is down or off, then frostfs-node tries first to save part to any node that doesn't contain any EC part, then tries to save part to any node.

Before: In case of EC2.1 and 4 nodes frostfs-node save3 3 parts only to index specific nodes. If some node is down or off, then put fails with error. After: If some node is down or off, then frostfs-node tries first to save part to any node that doesn't contain any EC part, then tries to save part to any node.
dstepanov-yadro force-pushed fix/ec_put_node_off from aa1050bbae to af40d73421 2024-07-08 08:45:32 +00:00 Compare
dstepanov-yadro requested review from storage-core-committers 2024-07-08 08:52:52 +00:00
dstepanov-yadro requested review from storage-core-developers 2024-07-08 08:53:05 +00:00
dstepanov-yadro force-pushed fix/ec_put_node_off from af40d73421 to 4752398532 2024-07-08 09:02:28 +00:00 Compare
fyrchik requested changes 2024-07-08 14:41:24 +00:00
@ -158,9 +160,8 @@ func (e *ecWriter) writeECPart(ctx context.Context, obj *objectSDK.Object) error
if len(nodes) == 0 {
break
}
Owner

Irrelevant to the commit.

Irrelevant to the commit.
Author
Member

fixed

fixed
@ -197,1 +198,4 @@
visited := make([]atomic.Bool, len(nodes))
for idx := range parts {
visited[idx%len(nodes)].Store(true)
Owner

What does this line achieve? If we have 2 loops in writeECPart, we will still iterate over all nodes.

What does this line achieve? If we have 2 loops in `writeECPart`, we will still iterate over all nodes.
Author
Member

If there are 3 parts and 4 nodes, and part 3 fails, so part 3 should try to save to node 4, not node 1 or node 2 (if node 1 or node 2 goroutines are not started yet).

If there are 3 parts and 4 nodes, and part 3 fails, so part 3 should try to save to node 4, not node 1 or node 2 (if node 1 or node 2 goroutines are not started yet).
@ -227,0 +263,4 @@
}
// try to save to any node not visited by current part
for i := 0; i < len(nodes); i++ {
Owner

If we are here, it means that either partVisited[idx] == true or we skipped some iteration from the previous loop in if !visited[idx].CompareAndSwap(false, true). If partVisited[idx] == true, we skip the node, so this loop iterates over those visited[idx] which it skipped in the previous loop.

The question is: why do we need 2 loops?

If we are here, it means that either `partVisited[idx] == true` or we skipped some iteration from the previous loop in `if !visited[idx].CompareAndSwap(false, true)`. If `partVisited[idx] == true`, we skip the node, so this loop iterates over those `visited[idx]` which it skipped in the previous loop. The question is: why do we need 2 loops?
Author
Member

It there are 3 parts and 4 nodes and part with index 2 fails on nodes with index 2 and 3, then state after first iteration will be:

visited[0] = true // visited by part 0
visited[1] = true // visited by part 1
visited[2] = true // visited by part 2 and failed
visited[3] = true // visited by part 2 and failed

So after first iteration part 3 will try to save to node 0 and 1.

It there are 3 parts and 4 nodes and part with index 2 fails on nodes with index 2 and 3, then state after first iteration will be: ``` visited[0] = true // visited by part 0 visited[1] = true // visited by part 1 visited[2] = true // visited by part 2 and failed visited[3] = true // visited by part 2 and failed ``` So after first iteration part 3 will try to save to node 0 and 1.
dstepanov-yadro force-pushed fix/ec_put_node_off from 4752398532 to 03a9a1a516 2024-07-08 14:55:57 +00:00 Compare
fyrchik approved these changes 2024-07-09 07:54:21 +00:00
fyrchik merged commit ca974b8b4c into master 2024-07-09 07:54:31 +00:00
Sign in to join this conversation.
No reviewers
TrueCloudLab/storage-core-developers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: TrueCloudLab/frostfs-node#1233
No description provided.