Fix EC put when some node is off #1233
Reference: TrueCloudLab/frostfs-node#1233
Before:
With EC 2.1 and 4 nodes, frostfs-node saves the 3 parts only to index-specific nodes. If one of those nodes is down or off, the put fails with an error.
After:
If a node is down or off, frostfs-node first tries to save the part to any node that doesn't hold an EC part yet, and then tries to save it to any node.
aa1050bbae to af40d73421
af40d73421 to 4752398532
@@ -158,9 +160,8 @@ func (e *ecWriter) writeECPart(ctx context.Context, obj *objectSDK.Object) error
if len(nodes) == 0 {
break
}
Irrelevant to the commit.
fixed
@@ -197,1 +198,4 @@
visited := make([]atomic.Bool, len(nodes))
for idx := range parts {
visited[idx%len(nodes)].Store(true)
What does this line achieve? If we have 2 loops in writeECPart, we will still iterate over all nodes.
If there are 3 parts and 4 nodes and part 3 fails, then part 3 should try to save to node 4, not to node 1 or node 2 (in case the goroutines for node 1 or node 2 have not started yet).
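The effect of the pre-marking discussed above can be shown with a minimal sketch. It assumes, as the diff suggests, that each part first targets the node at `idx % len(nodes)`. The helper names `reservePrimaryNodes` and `freeNodes` are hypothetical and do not come from the PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// reservePrimaryNodes marks the node each part targets first
// (part index modulo node count) as visited, so that a failed part
// does not take a node that another part is about to use.
// Hypothetical helper for illustration, not the PR's code.
func reservePrimaryNodes(partCount, nodeCount int) []atomic.Bool {
	if nodeCount == 0 {
		return nil
	}
	visited := make([]atomic.Bool, nodeCount)
	for idx := 0; idx < partCount; idx++ {
		visited[idx%nodeCount].Store(true)
	}
	return visited
}

// freeNodes lists the indexes of nodes that no part has reserved.
func freeNodes(visited []atomic.Bool) []int {
	var free []int
	for i := range visited {
		if !visited[i].Load() {
			free = append(free, i)
		}
	}
	return free
}

func main() {
	// 3 parts and 4 nodes: only the node with index 3 stays free.
	fmt.Println(freeNodes(reservePrimaryNodes(3, 4))) // prints [3]
}
```

With 3 parts and 4 nodes, the only unreserved node is the one with index 3 (node 4 when counting from 1), which is where a failed part should go first.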
@@ -227,0 +263,4 @@
}
// try to save to any node not visited by current part
for i := 0; i < len(nodes); i++ {
If we are here, it means that either partVisited[idx] == true or we skipped some iteration of the previous loop in the check if !visited[idx].CompareAndSwap(false, true). If partVisited[idx] == true, we skip the node, so this loop iterates over those visited[idx] which were skipped in the previous loop.
The question is: why do we need 2 loops?
If there are 3 parts and 4 nodes, and the part with index 2 fails on the nodes with index 2 and 3, then the state after the first iteration will be:
So after the first iteration, part 3 will try to save to nodes 0 and 1.
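The two passes can be traced with a simplified, single-goroutine model. This is a sketch under assumptions, not the PR's implementation: plain bools stand in for the atomic.Bool values, every save attempt is assumed to fail so that the full order is visible, and the iteration order `(partIdx + i) % nodeCount` as well as the name `tryOrder` are hypothetical.

```go
package main

import "fmt"

// tryOrder returns the node indexes one part would try, in order,
// assuming every save attempt fails. Simplified sequential model.
func tryOrder(partIdx, partCount, nodeCount int) []int {
	if nodeCount == 0 {
		return nil
	}
	// visited: nodes reserved by any part (primary targets are pre-marked).
	visited := make([]bool, nodeCount)
	for idx := 0; idx < partCount; idx++ {
		visited[idx%nodeCount] = true
	}
	// partVisited: nodes this particular part has already tried.
	partVisited := make([]bool, nodeCount)
	var order []int
	try := func(node int) {
		partVisited[node] = true
		order = append(order, node)
	}

	// The part's own, index-specific node comes first.
	try(partIdx % nodeCount)

	// First pass: only nodes that no part has reserved yet.
	for i := 0; i < nodeCount; i++ {
		node := (partIdx + i) % nodeCount // assumed iteration order
		if partVisited[node] || visited[node] {
			continue
		}
		visited[node] = true
		try(node)
	}

	// Second pass: any node this part has not tried yet.
	for i := 0; i < nodeCount; i++ {
		node := (partIdx + i) % nodeCount
		if partVisited[node] {
			continue
		}
		try(node)
	}
	return order
}

func main() {
	// Part with index 2, 3 parts, 4 nodes.
	fmt.Println(tryOrder(2, 3, 4)) // prints [2 3 0 1]
}
```

In this model the part with index 2 tries node 2, then the free node 3, and only in the second pass falls back to nodes 0 and 1, which matches the scenario described in the comment above.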
4752398532 to 03a9a1a516