Add EC replication #1129

Merged
dstepanov-yadro merged 7 commits from dstepanov-yadro/frostfs-node:feat/ec_restore into master 2024-05-17 14:32:38 +00:00
  • Replicate EC chunk to another node

  • Pull EC chunk from another node

  • Lost EC chunk restore

Check scenario:

  • create an EC container
  • check object nodes with frostfs-cli
  • remove one node from netmap with frostfs-adm
  • put objects to created container
  • wait while removed node returns to netmap
  • check object nodes with frostfs-cli

Need to implement EC delete/inhume operations to get correct results.

* Replicate EC chunk to another node * Pull EC chunk from another node * Lost EC chunk restore Check scenario: - create an EC container - check object nodes with frostfs-cli - remove one node from netmap with frostfs-adm - put objects to created container - wait while removed node returns to netmap - check object nodes with frostfs-cli Need to implement EC delete/inhume operations to get correct results.
dstepanov-yadro added 1 commit 2024-05-13 13:53:31 +00:00
Build / Build Components (1.21) (pull_request) Successful in 2m38s Details
DCO action / DCO (pull_request) Successful in 2m37s Details
Vulncheck / Vulncheck (pull_request) Successful in 2m37s Details
Build / Build Components (1.22) (pull_request) Successful in 4m23s Details
Pre-commit hooks / Pre-commit (pull_request) Successful in 5m20s Details
Tests and linters / Staticcheck (pull_request) Successful in 6m25s Details
Tests and linters / Lint (pull_request) Successful in 7m28s Details
Tests and linters / gopls check (pull_request) Successful in 9m21s Details
Tests and linters / Tests (1.21) (pull_request) Successful in 10m35s Details
Tests and linters / Tests (1.22) (pull_request) Successful in 10m30s Details
Tests and linters / Tests with -race (pull_request) Successful in 10m34s Details
238aeee47c
[#9999] policer: Add EC chunk replication
Signed-off-by: Dmitrii Stepanov <d.stepanov@yadro.com>
dstepanov-yadro force-pushed feat/ec_restore from 238aeee47c to 93b2e25f73 2024-05-13 14:11:13 +00:00 Compare
dstepanov-yadro added 1 commit 2024-05-14 11:44:05 +00:00
DCO action / DCO (pull_request) Successful in 2m4s Details
Build / Build Components (1.21) (pull_request) Successful in 3m3s Details
Build / Build Components (1.22) (pull_request) Successful in 3m59s Details
Vulncheck / Vulncheck (pull_request) Successful in 3m24s Details
Tests and linters / Staticcheck (pull_request) Failing after 5m48s Details
Pre-commit hooks / Pre-commit (pull_request) Successful in 6m43s Details
Tests and linters / gopls check (pull_request) Successful in 7m51s Details
Tests and linters / Lint (pull_request) Successful in 9m10s Details
Tests and linters / Tests (1.22) (pull_request) Successful in 10m59s Details
Tests and linters / Tests with -race (pull_request) Successful in 10m55s Details
Tests and linters / Tests (1.21) (pull_request) Successful in 11m7s Details
c3cfd4d6d6
[#1129] policer: Pull required EC chunks
Signed-off-by: Dmitrii Stepanov <d.stepanov@yadro.com>
dstepanov-yadro force-pushed feat/ec_restore from c3cfd4d6d6 to c5e5c204c9 2024-05-15 07:25:07 +00:00 Compare
dstepanov-yadro force-pushed feat/ec_restore from c5e5c204c9 to 1afdf3409a 2024-05-15 11:40:56 +00:00 Compare
dstepanov-yadro changed title from WIP: Add EC replication to Add EC replication 2024-05-15 11:58:32 +00:00
dstepanov-yadro requested review from storage-core-committers 2024-05-15 11:58:39 +00:00
dstepanov-yadro requested review from storage-core-developers 2024-05-15 11:58:40 +00:00
fyrchik requested changes 2024-05-15 12:25:24 +00:00
@ -77,3 +77,3 @@
Nodes: nodes,
}
s.replicator.HandleTask(ctx, task, &res)
s.replicator.HandleReplicateTask(ctx, task, &res)

HandleReplicationTask? handle is a verb, replicate is a verb too

`HandleReplicationTask`? `handle` is a verb, `replicate` is a verb too
Poster
Collaborator

Done

Done
fyrchik marked this conversation as resolved
@ -53,11 +54,6 @@ func (p *Policer) processObject(ctx context.Context, addrWithType objectcore.Add
c := &placementRequirements{}
var numOfContainerNodes int

Seems like a separate fix, unrelated to the commit

Seems like a separate fix, unrelated to the commit
Poster
Collaborator

Done

Done
fyrchik marked this conversation as resolved
@ -0,0 +41,4 @@
c := &placementRequirements{}
checkedNodes := newNodeCache()
for i := range nn {

len(nn) is currently 1 for all EC policies.
Can we simplify the code having this in mind?
Because when it won't be 1, we will also likely have different EC X.Y or REP N descriptors for different indexes in this array.

The distinction between REP/EC could be done based on each nn element, not on the uppermost level.

`len(nn)` is currently 1 for all EC policies. Can we simplify the code having this in mind? Because when it won't be 1, we will also likely have different `EC X.Y` or `REP N` descriptors for different indexes in this array. The distinction between `REP`/`EC` could be done based on each `nn` element, not on the uppermost level.
Poster
Collaborator

fixed

fixed
fyrchik marked this conversation as resolved
@ -0,0 +48,4 @@
default:
}
rd := policy.ReplicaDescriptor(i)
// processNodes replaces rd.GetECDataCount() + rd.GetECParityCount() for len(nn[i]) for valid object type.

Could you elaborate a bit what this comment is trying to explain and why it is here?

Could you elaborate a bit what this comment is trying to explain and why it is here?
Poster
Collaborator

Refactored, see last commit.

Refactored, see last commit.
fyrchik marked this conversation as resolved
@ -0,0 +96,4 @@
}
// processECChunk replicates EC chunk if needed.
// Returns True if current chunk should be stored on current node.

It returns a struct.

It returns a `struct`.
Poster
Collaborator

Fixed

Fixed
fyrchik marked this conversation as resolved
@ -0,0 +177,4 @@
return false
}
func (p *Policer) collectRequiredECChunks(nodes []netmap.NodeInfo, objInfo objectcore.Info) map[uint32][]netmap.NodeInfo {

By design there is only 1 required chunk to store on the node (EC policy implies UNIQUE).
Why do we return a map here? It seems the code could be greatly simplified, having this in mind.

_By design_ there is only 1 required chunk to store on the node (EC policy implies `UNIQUE`). Why do we return a map here? It seems the code could be greatly simplified, having this in mind.
Poster
Collaborator

Not always: evacuation or not completed replication may lead that chunk may be stored on different chunks.

Not always: evacuation or not completed replication may lead that chunk may be stored on different chunks.

Evacuation does not affect the set of required chunks in any way.

Evacuation does not affect the set of required chunks in any way.
Poster
Collaborator

Agree, current implementation does not affect. But I think that in case of evacuation from a node, EC chunk should move to another node, even if it does not comply with the storage policy.

Agree, current implementation does not affect. But I think that in case of evacuation from a node, EC chunk should move to another node, even if it does not comply with the storage policy.

Yes, but it is moved to a node which should not store it, and I have thought this function determines whether the node should store it?
Another way to look at it: policer must depend only on the network map and placement policy, evacuation is a separate process.

Yes, but it is moved to a node which _should not_ store it, and I have thought this function determines whether the node _should_ store it? Another way to look at it: policer _must_ depend only on the network map and placement policy, evacuation is a separate process.

Actually, it may happen in a situation when we have EC 3.1 and 1 node went offline, so we have 3 nodes in the network map.
What is the behaviour in this case?

Actually, it may happen in a situation when we have `EC 3.1` and 1 node went offline, so we have 3 nodes in the network map. What is the behaviour in this case?
Poster
Collaborator

this function determines whether the node should store it

This function collects all nodes that store required chunks.

What is the behaviour in this case?

Policer will get an error from offline node and will skip object.

> this function determines whether the node should store it This function collects all nodes that store required chunks. > What is the behaviour in this case? Policer will get an error from offline node and will skip object.
fyrchik marked this conversation as resolved
dstepanov-yadro force-pushed feat/ec_restore from 1afdf3409a to 76bbd193da 2024-05-15 13:09:19 +00:00 Compare
dstepanov-yadro added 1 commit 2024-05-15 13:13:28 +00:00
DCO action / DCO (pull_request) Successful in 6m27s Details
Vulncheck / Vulncheck (pull_request) Successful in 7m21s Details
Build / Build Components (1.22) (pull_request) Successful in 9m26s Details
Build / Build Components (1.21) (pull_request) Successful in 9m43s Details
Tests and linters / gopls check (pull_request) Successful in 11m5s Details
Tests and linters / Staticcheck (pull_request) Successful in 11m55s Details
Tests and linters / Lint (pull_request) Successful in 14m9s Details
Tests and linters / Tests (1.22) (pull_request) Successful in 17m1s Details
Pre-commit hooks / Pre-commit (pull_request) Successful in 16m41s Details
Tests and linters / Tests (1.21) (pull_request) Successful in 17m52s Details
Tests and linters / Tests with -race (pull_request) Successful in 18m18s Details
779eb553a9
[#1129] policer: Refactor shortage
Drop override inside method.

Signed-off-by: Dmitrii Stepanov <d.stepanov@yadro.com>
dstepanov-yadro force-pushed feat/ec_restore from 779eb553a9 to ed7f18e3f6 2024-05-15 13:18:06 +00:00 Compare
dstepanov-yadro added 2 commits 2024-05-16 07:06:19 +00:00
cf177a6098 [#1129] object: Fix check owner for EC part
Do not validate EC part owner if request from container node.

Signed-off-by: Dmitrii Stepanov <d.stepanov@yadro.com>
Build / Build Components (1.21) (pull_request) Failing after 16s Details
DCO action / DCO (pull_request) Successful in 2m8s Details
Pre-commit hooks / Pre-commit (pull_request) Successful in 3m36s Details
Vulncheck / Vulncheck (pull_request) Successful in 3m12s Details
Build / Build Components (1.22) (pull_request) Successful in 4m15s Details
Tests and linters / gopls check (pull_request) Successful in 4m38s Details
Tests and linters / Staticcheck (pull_request) Successful in 4m54s Details
Tests and linters / Lint (pull_request) Successful in 5m45s Details
Tests and linters / Tests (1.22) (pull_request) Failing after 7m28s Details
Tests and linters / Tests (1.21) (pull_request) Failing after 7m35s Details
Tests and linters / Tests with -race (pull_request) Failing after 7m26s Details
851c929b62
[#1129] putSvc: Allow to put single unprepared object to EC container
Signed-off-by: Dmitrii Stepanov <d.stepanov@yadro.com>
dstepanov-yadro force-pushed feat/ec_restore from 851c929b62 to 66f540df65 2024-05-16 07:13:17 +00:00 Compare
dstepanov-yadro added 1 commit 2024-05-16 13:28:31 +00:00
Vulncheck / Vulncheck (pull_request) Successful in 4m32s Details
DCO action / DCO (pull_request) Successful in 4m58s Details
Tests and linters / gopls check (pull_request) Successful in 7m36s Details
Tests and linters / Staticcheck (pull_request) Successful in 8m5s Details
Build / Build Components (1.21) (pull_request) Successful in 8m27s Details
Tests and linters / Lint (pull_request) Successful in 9m2s Details
Pre-commit hooks / Pre-commit (pull_request) Successful in 12m12s Details
Tests and linters / Tests (1.22) (pull_request) Successful in 13m38s Details
Tests and linters / Tests (1.21) (pull_request) Successful in 14m23s Details
Tests and linters / Tests with -race (pull_request) Successful in 14m48s Details
Build / Build Components (1.22) (pull_request) Successful in 18m26s Details
783ed62992
[#1129] policer: Restore EC object
Signed-off-by: Dmitrii Stepanov <d.stepanov@yadro.com>
dstepanov-yadro force-pushed feat/ec_restore from 783ed62992 to a78805c555 2024-05-16 13:29:14 +00:00 Compare
aarifullin approved these changes 2024-05-16 20:31:53 +00:00
fyrchik reviewed 2024-05-17 07:29:51 +00:00
@ -248,2 +247,4 @@
return remoteReader.Head(ctx, prm)
},
),
policer.WithLocalObjectHeaderFunc(func(ctx context.Context, a oid.Address) (*objectSDK.Object, error) {

Why have you decided to go with 2 func options instead of an interface?

Why have you decided to go with 2 func options instead of an interface?
Poster
Collaborator

Because this approach has already been used in policer.

Because this approach has already been used in policer.
fyrchik marked this conversation as resolved
@ -97,2 +107,4 @@
return res.Header(), nil
}
func (h *RemoteReader) Get(ctx context.Context, prm *RemoteRequestPrm) (*objectSDK.Object, error) {

I am surprised to see Get in the head package, given that we also have get on the same level.
Why is it here?

I am surprised to see `Get` in the `head` package, given that we also have `get` on the same level. Why is it here?
Poster
Collaborator

Moved to objectSvc.

Moved to objectSvc.
fyrchik marked this conversation as resolved
@ -0,0 +293,4 @@
return
}
if objInfo.ECInfo.Total-uint32(len(resolved)) > policy.ReplicaDescriptor(0).GetECParityCount() {
var founded []uint32

Do you mean found? It is already a past participle from find, founded is what happened with Saint-Petersburg in 1703.

Do you mean `found`? It is already a past participle from `find`, `founded` is what happened with Saint-Petersburg in 1703.
Poster
Collaborator

Fixed

Fixed
fyrchik marked this conversation as resolved
@ -0,0 +323,4 @@
p.log.Error(logs.PolicerFailedToRestoreObject, zap.Stringer("object", parentAddress), zap.Error(err))
return
}
parts, err = c.Split(parentObject, key)

You call Reconstruct(), then you call Split on the result. These operations are inverse to each other.
What is the purpose?

You call `Reconstruct()`, then you call `Split` on the result. These operations are inverse to each other. What is the purpose?
Poster
Collaborator

Fixed

Fixed
fyrchik marked this conversation as resolved
dstepanov-yadro force-pushed feat/ec_restore from a78805c555 to 4e772b8eb4 2024-05-17 07:56:52 +00:00 Compare
dstepanov-yadro force-pushed feat/ec_restore from 4e772b8eb4 to 7727b7fe21 2024-05-17 07:59:37 +00:00 Compare
aarifullin reviewed 2024-05-17 08:44:59 +00:00
@ -0,0 +168,4 @@
var addr oid.Address
addr.SetContainer(objInfo.Address.Container())
addr.SetObject(indexToObjectID[index])
p.replicator.HandlePullTask(ctx, replicator.Task{
Collaborator

Just for curiosity: do we always need to pull missing chunks from remote node?
processECChunk sets removeLocal = true but a chunk, that is being replicated, is not removed yet until pullRequiredECChunks's run. Could a policer try to reconstruct missing chunk if the local node, by good fortune, keeps required chunks for reconstruction?

I suppose this is not a big deal and barely gives a big profit...

Just for curiosity: do we *always* need to pull missing chunks from remote node? `processECChunk` sets `removeLocal = true` but a chunk, that is being replicated, is not removed yet until `pullRequiredECChunks`'s run. Could a policer try to `reconstruct` missing chunk if the local node, *by good fortune*, keeps required chunks for reconstruction? I suppose this is not a big deal and barely gives a big profit...
Poster
Collaborator

By default local node should keep only one chunk.

By default local node should keep only one chunk.
aarifullin reviewed 2024-05-17 09:10:56 +00:00
@ -0,0 +272,4 @@
p.log.Error(logs.PolicerFailedToDecodeECChunkID, zap.Error(err), zap.Stringer("object", parentAddress))
return
}
if existed, exist := chunkIDs[ch.Index]; exist && existed != chunkID {
Collaborator

I found existed, exist is not appropriate pair name. You could rename chunkID to ecInfoChunkID, but existed to chunkID

I found `existed, exist` is not appropriate pair name. You could rename `chunkID` to `ecInfoChunkID`, but `existed` to `chunkID`
Poster
Collaborator

Done

Done
aarifullin marked this conversation as resolved
aarifullin reviewed 2024-05-17 09:19:07 +00:00
@ -0,0 +273,4 @@
return
}
if existed, exist := chunkIDs[ch.Index]; exist && existed != chunkID {
p.log.Error(logs.PolicerDifferentObjectIDForTheSameECChunk, zap.Stringer("first", existed),
Collaborator

Please, could you give a brief explanation how this could be that chunks related to the same parent may differ by IDs?

Please, could you give a brief explanation how this could be that chunks related to the same parent may differ by IDs?
Poster
Collaborator

A bug that I hope we don't have :)

A bug that I hope we don't have :)
aarifullin marked this conversation as resolved
dstepanov-yadro force-pushed feat/ec_restore from 7727b7fe21 to e9e5ba549e 2024-05-17 09:19:39 +00:00 Compare
aarifullin approved these changes 2024-05-17 09:30:29 +00:00
dstepanov-yadro force-pushed feat/ec_restore from e9e5ba549e to 436c9f5558 2024-05-17 11:36:32 +00:00 Compare
fyrchik approved these changes 2024-05-17 12:01:23 +00:00
acid-ant approved these changes 2024-05-17 14:20:56 +00:00
dstepanov-yadro merged commit 436c9f5558 into master 2024-05-17 14:32:38 +00:00
dstepanov-yadro deleted branch feat/ec_restore 2024-05-17 14:32:39 +00:00
Sign in to join this conversation.
No reviewers
No Milestone
No project
No Assignees
4 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: TrueCloudLab/frostfs-node#1129
There is no content yet.