object: Fix Put for EC object when node unavailable #1427

Merged
acid-ant merged 1 commit from acid-ant/frostfs-node:bugfix/ec-writer into master 2024-10-14 07:23:49 +00:00
Member

Add test for ECWriter.

How to reproduce bug:

  • Create container with EC policy which selects only 3 nodes from 4
$ frostfs-cli -r s04.frostfs.devenv:8080 -c cfg.yml container create --policy 'EC 1.1 IN S CBF 1 SELECT 3 FROM F AS S FILTER NOT(County EQ Finland) AS F' --await --basic-acl public-read-write
CID: Asa4oPNuASfmvgmjpxjMAuWAfALBw6c4k3t1d3mK9oG8
awaiting...
container has been persisted on sidechain
$ frostfs-cli -r s04.frostfs.devenv:8080 -c cfg.yml netmap nodeinfo
key: 038c862959e56b43e20f79187c4fe9e0bc7c8c66c1603e6cf0ec7f87ab6b08dc35
state: online
...
attribute: Country=Finland
...
$ frostfs-cli -r s03.frostfs.devenv:8080 -c cfg.yml netmap nodeinfo
key: 02ac920cd7df0b61b289072e6b946e2da4e1a31b9ab1c621bb475e30fa4ab102c3
  • Stop first two nodes which are container nodes
$ docker stop s01 s02
s01
s02
  • Try to put object
$ frostfs-cli -r s04.frostfs.devenv:8080 -c cfg.yml object put --cid Asa4oPNuASfmvgmjpxjMAuWAfALBw6c4k3t1d3mK9oG8 --file /tmp/object.sample.dev --timeout 1h
 671 / 671 [=====================================================================================================================================================] 100.00% 0s
rpc error: client failure: rpc error: code = Unknown desc = could not close stream and receive response: could not close stream and receive response: (*putsvc.streamer) could not object put stream: (*putsvc.Streamer) could not close object target: could not write to next target: incomplete object PUT by placement: failed to save EC chunk Asa4oPNuASfmvgmjpxjMAuWAfALBw6c4k3t1d3mK9oG8/3FjU18ZJttevu4S1b5TzeAUFLsy4UNMxumEcgPU3f5ak to any node
  • Check storage node log:
2024-10-11T09:34:12.045Z        warn    writer/ec.go:246        failed to save EC part  {"part_address": "Asa4oPNuASfmvgmjpxjMAuWAfALBw6c4k3t1d3mK9oG8/3FjU18ZJttevu4S1b5TzeAUFLsy4UNMxumEcgPU3f5ak", "parent_address": "2KjWddp7EijVGfsKkbNb2T5cWYUWUvT7Me6GuAoaXvTg", "part_index": 1, "node": "022bb4041c50d607ff871dec7e4cd7778388e0ea6849d84ccbd9aa8f32e16a8131", "error": "(*writer.remoteWriter) could not put single object to [/dns4/s01.frostfs.devenv/tcp/8080]: put single object on client: client has recently failed, skipping"}


2024-10-11T09:34:12.045Z        warn    writer/ec.go:294        failed to save EC part  {"part_address": "Asa4oPNuASfmvgmjpxjMAuWAfALBw6c4k3t1d3mK9oG8/3FjU18ZJttevu4S1b5TzeAUFLsy4UNMxumEcgPU3f5ak", "parent_address": "2KjWddp7EijVGfsKkbNb2T5cWYUWUvT7Me6GuAoaXvTg", "part_index": 1, "node": "03ff65b6ae79134a4dce9d0d39d3851e9bab4ee97abf86e81e1c5bbc50cd2826ae", "error": "(*writer.remoteWriter) could not put single object to [/dns4/s02.frostfs.devenv/tcp/8080]: put single object on client: client has recently failed, skipping"}

Signed-off-by: Anton Nikiforov an.nikiforov@yadro.com

Add test for `ECWriter`. How to reproduce bug: - Create container with EC policy which selects only 3 nodes from 4 ``` $ frostfs-cli -r s04.frostfs.devenv:8080 -c cfg.yml container create --policy 'EC 1.1 IN S CBF 1 SELECT 3 FROM F AS S FILTER NOT(County EQ Finland) AS F' --await --basic-acl public-read-write CID: Asa4oPNuASfmvgmjpxjMAuWAfALBw6c4k3t1d3mK9oG8 awaiting... container has been persisted on sidechain $ frostfs-cli -r s04.frostfs.devenv:8080 -c cfg.yml netmap nodeinfo key: 038c862959e56b43e20f79187c4fe9e0bc7c8c66c1603e6cf0ec7f87ab6b08dc35 state: online ... attribute: Country=Finland ... $ frostfs-cli -r s03.frostfs.devenv:8080 -c cfg.yml netmap nodeinfo key: 02ac920cd7df0b61b289072e6b946e2da4e1a31b9ab1c621bb475e30fa4ab102c3 ``` - Stop first two nodes which are container nodes ``` $ docker stop s01 s02 s01 s02 ``` - Try to put object ``` $ frostfs-cli -r s04.frostfs.devenv:8080 -c cfg.yml object put --cid Asa4oPNuASfmvgmjpxjMAuWAfALBw6c4k3t1d3mK9oG8 --file /tmp/object.sample.dev --timeout 1h 671 / 671 [=====================================================================================================================================================] 100.00% 0s rpc error: client failure: rpc error: code = Unknown desc = could not close stream and receive response: could not close stream and receive response: (*putsvc.streamer) could not object put stream: (*putsvc.Streamer) could not close object target: could not write to next target: incomplete object PUT by placement: failed to save EC chunk Asa4oPNuASfmvgmjpxjMAuWAfALBw6c4k3t1d3mK9oG8/3FjU18ZJttevu4S1b5TzeAUFLsy4UNMxumEcgPU3f5ak to any node ``` - Check storage node log: ``` 2024-10-11T09:34:12.045Z warn writer/ec.go:246 failed to save EC part {"part_address": "Asa4oPNuASfmvgmjpxjMAuWAfALBw6c4k3t1d3mK9oG8/3FjU18ZJttevu4S1b5TzeAUFLsy4UNMxumEcgPU3f5ak", "parent_address": "2KjWddp7EijVGfsKkbNb2T5cWYUWUvT7Me6GuAoaXvTg", "part_index": 1, "node": "022bb4041c50d607ff871dec7e4cd7778388e0ea6849d84ccbd9aa8f32e16a8131", "error": "(*writer.remoteWriter) could not put single object to [/dns4/s01.frostfs.devenv/tcp/8080]: put single object on client: client has recently failed, skipping"} 2024-10-11T09:34:12.045Z warn writer/ec.go:294 failed to save EC part {"part_address": "Asa4oPNuASfmvgmjpxjMAuWAfALBw6c4k3t1d3mK9oG8/3FjU18ZJttevu4S1b5TzeAUFLsy4UNMxumEcgPU3f5ak", "parent_address": "2KjWddp7EijVGfsKkbNb2T5cWYUWUvT7Me6GuAoaXvTg", "part_index": 1, "node": "03ff65b6ae79134a4dce9d0d39d3851e9bab4ee97abf86e81e1c5bbc50cd2826ae", "error": "(*writer.remoteWriter) could not put single object to [/dns4/s02.frostfs.devenv/tcp/8080]: put single object on client: client has recently failed, skipping"} ``` Signed-off-by: Anton Nikiforov <an.nikiforov@yadro.com>
acid-ant added 1 commit 2024-10-11 12:46:17 +00:00
[#xx] object: Fix Put for EC object when node unavailable
Some checks failed
DCO action / DCO (pull_request) Failing after 1m45s
Tests and linters / Run gofumpt (pull_request) Successful in 1m56s
Build / Build Components (pull_request) Successful in 2m15s
Tests and linters / Staticcheck (pull_request) Successful in 2m59s
Vulncheck / Vulncheck (pull_request) Successful in 3m17s
Pre-commit hooks / Pre-commit (pull_request) Successful in 3m48s
Tests and linters / gopls check (pull_request) Successful in 3m53s
Tests and linters / Lint (pull_request) Successful in 4m43s
Tests and linters / Tests (pull_request) Successful in 6m55s
Tests and linters / Tests with -race (pull_request) Successful in 7m26s
b7ad738537
Add test for `ECWriter`.

Signed-off-by: Anton Nikiforov <an.nikiforov@yadro.com>
acid-ant force-pushed bugfix/ec-writer from b7ad738537 to cd2b523a64 2024-10-11 12:51:11 +00:00 Compare
Owner

Could you describe what the bug was in the commit message?
I understand the diff, but fail to understand why the old code was incorrect.

Could you describe what the bug was in the commit message? I understand the diff, but fail to understand why the old code was incorrect.
fyrchik reviewed 2024-10-11 13:00:37 +00:00
@ -0,0 +79,4 @@
}
func (c multiAddressClient) ObjectDelete(_ context.Context, _ apiclient.PrmObjectDelete) (*apiclient.ResObjectDelete, error) {
panic("implement me")
Owner

We can use interface embedding for this purpose -- it will panic with NPE, but we won't have to write boilerplate.
This style is bad for real code, but good for tests IMO.

We can use interface embedding for this purpose -- it will panic with NPE, but we won't have to write boilerplate. This style is bad for real code, but good for tests IMO.
Author
Member

Agree, updated. Looks much better.

Agree, updated. Looks much better.
Author
Member

Could you describe what the bug was in the commit message?
I understand the diff, but fail to understand why the old code was incorrect.

With the old code, there might be situation when context canceled earlier then we move to another part of nodes.

> Could you describe what the bug was in the commit message? > I understand the diff, but fail to understand why the old code was incorrect. With the old code, there might be situation when context canceled earlier then we move to another part of nodes.
fyrchik reviewed 2024-10-11 13:06:02 +00:00
@ -0,0 +137,4 @@
data := make([]byte, 100)
_, _ = rand.Read(data)
var ver version.Version
Owner

version.Current()?

`version.Current()`?
Author
Member

Fixed.

Fixed.
fyrchik marked this conversation as resolved
@ -0,0 +192,4 @@
require.NoError(t, err)
}
func testNodeMatrix(t testing.TB, dim []int) ([][]netmap.NodeInfo, [][]string) {
Owner

Have you copied this function from somewhere?
I am wondering about dim which is an array, but only has length=1 at the call-site.

Have you copied this function from somewhere? I am wondering about `dim` which is an array, but only has length=1 at the call-site.
Author
Member

Yeah, copied. But in my defense, I will say that it has already been duplicated :-)

Yeah, copied. But in my defense, I will say that it has already been duplicated :-)
fyrchik marked this conversation as resolved
acid-ant force-pushed bugfix/ec-writer from cd2b523a64 to 549e54e436 2024-10-11 13:27:51 +00:00 Compare
acid-ant changed title from [#xx] object: Fix Put for EC object when node unavailable to object: Fix Put for EC object when node unavailable 2024-10-11 13:29:17 +00:00
acid-ant force-pushed bugfix/ec-writer from 549e54e436 to 518ed4be15 2024-10-11 13:42:55 +00:00 Compare
acid-ant requested review from storage-core-committers 2024-10-11 13:49:38 +00:00
acid-ant requested review from storage-core-developers 2024-10-11 13:49:38 +00:00
acid-ant force-pushed bugfix/ec-writer from 518ed4be15 to acd6eb1815 2024-10-11 14:02:47 +00:00 Compare
Author
Member

Commit message updated.

Commit message updated.
fyrchik reviewed 2024-10-11 19:01:56 +00:00
@ -223,1 +228,4 @@
})
}
}
err = eg.Wait()
Owner

The linter hasn't caught this, but it seems the error value is unused, unless it is the last loop iteration.
This was not a problem previously, because waiting was done once, outside the loop.

The linter hasn't caught this, but it seems the error value is unused, unless it is the last loop iteration. This was not a problem previously, because waiting was done once, outside the loop.
Owner

It this intentional?

It this intentional?
Author
Member

Yes, it is intentional. The idea is to iterate over all nodes and return only the last error. All other errors will be logged inside e.writePart(...).

Yes, it is intentional. The idea is to iterate over all nodes and return only the last error. All other errors will be logged inside `e.writePart(...)`.
dstepanov-yadro approved these changes 2024-10-14 07:18:04 +00:00
fyrchik approved these changes 2024-10-14 07:22:52 +00:00
acid-ant merged commit acd6eb1815 into master 2024-10-14 07:23:49 +00:00
acid-ant deleted branch bugfix/ec-writer 2024-10-14 07:23:51 +00:00
Sign in to join this conversation.
No reviewers
TrueCloudLab/storage-core-developers
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: TrueCloudLab/frostfs-node#1427
No description provided.