Clarify node behavior with maintenance status #594

Closed
opened 2023-08-10 13:06:01 +00:00 by dkirillov · 7 comments
Member

Scenario:

  1. Have 4 nodes (dev-env )
  2. Mark one node as maintenance
  3. Create container with REP 4 placement policy
  4. Create object
  5. Delete object

Current behavior

  • object put failed, but 3 out of 4 nodes contain my object
  • object delete failed
container created: 9vuZiGtzRewZDGv3Rb35xFE7QQUBcE2m8225b2DJuQPH
put object error: init writing on API client: client failure: rpc error: code = Unknown desc = could not close stream and receive response: could not close stream and receive response: (*putsvc.streamer) could not object put stream: (*putsvc.Streamer) could not close object target: could not write to next target: incomplete object PUT by placement: could not write header: (*putsvc.remoteTarget) could not put single object to [/dns4/s02.frostfs.devenv/tcp/8080]: put single object via client: status: code = 1027 message = node is under maintenance
found object: 8mSuV53anhXuJ73gg5Lfz4NN3x2B7KVHwb3mL3bLrVYr
delete object error: remove object via client: delete object on client: status: code = 1024 message = incomplete object PUT by placement: could not write header: (*putsvc.remoteTarget) could not put single object to [/dns4/s02.frostfs.devenv/tcp/8080]: put single object via client: status: code = 1027 message = node is under maintenance

If we set the copies number to 2 for put operations, we get:

container created: DRdoDbHXhJQ7a8a8Y12xvXxNJ9arfSTTFaB36EU6b4Km
put object error: <nil>
found object: A5H4sUpMxKz98j8ehMp7QopfsuYpEev3it2oeeKYsLRA
delete object error: remove object via client: delete object on client: status: code = 1024 message = incomplete object PUT by placement: could not write header: (*putsvc.remoteTarget) could not put single object to [/dns4/s02.frostfs.devenv/tcp/8080]: put single object via client: status: code = 1027 message = node is under maintenance
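
For reference, only the PUT step of the test below changes for the copies-number run; a minimal sketch (using the same PrmObjectPut and the SetCopiesNumber setter shown later in this thread):

	var prmPut PrmObjectPut
	prmPut.SetCopiesNumber(2) // require only 2 placed copies instead of all 4
	prmPut.SetHeader(*obj)
	prmPut.SetPayload(bytes.NewBufferString("content"))

	_, err = clientPool.PutObject(ctx, prmPut)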

Is such behavior expected?

Steps to reproduce:
Steps 1-2:

frostfs-cli --endpoint "${DEV_ENV_NODE_2_CONTROL}" -w "${DEV_ENV_WALLET_NODE_2}"  control set-status --status maintenance
Network status update request successfully sent.

bin/frostfs-cli netmap snapshot -r "${DEV_ENV_NODE_1}" -g
Epoch: 7
Node 1: 022bb4041c50d607ff871dec7e4cd7778388e0ea6849d84ccbd9aa8f32e16a8131 ONLINE /dns4/s01.frostfs.devenv/tcp/8080 
        Continent: Europe
        Country: Russia
        CountryCode: RU
        Location: Moskva
        Price: 22
        SubDiv: Moskva
        SubDivCode: MOW
        UN-LOCODE: RU MOW
        User-Agent: FrostFS/0.34
Node 2: 02ac920cd7df0b61b289072e6b946e2da4e1a31b9ab1c621bb475e30fa4ab102c3 ONLINE /dns4/s03.frostfs.devenv/tcp/8080 
        Continent: Europe
        Country: Sweden
        CountryCode: SE
        Location: Stockholm
        Price: 11
        SubDiv: Stockholms län
        SubDivCode: AB
        UN-LOCODE: SE STO
        User-Agent: FrostFS/0.34
Node 3: 038c862959e56b43e20f79187c4fe9e0bc7c8c66c1603e6cf0ec7f87ab6b08dc35 ONLINE /dns4/s04.frostfs.devenv/tcp/8082/tls /dns4/s04.frostfs.devenv/tcp/8080 
        Continent: Europe
        Country: Finland
        CountryCode: FI
        Location: Helsinki (Helsingfors)
        Price: 44
        SubDiv: Uusimaa
        SubDivCode: 18
        UN-LOCODE: FI HEL
        User-Agent: FrostFS/0.34
Node 4: 03ff65b6ae79134a4dce9d0d39d3851e9bab4ee97abf86e81e1c5bbc50cd2826ae MAINTENANCE /dns4/s02.frostfs.devenv/tcp/8080 
        Continent: Europe
        Country: Russia
        CountryCode: RU
        Location: Saint Petersburg (ex Leningrad)
        Price: 33
        SubDiv: Sankt-Peterburg
        SubDivCode: SPE
        UN-LOCODE: RU LED
        User-Agent: FrostFS/0.34
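
To undo step 2 after the experiment, the same control command should accept the online status (a sketch, assuming the node is allowed to switch back right away):

frostfs-cli --endpoint "${DEV_ENV_NODE_2_CONTROL}" -w "${DEV_ENV_WALLET_NODE_2}" control set-status --status online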

Steps 3-5:
(the test must be run in the frostfs-sdk-go/pool package)


func TestPool(t *testing.T) {
	ctx := context.Background()
	devenvKey := "1dd37fba80fec4e6a6f13fd708d8dcb3b29def768017052f6c930fa1c5d90bbb"
	key, err := keys.NewPrivateKeyFromHex(devenvKey)
	require.NoError(t, err)
	var devenvOwner user.ID
	user.IDFromKey(&devenvOwner, key.PrivateKey.PublicKey)

	var prm InitParameters
	prm.SetKey(&key.PrivateKey)
	prm.SetLogger(zaptest.NewLogger(t))
	prm.SetNodeDialTimeout(3 * time.Second)
	prm.SetClientRebalanceInterval(5 * time.Second)
	prm.AddNode(NewNodeParam(1, "s01.frostfs.devenv:8080", 1))
	clientPool, err := NewPool(prm)
	require.NoError(t, err)
	err = clientPool.Dial(ctx)
	require.NoError(t, err)

	// REP 4 places the object on all 4 dev-env nodes, including the one under MAINTENANCE.
	var pp netmap.PlacementPolicy
	err = pp.DecodeString("REP 4")
	require.NoError(t, err)

	var cnr container.Container
	cnr.Init()
	cnr.SetPlacementPolicy(pp)
	cnr.SetOwner(devenvOwner)
	cnr.SetBasicACL(acl.PublicRWExtended)

	err = SyncContainerWithNetwork(ctx, &cnr, clientPool)
	require.NoError(t, err)

	cnrID, err := clientPool.PutContainer(ctx, PrmContainerPut{
		ClientParams: sdkClient.PrmContainerPut{Container: &cnr},
	})
	require.NoError(t, err)

	fmt.Println("container created:", cnrID.EncodeToString())

	obj := object.New()
	obj.SetContainerID(cnrID)
	obj.SetOwnerID(&devenvOwner)

	// PUT with the default copies number: expected to fail, since placement includes the maintenance node.
	var prmPut PrmObjectPut
	prmPut.SetHeader(*obj)
	prmPut.SetPayload(bytes.NewBufferString("content"))

	_, err = clientPool.PutObject(ctx, prmPut)
	fmt.Println("put object error:", err)

	filter := object.NewSearchFilters()
	filter.AddRootFilter()
	var prmSearch PrmObjectSearch
	prmSearch.SetContainerID(cnrID)
	prmSearch.SetFilters(filter)

	resSearch, err := clientPool.SearchObjects(ctx, prmSearch)
	require.NoError(t, err)

	var list []oid.ID
	err = resSearch.Iterate(func(id oid.ID) bool {
		fmt.Println("found object:", id.EncodeToString())
		list = append(list, id)
		return false
	})
	require.NoError(t, err)
	require.Len(t, list, 1)

	var addr oid.Address
	addr.SetContainer(cnrID)
	addr.SetObject(list[0])
	var prmGet PrmObjectGet
	prmGet.SetAddress(addr)

	resGet, err := clientPool.GetObject(ctx, prmGet)
	require.NoError(t, err)
	data, err := io.ReadAll(resGet.Payload)
	require.NoError(t, err)
	require.Equal(t, "content", string(data))

	// DELETE also fails: the tombstone is first put under the container's REP 4 policy (see the discussion below),
	// so it again targets the node under maintenance.
	var prmDelete PrmObjectDelete
	prmDelete.SetAddress(addr)
	err = clientPool.DeleteObject(ctx, prmDelete)
	fmt.Println("delete object error:", err)
}

Node version: [c3e23a14](https://git.frostfs.info/TrueCloudLab/frostfs-node/commit/c3e23a14489b97aab71392e436a435e99f6d7361) (current master)

dkirillov added the question, triage labels 2023-08-10 13:06:01 +00:00
fyrchik added the frostfs-node label 2023-08-10 14:23:08 +00:00
fyrchik added this to the vNext milestone 2023-08-25 09:51:38 +00:00
fyrchik added the bug label 2023-08-25 09:52:54 +00:00
fyrchik modified the milestone from vNext to v0.37.0 2023-08-25 09:52:58 +00:00
Owner

I have tested it on master and everything works as expected:

  1. First put fails
  2. Second put succeeds.
    The object can be found because we do not currently remove partially placed objects.
	{
		var prmPut PrmObjectPut
		prmPut.SetCopiesNumber(4)
		prmPut.SetHeader(*obj)
		prmPut.SetPayload(bytes.NewBufferString("content"))

		_, err = clientPool.PutObject(ctx, prmPut)
		fmt.Println("put object error:", err)
		require.Error(t, err)
	}

	{
		var prmPut PrmObjectPut
		prmPut.SetCopiesNumber(3)
		prmPut.SetHeader(*obj)
		prmPut.SetPayload(bytes.NewBufferString("content"))

		_, err = clientPool.PutObject(ctx, prmPut)
		fmt.Println("put object error:", err)
		require.NoError(t, err)
	}
Author
Member

@fyrchik what about removal? Is it ok that we cannot delete the object in the second case?

Owner

This whole test succeeded when I ran it today on the master branch.

Owner

We have reproduced the issue -- with REP 4 it is the primary placement that fails, not the additional broadcast.
In this case the behaviour is expected -- tombstones are no worse than any other object.
With REP 3 everything works as expected. cc @dkirillov
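
For anyone re-running the repro above, only the policy string changes (same DecodeString call as in the test):

	var pp netmap.PlacementPolicy
	err = pp.DecodeString("REP 3") // 3 replicas fit on the 3 ONLINE nodes, so PUT and DELETE both succeed
	require.NoError(t, err)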

fyrchik reopened this issue 2023-08-28 08:18:29 +00:00
Owner

Agreed that this is expected behaviour, but there is definitely room for improvement in the protocol. Should we keep the discussion here or move to the API repository?

Also, as far as I remember, tombstones are 'broadcast' to all container nodes regardless of the REP policy. This is a slightly different placement compared to regular objects. Maybe it's okay to return a successful status code?

Owner

Tombstones are first put as simple objects and then broadcast on a best-effort basis. It is the former PUT that fails in the described scenario.
If we prepared the tombstone on the client side and then put it, we could use copies_number.

I agree, there is room for improvement -- maybe we can add copies-number to DELETE (or use X-Headers for this)?
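
A rough sketch of that idea against the pool test above -- purely hypothetical, assuming the SDK's object.Tombstone type and that the node would accept a client-built tombstone; a real tombstone also needs an expiration-epoch attribute, which is omitted here:

	// Hypothetical: build the tombstone client-side and send it as an ordinary PUT,
	// so the pool's copies-number option applies to it as well.
	var ts object.Tombstone
	ts.SetMembers([]oid.ID{list[0]}) // objects covered by the tombstone

	payload, err := ts.Marshal()
	require.NoError(t, err)

	tombObj := object.New()
	tombObj.SetContainerID(cnrID)
	tombObj.SetOwnerID(&devenvOwner)
	tombObj.SetType(object.TypeTombstone)

	var prmPut PrmObjectPut
	prmPut.SetCopiesNumber(3) // only require the ONLINE nodes
	prmPut.SetHeader(*tombObj)
	prmPut.SetPayload(bytes.NewReader(payload))

	_, err = clientPool.PutObject(ctx, prmPut)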

Owner

I created TrueCloudLab/frostfs-api#33 and TrueCloudLab/frostfs-api#34 as possible protocol improvements. For now this issue is closed, because the described node behaviour follows the protocol.
