After shutting down a node, it disappears from the network map not after 4 epochs but after 7 #292

Closed
opened 2023-04-28 08:42:09 +00:00 by d.zayakin · 4 comments

Expected Behavior

When a node is shut down and we tick 4 epochs (the <netmap_cleaner: threshold> value from the IR config, 3 by default, plus 1 more), the disabled node should no longer be displayed in the network map.

Current Behavior

After ticking 4 epochs, the netmap should be up to date and contain no disconnected node. Currently this happens only after 7 epoch ticks.
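For reference, a minimal sketch of how to check the effective cleanup delay, assuming the IR config path used in the steps below and that the threshold lives under a netmap_cleaner section (the exact YAML layout is an assumption):

```bash
# Read the netmap_cleaner threshold from the IR config (path taken from this report;
# the key layout of the section is assumed, not confirmed).
sudo grep -A 3 'netmap_cleaner' /etc/frostfs/ir/config.yml

# With the default threshold of 3, the node is expected to leave the netmap
# after threshold + 1 = 4 epoch ticks.
THRESHOLD=3
echo "expected ticks until removal: $((THRESHOLD + 1))"
```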

Steps to Reproduce (for bugs)

  • Shut down node1
  • Make sure that node1 still has the status online in the network map
frostfs-cli --config /tmp/wallets/node1-storage-config.yml netmap snapshot --rpc-endpoint '127.0.0.1:8080' --wallet '/tmp/wallets/node1-storage.json'
  • Tick the epoch N times (N = <netmap_cleaner: threshold> from the IR config + 1; the default threshold is 3)
sudo cat /etc/frostfs/ir/config.yml

![image](/attachments/94815f9d-08e2-46b2-9249-9c84e48440af) (IR config output showing the netmap_cleaner threshold)
Tick epoch 4 times

sudo frostfs-adm --config /home/service/config.yaml morph force-new-epoch 
  • Make sure node1 is not in the network map
frostfs-cli --config /tmp/wallets/node1-storage-config.yml netmap snapshot --rpc-endpoint '127.0.0.1:8080' --wallet '/tmp/wallets/node1-storage.json'
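Putting the steps above together, a rough reproduction script might look like this. Paths, the endpoint and the threshold are taken verbatim from the steps above; shutting down node1 itself is left out because it depends on the environment:

```bash
#!/usr/bin/env bash
set -euo pipefail

RPC='127.0.0.1:8080'
WALLET='/tmp/wallets/node1-storage.json'
CLI_CFG='/tmp/wallets/node1-storage-config.yml'
ADM_CFG='/home/service/config.yaml'
THRESHOLD=3   # netmap_cleaner: threshold from /etc/frostfs/ir/config.yml

# node1 has already been shut down out of band; it should still show up as online here.
frostfs-cli --config "$CLI_CFG" netmap snapshot --rpc-endpoint "$RPC" --wallet "$WALLET"

# Tick threshold + 1 = 4 epochs.
for i in $(seq $((THRESHOLD + 1))); do
    sudo frostfs-adm --config "$ADM_CFG" morph force-new-epoch
done

# node1 is expected to be gone from the snapshot now, but in practice it only
# disappears after 7 ticks.
frostfs-cli --config "$CLI_CFG" netmap snapshot --rpc-endpoint "$RPC" --wallet "$WALLET"
```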

Version

FrostFS CLI
Version: v0.0.1-413-g0beb7ccf 
GoVersion: go1.18.4

d.zayakin added the
triage
label 2023-04-28 08:42:09 +00:00
fyrchik added this to the v0.37.0 milestone 2023-04-28 10:40:48 +00:00
fyrchik added
frostfs-ir
and removed
triage
labels 2023-04-28 10:40:58 +00:00
aarifullin was assigned by fyrchik 2023-05-03 07:59:21 +00:00
snegurochka added the
bug
label 2023-05-03 17:14:39 +00:00
Collaborator

I have found these problems:

  1. The scenario lacks an additional tick here (https://git.frostfs.info/TrueCloudLab/frostfs-testcases/src/branch/master/pytest_tests/testsuites/failovers/test_failover_server.py#L189)
  • We need to tick new epochs until the threshold kicks in, i.e. the scenario should force threshold + 1 ticks. At this moment frostfs-ir confirms that the node should be removed (vote to remove node from netmap)
  • frostfs-ir invokes the UpdateStateIR method of the netmap contract, which removes the node from its storage
  • frostfs-ir needs to process one more epoch to update the snapshot-id within the netmap contract. Only after that can frostfs-ir update its netmap with the node removed.

So, +1 should be fixed to +2

  2. There is a race between ticks here (https://git.frostfs.info/TrueCloudLab/frostfs-testcases/src/branch/master/pytest_tests/testsuites/failovers/test_failover_server.py#L191)
  • the script forces a new epoch
  • frostfs-ir tries to remove the node; in fact this happens asynchronously. At this moment the script forces a new epoch again
[E] - "new epoch"
[U] - UpdateStateIR
[R] - node has been removed from the contract's storage

--------[E_N+1]-[U]-[E_N+2]-[R]----------> t

So, we need to sleep between the last two forced ticks (see the sketch at the end of this comment)

  3. The cloud bootstraps the shut-down node for no reason.
    I hit this problem when running the scenario a second time (after a successful first run) in the deployed cluster.
    It seems the scenario must assert here (https://git.frostfs.info/TrueCloudLab/frostfs-testcases/src/branch/master/pytest_tests/testsuites/failovers/test_failover_server.py#L191) that the node is still shut down
  • A node is shut down
  • check_objects_replication is running; while it is running, the node is bootstrapped
  • The node is removed by frostfs-ir
  • The node is bootstrapped again and frostfs-ir invokes AddPeer on the netmap contract
  • The node is returned back to the netmap
  • Further ticks don't help because the node is not removed

I strongly recommend fixing wait_for_host_offline, which for now relies on the shell's ping and does not really check whether the node has been shut down.
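To make the fixes for points 1 and 2 concrete, here is a rough shell sketch of the adjusted tick sequence, using the frostfs-adm command from the reproduction steps. The real scenario lives in the Python test suite, so this only illustrates the sequence; the threshold value, the pause length and the port-based offline check are assumptions, not the test suite's actual helpers:

```bash
#!/usr/bin/env bash
set -euo pipefail

ADM_CFG='/home/service/config.yaml'
THRESHOLD=3              # netmap_cleaner: threshold from the IR config
EPOCH_SETTLE_SECONDS=10  # assumed pause, long enough for UpdateStateIR to land before the next tick

# Point 1: force threshold + 2 epochs instead of threshold + 1.
# Point 2: pause after every tick; sleeping after each one is the simplest way to
# guarantee a pause between the last two, so the asynchronous UpdateStateIR / removal
# can complete before the next "new epoch" event.
for i in $(seq $((THRESHOLD + 2))); do
    sudo frostfs-adm --config "$ADM_CFG" morph force-new-epoch
    sleep "$EPOCH_SETTLE_SECONDS"
done

# A stricter stand-in for wait_for_host_offline: instead of ping, check that the
# storage node's endpoint no longer accepts TCP connections (host/port are placeholders).
node_is_offline() {
    ! timeout 3 bash -c "</dev/tcp/$1/$2" 2>/dev/null
}
if node_is_offline 127.0.0.1 8080; then
    echo "node1 endpoint is unreachable"
fi
```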

fyrchik added the
blocked
label 2023-05-29 08:09:58 +00:00

Blocked until further discussion.
We need to discuss the behaviour; maybe we could try removing nodes in the contract itself. Currently it is hard to test and debug.


The problem now reproduces when the nodes are loaded with objects of 200 MB or larger. With smaller objects, the bug reproduces only intermittently.

fyrchik modified the milestone from v0.37.0 to v0.38.0 2023-08-29 09:36:21 +00:00
aarifullin was unassigned by fyrchik 2023-10-19 15:31:00 +00:00
fyrchik modified the milestone from v0.38.0 to v0.40.0 2024-05-16 10:45:15 +00:00
fyrchik modified the milestone from v0.40.0 to v0.41.0 2024-06-01 09:19:51 +00:00
fyrchik modified the milestone from v0.41.0 to v0.42.0 2024-06-14 07:08:10 +00:00
fyrchik removed the
blocked
label 2024-06-14 10:50:16 +00:00

There was a race condition in node #1110; it could be related.

Reference: TrueCloudLab/frostfs-node#292