After shutting down a node, it disappears from the network map not after 4 epochs but after 7 #292

Closed
opened 2023-04-28 08:42:09 +00:00 by d.zayakin · 4 comments

Expected Behavior

When a node is shut down and we tick 4 epochs (the <netmap_cleaner: threshold> value from the IR config, 3 by default, plus 1 more), the disabled node should no longer be displayed in the network map.

Current Behavior

After ticking 4 epochs, the netmap should be up to date and contain no disconnected node. Currently this happens only after 7 epoch ticks.
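For reference, a minimal sketch of how to check the effective cleanup delay, assuming the IR config path used in the steps below and that the threshold lives under a netmap_cleaner section (the exact YAML layout is an assumption):

```bash
# Read the netmap_cleaner threshold from the IR config (path taken from this report;
# the key layout of the section is assumed, not confirmed).
sudo grep -A 3 'netmap_cleaner' /etc/frostfs/ir/config.yml

# With the default threshold of 3, the node is expected to leave the netmap
# after threshold + 1 = 4 epoch ticks.
THRESHOLD=3
echo "expected ticks until removal: $((THRESHOLD + 1))"
```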

Steps to Reproduce (for bugs)

  • Shut down node1
  • Make sure that node1 still has the status online in the network map
frostfs-cli --config /tmp/wallets/node1-storage-config.yml netmap snapshot --rpc-endpoint '127.0.0.1:8080' --wallet '/tmp/wallets/node1-storage.json'
  • Tick the epoch N times (N = <netmap_cleaner: threshold> from the IR config + 1; the default threshold is 3)
sudo cat /etc/frostfs/ir/config.yml

![image](/attachments/94815f9d-08e2-46b2-9249-9c84e48440af) (IR config output showing the netmap_cleaner threshold)
Tick epoch 4 times

sudo frostfs-adm --config /home/service/config.yaml morph force-new-epoch 
  • Make sure node1 is not in the network map
frostfs-cli --config /tmp/wallets/node1-storage-config.yml netmap snapshot --rpc-endpoint '127.0.0.1:8080' --wallet '/tmp/wallets/node1-storage.json'
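Putting the steps above together, a rough reproduction script might look like this. Paths, the endpoint and the threshold are taken verbatim from the steps above; shutting down node1 itself is left out because it depends on the environment:

```bash
#!/usr/bin/env bash
set -euo pipefail

RPC='127.0.0.1:8080'
WALLET='/tmp/wallets/node1-storage.json'
CLI_CFG='/tmp/wallets/node1-storage-config.yml'
ADM_CFG='/home/service/config.yaml'
THRESHOLD=3   # netmap_cleaner: threshold from /etc/frostfs/ir/config.yml

# node1 has already been shut down out of band; it should still show up as online here.
frostfs-cli --config "$CLI_CFG" netmap snapshot --rpc-endpoint "$RPC" --wallet "$WALLET"

# Tick threshold + 1 = 4 epochs.
for i in $(seq $((THRESHOLD + 1))); do
    sudo frostfs-adm --config "$ADM_CFG" morph force-new-epoch
done

# node1 is expected to be gone from the snapshot now, but in practice it only
# disappears after 7 ticks.
frostfs-cli --config "$CLI_CFG" netmap snapshot --rpc-endpoint "$RPC" --wallet "$WALLET"
```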

Version

FrostFS CLI
Version: v0.0.1-413-g0beb7ccf 
GoVersion: go1.18.4

d.zayakin added the
triage
label 2023-04-28 08:42:09 +00:00
fyrchik added this to the v0.37.0 milestone 2023-04-28 10:40:48 +00:00
fyrchik added
frostfs-ir
and removed
triage
labels 2023-04-28 10:40:58 +00:00
aarifullin was assigned by fyrchik 2023-05-03 07:59:21 +00:00
snegurochka added the
bug
label 2023-05-03 17:14:39 +00:00
Collaborator

I have found these problems:

  1. The scenario lacks an additional tick here (https://git.frostfs.info/TrueCloudLab/frostfs-testcases/src/branch/master/pytest_tests/testsuites/failovers/test_failover_server.py#L189)
  • We need to tick new epochs until the threshold kicks in, i.e. the scenario should force threshold + 1 ticks. At this moment frostfs-ir confirms that the node should be removed (vote to remove node from netmap)
  • frostfs-ir invokes the UpdateStateIR method of the netmap contract, which removes the node from its storage
  • frostfs-ir needs to process one more epoch to update the snapshot-id within the netmap contract. Only after that can frostfs-ir update its netmap with the node removed.

So, +1 should be fixed to +2

  2. There is a race between ticks here (https://git.frostfs.info/TrueCloudLab/frostfs-testcases/src/branch/master/pytest_tests/testsuites/failovers/test_failover_server.py#L191)
  • the script forces a new epoch
  • frostfs-ir tries to remove the node; in fact this happens asynchronously. At this moment the script forces a new epoch again
[E] - "new epoch"
[U] - UpdateStateIR
[R] - node has been removed from the contract's storage

--------[E_N+1]-[U]-[E_N+2]-[R]----------> t

So, we need to sleep between the last two forced ticks (see the sketch at the end of this comment)

  3. The cloud bootstraps the shut-down node for no reason.
    I hit this problem when running the scenario a second time (after a successful first run) in the deployed cluster.
    It seems the scenario must assert here (https://git.frostfs.info/TrueCloudLab/frostfs-testcases/src/branch/master/pytest_tests/testsuites/failovers/test_failover_server.py#L191) that the node is still shut down
  • A node is shut down
  • check_objects_replication is running; while it is running, the node is bootstrapped
  • The node is removed by frostfs-ir
  • The node is bootstrapped again and frostfs-ir invokes AddPeer on the netmap contract
  • The node is returned back to the netmap
  • Further ticks don't help because the node is not removed

I strongly recommend fixing wait_for_host_offline, which for now relies on the shell's ping and does not really check whether the node has been shut down.
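To make the fixes for points 1 and 2 concrete, here is a rough shell sketch of the adjusted tick sequence, using the frostfs-adm command from the reproduction steps. The real scenario lives in the Python test suite, so this only illustrates the sequence; the threshold value, the pause length and the port-based offline check are assumptions, not the test suite's actual helpers:

```bash
#!/usr/bin/env bash
set -euo pipefail

ADM_CFG='/home/service/config.yaml'
THRESHOLD=3              # netmap_cleaner: threshold from the IR config
EPOCH_SETTLE_SECONDS=10  # assumed pause, long enough for UpdateStateIR to land before the next tick

# Point 1: force threshold + 2 epochs instead of threshold + 1.
# Point 2: pause after every tick; sleeping after each one is the simplest way to
# guarantee a pause between the last two, so the asynchronous UpdateStateIR / removal
# can complete before the next "new epoch" event.
for i in $(seq $((THRESHOLD + 2))); do
    sudo frostfs-adm --config "$ADM_CFG" morph force-new-epoch
    sleep "$EPOCH_SETTLE_SECONDS"
done

# A stricter stand-in for wait_for_host_offline: instead of ping, check that the
# storage node's endpoint no longer accepts TCP connections (host/port are placeholders).
node_is_offline() {
    ! timeout 3 bash -c "</dev/tcp/$1/$2" 2>/dev/null
}
if node_is_offline 127.0.0.1 8080; then
    echo "node1 endpoint is unreachable"
fi
```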

fyrchik added the
blocked
label 2023-05-29 08:09:58 +00:00

Blocked until further discussion.
We need to discuss the behaviour; maybe we could try removing nodes in the contract itself. Currently it is hard to test and debug.


The problem now reproduces when the nodes are loaded with objects of 200 MB or larger. With smaller objects, the bug reproduces only intermittently.

fyrchik modified the milestone from v0.37.0 to v0.38.0 2023-08-29 09:36:21 +00:00
aarifullin was unassigned by fyrchik 2023-10-19 15:31:00 +00:00
fyrchik modified the milestone from v0.38.0 to v0.40.0 2024-05-16 10:45:15 +00:00
fyrchik modified the milestone from v0.40.0 to v0.41.0 2024-06-01 09:19:51 +00:00
fyrchik modified the milestone from v0.41.0 to v0.42.0 2024-06-14 07:08:10 +00:00
fyrchik removed the
blocked
label 2024-06-14 10:50:16 +00:00

There was a race condition in node #1110; it could be related.

Reference: TrueCloudLab/frostfs-node#292