After shutting down a node, it disappears from the network map not after 4 epochs, but after 7 #292
Expected Behavior
When a node is disabled, we tick 4 epochs (we take the <netmap_cleaner: threshold> value from the IR config, 3 by default, and add 1 more), after which the disabled node should no longer be displayed in the netmap.
Current Behavior
After 4 epoch ticks the netmap information should be up to date and the disconnected node should be gone. Currently this happens only after 7 epoch ticks.
Steps to Reproduce (for bugs)
Tick the epoch 4 times.
Version
I have found these problems:

1. After `threshold + 1` ticks `frost-ir` confirms that the node should be removed (vote to remove node from netmap). `frost-ir` invokes the `UpdateStateIR` method of the `netmap` contract, which removes the node from the contract's storage. `frost-ir` then needs to process one more epoch to update the snapshot id within the `netmap` contract; only after that can `frost-ir` update its netmap with the node removed. So the `+1` should be fixed to `+2` (see the sketch after this list).
2. `frost-ir` tries to remove the node, and in fact this happens asynchronously. At that moment the script already forces a new epoch again, so we need to sleep between the last two forced ticks.
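A minimal Go sketch of what the corrected wait logic could look like, encoding both points above: wait `threshold + 2` epochs instead of `threshold + 1`, and pause before the final tick so the asynchronous removal can finish. The `forceNewEpoch` helper here is hypothetical, standing in for whatever mechanism the scenario uses to force an epoch; it is not an API from frostfs-node.

```go
package netmaptest

import "time"

// forceNewEpoch stands in for whatever mechanism the scenario uses to force
// a new epoch (an admin tool call, a contract invocation, ...). It is a
// hypothetical helper for this sketch, not an API from frostfs-node.
func forceNewEpoch() {
	// ... trigger one epoch tick ...
}

// waitNodeRemoved forces enough epochs for a shut-down node to leave the
// netmap, following the sequence described above:
//   - after threshold+1 ticks frost-ir votes and invokes UpdateStateIR;
//   - one more tick is needed for the netmap contract snapshot to advance,
//     so the total is threshold+2 ("+1 should be fixed to +2");
//   - the removal itself is asynchronous, so sleep before the final tick.
func waitNodeRemoved(threshold int, removalDelay time.Duration) {
	total := threshold + 2
	for i := 0; i < total; i++ {
		if i == total-1 {
			time.Sleep(removalDelay)
		}
		forceNewEpoch()
	}
}
```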
I get this problem when I try to run the scenario a second time (when the first run is successful) in the deployed cluster. It seems the scenario must assert here that the node is still shut down **!!!** While `check_objects_replication` is running, the node is bootstrapped again and `frost-ir` invokes `AddPeer` in the `netmap` contract. I strongly recommend fixing `wait_for_host_offline`, which temporarily uses the shell's `ping` but does not really check whether the node has been shut down (a possible shape for that check is sketched below).

Blocked until further discussion.
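For reference, a possible shape for a stricter `wait_for_host_offline`, sketched in Go purely as an illustration: instead of ICMP `ping`, it checks whether the node's service endpoint still accepts TCP connections. The endpoint address, polling interval and timeout are assumptions, not values from the test suite.

```go
package netmaptest

import (
	"errors"
	"net"
	"time"
)

// waitForHostOffline polls the node's service endpoint (e.g. its gRPC
// address) and returns nil as soon as a connection attempt fails, i.e. when
// nothing is listening any more. Unlike a shell ping, this distinguishes
// "the host answers ICMP" from "the storage service is actually down".
func waitForHostOffline(endpoint string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		conn, err := net.DialTimeout("tcp", endpoint, time.Second)
		if err != nil {
			return nil // dial failed: the node is really offline
		}
		conn.Close()
		time.Sleep(time.Second)
	}
	return errors.New("node endpoint still accepts connections: " + endpoint)
}
```

The same idea can be expressed in whatever language the test framework uses; the point is to probe the service port rather than ICMP reachability.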
We need to discuss the behaviour; maybe we could try removing nodes in the contract itself. Currently it is hard to test and debug.
Now the problem is reproduced when the nodes are loaded with objects of 200 MB and larger. If the objects are smaller than that, the bug reproduces only irregularly.
There was a race condition in node #1110, which could be related.