Set "resync_metabase: true" leads to OOM on virtual environment #428

Closed
opened 2023-06-06 12:27:02 +00:00 by anikeev-yadro · 2 comments
Member

Autotest testsuites.failovers.test_failover_storage.TestStorageDataLoss#test_metabase_loss

Expected Behavior

Set "resync_metabase: true" shouldn't leads to OOM.

Current Behavior

Set "resync_metabase: true" leads to OOM.

Steps to Reproduce (for bugs)

  1. Stop ALL storage nodes
  2. Delete ALL metabases on ALL nodes
  3. Set "resync_metabase: true" on ALL nodes
  4. Start ALL storage nodes
  5. See OOM errors in log on node1 and node2:
```
HOST: 172.26.160.125
COMMAND:
 sudo journalctl --no-pager --since '2023-06-02 02:03:43' --until '2023-06-02 06:26:49' --grep '\Wpanic\W|\Woom\W|\Wtoo many open files\W' --case-sensitive=0
RC:
 0
STDOUT:
 -- Journal begins at Fri 2023-06-02 01:43:59 UTC, ends at Fri 2023-06-02 06:28:21 UTC. --
 -- Boot 568f976d11e14629a3b1e240bbe0b8f3 --
 Jun 02 05:10:30 frostfs-failover-node1 kernel: tatlin-object-s invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
 Jun 02 05:10:30 frostfs-failover-node1 systemd[1]: frostfs-storage.service: A process of this unit has been killed by the OOM killer.
 Jun 02 05:10:31 frostfs-failover-node1 systemd[1]: frostfs-storage.service: Failed with result 'oom-kill'.
 Jun 02 05:16:22 frostfs-failover-node1 kernel: neo-go invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
 Jun 02 05:19:21 frostfs-failover-node1 kernel: prometheus-aler invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
 Jun 02 05:21:25 frostfs-failover-node1 kernel: tatlin-object-s invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
 Jun 02 05:22:25 frostfs-failover-node1 kernel: prometheus-node invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
 Jun 02 05:27:12 frostfs-failover-node1 kernel: vmagent invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
 Jun 02 05:29:36 frostfs-failover-node1 kernel: frostfs-http-gw invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
 Jun 02 05:32:09 frostfs-failover-node1 kernel: tatlin-object-s invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
 Jun 02 05:38:08 frostfs-failover-node1 kernel: neo-go invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
 Jun 02 05:38:07 frostfs-failover-node1 systemd[1]: frostfs-storage.service: A process of this unit has been killed by the OOM killer.
 Jun 02 05:38:07 frostfs-failover-node1 systemd[1]: frostfs-storage.service: Failed with result 'oom-kill'.
 Jun 02 05:43:00 frostfs-failover-node1 kernel: tatlin-object-s invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

STDERR:

Start / End / Elapsed	 06:26:49.551727 / 06:27:03.755407 / 0:00:14.203680
```
```
HOST: 172.26.160.31
COMMAND:
 sudo journalctl --no-pager --since '2023-06-02 02:03:43' --until '2023-06-02 06:26:49' --grep '\Wpanic\W|\Woom\W|\Wtoo many open files\W' --case-sensitive=0
RC:
 0
STDOUT:
 -- Journal begins at Fri 2023-06-02 01:44:00 UTC, ends at Fri 2023-06-02 06:28:49 UTC. --
 -- Boot f7726e3e55bd401d95b51f557389f20b --
 -- Boot 6b77a0583184445e84f6ffad831e9b35 --
 -- Boot de71e47677e44d4783b5d2d0a66117cd --
 -- Boot 6a24af39e54649efa36f381346e9cb65 --
 -- Boot 193d002ed7a046aba727f7819674d8ef --
 Jun 02 05:13:44 frostfs-failover-node2 kernel: vmagent invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
 Jun 02 05:13:44 frostfs-failover-node2 systemd[1]: frostfs-storage.service: A process of this unit has been killed by the OOM killer.
 Jun 02 05:13:44 frostfs-failover-node2 systemd[1]: frostfs-storage.service: Failed with result 'oom-kill'.
 Jun 02 05:37:56 frostfs-failover-node2 kernel: neo-go invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
 Jun 02 05:37:55 frostfs-failover-node2 systemd[1]: frostfs-storage.service: A process of this unit has been killed by the OOM killer.
 Jun 02 05:37:55 frostfs-failover-node2 systemd[1]: frostfs-storage.service: Failed with result 'oom-kill'.
 Jun 02 05:44:21 frostfs-failover-node2 kernel: neo-go invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

STDERR:

Start / End / Elapsed	 06:27:17.562353 / 06:27:23.340772 / 0:00:05.778419
```
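For reference, here is a minimal shell sketch of how steps 1–4 can be performed manually on a single node. The service name `frostfs-storage` matches the logs above; the metabase path, shard layout, and config file location are illustrative assumptions, not values taken from this report.

```sh
# Stop the storage service (step 1).
sudo systemctl stop frostfs-storage

# Delete the shard metabase files (step 2); the path is hypothetical and
# depends on the deployment.
sudo rm -f /srv/frostfs/meta*/meta.db

# Enable metabase resync (step 3) in the node config, e.g. in config.yml
# under the shard section (the exact layout here is an assumption):
#
#   storage:
#     shard:
#       default:
#         resync_metabase: true

# Start the service again (step 4) and watch the journal for OOM events (step 5).
sudo systemctl start frostfs-storage
sudo journalctl -u frostfs-storage -f
```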

Context

Test subject: after metabase loss on all nodes, operations on objects and buckets should still be available via S3.
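As an illustration, the availability check can be as simple as listing buckets and reading an object through the S3 gateway after the resync finishes; the endpoint, bucket, and key below are placeholders, not values from the test.

```sh
# Hypothetical S3 availability check; endpoint, bucket, and key are placeholders.
aws s3 ls --endpoint-url http://s3-gate.example:8080
aws s3api head-object --endpoint-url http://s3-gate.example:8080 \
    --bucket test-bucket --key test-object
```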

Version

0.0.1-432-g405e17b2

Your Environment

Virtual
4 nodes

anikeev-yadro added the bug, triage labels 2023-06-06 12:27:02 +00:00
Member

@anikeev-yadro, please, can you clarify:

How is it possible to delete piloramas while the cluster is shut down?

 with allure.step("Stop storage services on all nodes"):
     cluster_state_controller.stop_all_storage_services()

 with allure.step("Delete metabase from all nodes"):
     for node in cluster_state_controller.cluster.storage_nodes:
         node.delete_metabase()
Author
Member

`cluster_state_controller.stop_all_storage_services()` means that we stop the `frostfs-storage` service.

fyrchik added this to the v0.38.0 milestone 2023-06-14 13:19:26 +00:00
dstepanov-yadro self-assigned this 2023-06-20 14:35:27 +00:00
Reference: TrueCloudLab/frostfs-node#428