node: Process killing by systemd #259

Closed
opened 2023-04-17 09:38:10 +00:00 by acid-ant · 10 comments
Collaborator

Expected Behavior

# sudo systemctl stop frostfs-storage.service
# sudo systemctl status frostfs-storage.service --lines=0
● frostfs-storage.service - FrostFS Storage node
      ...
      Active: inactive (dead) since Sun 2023-04-16 05:20:57 UTC; 611ms ago
      ...

Current Behavior

# sudo systemctl stop frostfs-storage.service
# sudo systemctl status frostfs-storage.service --lines=0
● frostfs-storage.service - FrostFS Storage node
      ...
      Active: failed (Result: timeout) since Mon 2023-04-17 05:23:29 UTC; 1min 14s ago
      ...

Steps to Reproduce

  1. Exclude node from network map - frostfs-cli control set-status --status offline
  2. Tick Epoch - frostfs-adm morph force-new-epoch
  3. Stop service - sudo systemctl stop frostfs-storage.service

Regression
Yes

acid-ant added the
triage
label 2023-04-17 09:38:10 +00:00

We hit the same issue on a hardware deployment with just step 3:

sudo systemctl stop frostfs-storage.service
sudo systemctl status frostfs-storage.service --lines=0

<..>
Active: failed (Result: timeout) since Tue 2023-04-18 11:08:36 UTC; 576ms ago
<..>
Poster
Collaborator

@abereziny could you add one more pprof call to the test code? I think somewhere near systemctl status.
With this info it will be much easier to solve this issue:

# curl http://{NODE_IP}:6060/debug/pprof/goroutine?debug=1
goroutine profile: total 212
64 @ 0x44345d 0x4542aa 0x13e3248 0x478861
#    0x13e3247    git.frostfs.info/TrueCloudLab/frostfs-node/pkg/services/tree.(*Service).localReplicationWorker+0x107    git.frostfs.info/TrueCloudLab/frostfs-node/pkg/services/tree/replicator.go:46

64 @ 0x44345d 0x4542aa 0x13e3a4b 0x478861
#    0x13e3a4a    git.frostfs.info/TrueCloudLab/frostfs-node/pkg/services/tree.(*Service).replicationWorker+0x10a    git.frostfs.info/TrueCloudLab/frostfs-node/pkg/services/tree/replicator.go:69

12 @ 0x44345d 0x40763e 0x4072f8 0x1174125 0x478861
#    0x1174124    github.com/panjf2000/ants/v2.(*Pool).periodicallyPurge+0x104    github.com/panjf2000/ants/v2@v2.4.0/pool.go:72

7 @ 0x44345d 0x40763e 0x4072f8 0x1175587 0x478861
...
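
A profile like the one above can be condensed into one line per stack group to make repeated snapshots easier to compare. The following is only a sketch: `summarize_goroutines` is a hypothetical helper name, and it assumes the `debug=1` text layout shown above (a `N @ ...` header followed by `#` frame lines).

```shell
# summarize_goroutines (hypothetical helper): read a goroutine?debug=1
# profile on stdin and print "<count> <top frame function>" for each
# stack group, largest group first.
summarize_goroutines() {
    awk '
        # Group header, e.g. "64 @ 0x44345d 0x4542aa ..."
        $2 == "@" { count = $1; want_frame = 1; next }
        # First frame line after a header:
        # "# <address> <function+offset> <file:line>"
        want_frame && $1 == "#" { print count, $3; want_frame = 0 }
    ' | sort -rn
}

# Possible usage (NODE_IP is a placeholder, as in the command above):
#   curl -s "http://${NODE_IP}:6060/debug/pprof/goroutine?debug=1" | summarize_goroutines
```

Groups whose frames are missing from the input are skipped rather than guessed.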
Poster
Collaborator

@abereziny the idea is to stop the service in the background and check its status (with pprof) until the service has stopped.

> @abereziny the idea is to stop the service in the background and check its status (with pprof) until the service has stopped.

sudo systemctl stop frostfs-storage.service is a synchronous call. By the time it returns control, the service has already failed.
So if we want some info during the shutdown, we should probably do nohup sudo systemctl... or something.
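
One way to sketch the "stop in the background and poll" idea from this exchange is shown below. It is only an illustration, not the actual test code: `snapshot_while_running` and its arguments are hypothetical, and it assumes passwordless sudo and that the pprof endpoint from the earlier comment is reachable.

```shell
# snapshot_while_running (hypothetical helper): run a blocking command in the
# background and, until it finishes, periodically save the output of a
# snapshot command.
# usage: snapshot_while_running OUT_DIR INTERVAL_SEC SNAPSHOT_CMD COMMAND [ARGS...]
snapshot_while_running() {
    out_dir=$1; interval=$2; snapshot_cmd=$3
    shift 3
    mkdir -p "$out_dir"

    # The background job records its exit status in a file. Polling the file
    # avoids signalling a process that may be owned by root (e.g. sudo).
    status_file="$out_dir/exit_status"
    rm -f "$status_file"
    ( rc=0; "$@" || rc=$?; echo "$rc" > "$status_file" ) &

    i=0
    while [ ! -s "$status_file" ]; do
        # A failing snapshot (e.g. endpoint already closed) is not fatal.
        sh -c "$snapshot_cmd" > "$out_dir/snapshot-$i.txt" 2>&1 || true
        i=$((i + 1))
        sleep "$interval"
    done

    wait    # reap the background job
    return "$(cat "$status_file")"
}

# Possible usage (NODE_IP is a placeholder):
#   snapshot_while_running ./pprof 1 \
#       "curl -s 'http://${NODE_IP}:6060/debug/pprof/goroutine?debug=1'" \
#       sudo systemctl stop frostfs-storage.service
#   sudo systemctl status frostfs-storage.service --lines=0
```

The last snapshot written before the stop command returns is the one closest to the moment systemd gives up on the process.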

fyrchik added the
P2
label 2023-04-19 14:07:31 +00:00

I wasn't able to reproduce this on a clean cluster. This means that we need a pre-filled cluster, and I'm currently struggling to find a free one.

Can it be related to asynchronous write-cache initialization? cc @carpawell
snegurochka added the
bug
label 2023-05-03 17:14:42 +00:00
fyrchik added this to the v0.38.0 milestone 2023-05-18 08:32:43 +00:00

Related #362, #364, #366.
acid-ant self-assigned this 2023-05-18 08:42:27 +00:00
Poster
Collaborator

The list of goroutines captured before the process was killed by systemd is in the attachment.

I finally was able to gather pprof snapshots during process shutdown.
Poster
Collaborator

Closed by #362, #363, #364, #366, #379, #403, #404
Reference: TrueCloudLab/frostfs-node#259