node: Process killing by systemd

acid-ant commented

2023-04-17 09:38:10 +00:00

Member

Expected Behavior

# sudo systemctl stop frostfs-storage.service
# sudo systemctl status frostfs-storage.service --lines=0
● frostfs-storage.service - FrostFS Storage node
      ...
      Active: inactive (dead) since Sun 2023-04-16 05:20:57 UTC; 611ms ago
      ...

Current Behavior

# sudo systemctl stop frostfs-storage.service
# sudo systemctl status frostfs-storage.service --lines=0
● frostfs-storage.service - FrostFS Storage node
      ...
      Active: failed (Result: timeout) since Mon 2023-04-17 05:23:29 UTC; 1min 14s ago
      ...

Steps to Reproduce

1. Exclude node from network map - `frostfs-cli control set-status --status offline`
2. Tick Epoch - `frostfs-adm morph force-new-epoch`
3. Stop service - `sudo systemctl stop frostfs-storage.service`

Regression
Yes

acid-ant added the triage label 2023-04-17 09:38:10 +00:00

abereziny commented

2023-04-18 11:45:09 +00:00

Member

We hit the same issue on a hardware deployment with just step 3:

sudo systemctl stop frostfs-storage.service
sudo systemctl status frostfs-storage.service --lines=0

<..>
Active: failed (Result: timeout) since Tue 2023-04-18 11:08:36 UTC; 576ms ago
<..>

acid-ant commented

2023-04-18 12:04:41 +00:00

Author

Member

@abereziny could you add one more call to pprof in the test code, somewhere near `systemctl status`?
With this info it will be much easier to solve this issue:

# curl http://{NODE_IP}:6060/debug/pprof/goroutine?debug=1
goroutine profile: total 212
64 @ 0x44345d 0x4542aa 0x13e3248 0x478861
#    0x13e3247    git.frostfs.info/TrueCloudLab/frostfs-node/pkg/services/tree.(*Service).localReplicationWorker+0x107    git.frostfs.info/TrueCloudLab/frostfs-node/pkg/services/tree/replicator.go:46

64 @ 0x44345d 0x4542aa 0x13e3a4b 0x478861
#    0x13e3a4a    git.frostfs.info/TrueCloudLab/frostfs-node/pkg/services/tree.(*Service).replicationWorker+0x10a    git.frostfs.info/TrueCloudLab/frostfs-node/pkg/services/tree/replicator.go:69

12 @ 0x44345d 0x40763e 0x4072f8 0x1174125 0x478861
#    0x1174124    github.com/panjf2000/ants/v2.(*Pool).periodicallyPurge+0x104    github.com/panjf2000/ants/v2@v2.4.0/pool.go:72

7 @ 0x44345d 0x40763e 0x4072f8 0x1175587 0x478861
...
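
To make a dump like the one above quicker to scan, the groups can be reduced to a count plus the first listed frame. This is a rough sketch that assumes the `debug=1` layout shown here (an `N @ addr...` header followed by `# addr func+off file:line` lines); `summarize_goroutines` is a hypothetical helper name, not part of any FrostFS tooling.

```shell
# Sketch: print "<count> <function>" for each goroutine group in a debug=1 profile.
# Assumes the "N @ addr..." header and "# addr func+off file:line" layout shown above.
summarize_goroutines() {
  awk '
    /^[0-9]+ @ / { count = $1; pending = 1; next }   # group header: goroutine count
    pending && /^#/ { print count, $3; pending = 0 } # first symbolized frame of the group
  ' "$@"
}
```

It reads from a file argument or from standard input, for example `curl -s "http://{NODE_IP}:6060/debug/pprof/goroutine?debug=1" | summarize_goroutines`.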

acid-ant commented

2023-04-18 12:33:07 +00:00

Author

Member

@abereziny the idea is to stop the service in the background and check its status (with pprof) until the service has stopped.

abereziny commented

2023-04-18 12:39:08 +00:00

Member

> @abereziny the idea is to stop the service in the background and check its status (with pprof) until the service has stopped.

`sudo systemctl stop frostfs-storage.service` is a synchronous call. By the time it returns control, the service has already failed.
So if we want some info while the service is stopping, we should probably run `nohup sudo systemctl...` or something similar.
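
A minimal sketch of the stop-in-background idea, assuming the script runs as root and that pprof listens on port 6060 as in the earlier `curl` example. The `PPROF_URL` default, the output directory, the polling interval, and the dump limit are illustrative assumptions, not values taken from the test code; `capture_while_stopping` and `unit_is_stopping` are hypothetical helper names.

```shell
# Sketch only: stop the unit in the background and dump goroutines while it stops.
UNIT="${UNIT:-frostfs-storage.service}"
# Assumption: pprof is reachable here; replace 127.0.0.1 with the node IP.
PPROF_URL="${PPROF_URL:-http://127.0.0.1:6060/debug/pprof/goroutine?debug=1}"

# Succeeds while systemd still reports the unit as running or shutting down.
unit_is_stopping() {
  uis_state="$(systemctl is-active "$UNIT")"
  [ "$uis_state" = "active" ] || [ "$uis_state" = "deactivating" ]
}

# capture_while_stopping OUT_DIR [INTERVAL_SECONDS] [MAX_DUMPS]
capture_while_stopping() {
  cws_dir="$1"
  cws_interval="${2:-1}"
  cws_max="${3:-120}"
  cws_n=0
  mkdir -p "$cws_dir"

  # "systemctl stop" blocks until the unit stops or times out, so background it.
  systemctl stop "$UNIT" &
  cws_pid=$!

  while [ "$cws_n" -lt "$cws_max" ] && unit_is_stopping; do
    # A failed curl (for example, pprof already shut down) must not end the loop.
    curl -s --max-time 5 "$PPROF_URL" > "$cws_dir/goroutine-$cws_n.txt" || true
    cws_n=$((cws_n + 1))
    sleep "$cws_interval"
  done

  wait "$cws_pid" || true
  systemctl status "$UNIT" --lines=0 || true
}
```

For example, `capture_while_stopping /tmp/frostfs-stop-dumps 1` takes one dump per second; the last `goroutine-N.txt` written before the unit left the `deactivating` state is the one most likely to show what was still running at the timeout.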
