Dev env produces a lot of zombie processes #69

Phaseant commented

2024-06-27 07:22:30 +00:00

Dev env produces a lot of zombie frostfs-cli and grep processes

Expected Behavior

No zombie processes should be created

Current Behavior

A lot of zombie processes are waiting to end

Possible Solution

Steps to Reproduce (for bugs)

1. `make up`
2. list zombie processes: `ps -ef | grep defunct`
3. count them: `ps -ef | grep defunct | wc -l`
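
A note on the commands above: `ps -ef | grep defunct` usually matches the `grep defunct` process itself as well, so the count can be off by one. Below is a minimal sketch that filters on the process state column instead and also prints each zombie's parent PID. The function names are made up for illustration, and it assumes a `ps` that supports the `stat` output field (procps on Linux does).

```shell
# Keep only rows whose state column (3rd field) starts with Z (zombie).
# Reads "pid ppid stat comm" rows on stdin, so saved output can be fed in too.
filter_zombies() {
    awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# List zombies as "pid ppid command".
# Empty headers (pid=, ppid=, ...) suppress the title row.
list_zombies() {
    ps -e -o pid= -o ppid= -o stat= -o comm= | filter_zombies
}

# Count zombies without counting the pipeline's own processes.
count_zombies() {
    list_zombies | wc -l
}
```

Running `list_zombies` on the host shows which PID the defunct processes belong to, which is the piece of information needed to trace them back to a container.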

Context

This issue degrades performance of the whole server: average CPU usage is 100% because of it.

Attachments: Screenshot 2024-06-27 at 10.22.12.png (17 KiB), Screenshot 2024-06-27 at 10.22.27.png (727 KiB)

Phaseant added the bug label 2024-06-27 07:22:30 +00:00

Phaseant commented

2024-06-27 09:02:38 +00:00

Author

The parents of these processes are `frostfs-node --config /etc/frostfs/storage/config.yml`

fyrchik commented

2024-06-27 10:48:13 +00:00

Owner

It could be a healthcheck (https://git.frostfs.info/TrueCloudLab/frostfs-dev-env/src/commit/2b6122192a4c26a84e6f8faa5533ab52fb2a88ef/services/storage/healthcheck.sh#L3):

/frostfs-cli control healthcheck -c /cli-cfg.yml \

fyrchik self-assigned this 2024-06-27 10:51:11 +00:00

Phaseant commented

2024-06-27 10:57:24 +00:00

Author

so it is expected behaviour?

fyrchik commented

2024-06-27 11:54:45 +00:00

Owner

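No, it is not. I was just trying to find where these processes were spawned.

It likely happens when the healthcheck cannot finish within the required timeout.
Here we have 1s (https://git.frostfs.info/TrueCloudLab/frostfs-dev-env/src/commit/2b6122192a4c26a84e6f8faa5533ab52fb2a88ef/services/storage/docker-compose.yml#L41):

timeout: 1s

And the default `frostfs-cli` timeout is 15 seconds.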

So what happens is that we execute shell, which spawns subprocesses, then the parent is killed (because of the timeout), and all the children are retained.
The solution is to not spawn any subprocesses. For this to work we need to support this in frostfs-cli directly, I will create a task shortly.
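
To illustrate what "not spawning subprocesses" can look like in a shell script: when the check is a single command whose exit status is enough, the script can `exec` it, so the command replaces the shell instead of running as its child. The sketch below is a generic demonstration of that mechanism, not the actual fix in this repository. `pids_with_exec` is a made-up name.

```shell
# Print the PID of a wrapper shell, then the PID of the command it exec's.
# With exec the command replaces the shell, so both lines show the same PID
# and no child is left behind if the wrapper is killed.
pids_with_exec() {
    sh -c 'echo $$; exec sh -c "echo \$\$"'
}
```

A script that pipes the CLI output into `grep` cannot do this, because a pipeline always needs more than one process. That is why the check has to be expressible as a single command first.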

Meanwhile, a workaround is to increase this timeout (e.g. from 1s to 30s) and also pass `--timeout 1s` to `frostfs-cli`, so that the timeout inside `healthcheck.sh` is always shorter than the one given to docker compose.
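
The rule in the workaround (inner CLI timeout strictly below the outer docker compose timeout) can be checked mechanically. Here is a rough sketch. The helper names `to_seconds` and `timeouts_ok` are hypothetical, and only plain numbers and the `s` and `m` suffixes are handled.

```shell
# Convert a duration such as "5s" or "2m" to whole seconds.
# Only a bare number or the s / m suffixes are supported in this sketch.
to_seconds() {
    case "$1" in
        *m) echo $(( ${1%m} * 60 )) ;;
        *s) echo "${1%s}" ;;
        *)  echo "$1" ;;
    esac
}

# Succeed only when the inner (frostfs-cli) timeout is strictly shorter
# than the outer (docker compose healthcheck) timeout.
timeouts_ok() {
    [ "$(to_seconds "$1")" -lt "$(to_seconds "$2")" ]
}
```

For example, `timeouts_ok 1s 30s` succeeds, while `timeouts_ok 15s 1s` (the default CLI timeout against the original compose value) fails.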

fyrchik referenced this issue from TrueCloudLab/frostfs-node

2024-06-27 11:56:06 +00:00

Add --quiet flag to the healthcheck command #1209

fyrchik commented

2024-06-27 11:56:19 +00:00

Owner

Depends on TrueCloudLab/frostfs-node#1209 (https://git.frostfs.info/TrueCloudLab/frostfs-node/issues/1209)

fyrchik commented

2024-06-27 11:58:38 +00:00

Owner

@Phaseant given that 1s was not enough on your machine, you might want to have `--timeout 5s` in the `healthcheck.sh` and e.g. `10s` in `docker-compose.yml`

Phaseant commented

2024-06-27 11:58:58 +00:00

Author

ok, great! Awaiting your solution :)

fyrchik added the blocked label 2024-06-27 12:36:30 +00:00

achuprov referenced this issue from a pull request that will close it,

2024-07-08 07:41:26 +00:00

Add support for the -q flag in healthcheck #72

fyrchik referenced this issue from a commit

2024-08-19 06:25:05 +00:00

[#69] service/storage: Add support -q flag in healthcheck

fyrchik referenced this issue from a commit

2024-08-19 06:25:05 +00:00

[#69] service/ir: Add support -q flag in healthcheck

fyrchik closed this issue

2024-08-19 06:25:10 +00:00

fyrchik removed the blocked label 2024-08-19 06:27:55 +00:00

fyrchik commented

2024-08-19 06:28:25 +00:00

Owner

@Phaseant please check the master branch. The healthcheck has been reworked and should no longer leave zombies.

Phaseant commented

2024-08-19 08:09:25 +00:00

Author

Ok, thanks!

Phaseant commented

2024-08-19 09:19:58 +00:00

Author

@fyrchik

{"Status":"unhealthy","FailingStreak":13,"Log":[{"Start":"2024-08-19T09:13:43.589619252Z","End":"2024-08-19T09:13:44.790881666Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"},{"Start":"2024-08-19T09:13:46.791441579Z","End":"2024-08-19T09:13:48.079515011Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"},{"Start":"2024-08-19T09:13:50.080090588Z","End":"2024-08-19T09:13:51.297141629Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"},{"Start":"2024-08-19T09:13:53.298563888Z","End":"2024-08-19T09:13:54.497956436Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"},{"Start":"2024-08-19T09:13:56.498446793Z","End":"2024-08-19T09:13:57.691365416Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"}]}

Of 3 different environments, one fails with the 1s timeout. After increasing the timeout to 30s everything works great. If you need logs to debug, please tell me which ones to collect. (All metrics are disabled on this env.)
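
The JSON above has the shape of the `.State.Health` object reported by `docker inspect` (an assumption, since the comment does not say how it was captured). Here is a small sketch for counting how many recorded probes hit the timeout. `count_timeouts` is a made-up helper and `CONTAINER` is a placeholder.

```shell
# Count probes in a health JSON document (read from stdin)
# whose output reports that the Docker healthcheck timeout was exceeded.
count_timeouts() {
    grep -o 'Health check exceeded timeout' | wc -l | tr -d ' '
}

# Possible usage against a running container:
#   docker inspect --format '{{json .State.Health}}' CONTAINER | count_timeouts
```

Docker keeps only the most recent probe results in `Log`, so the count covers that window, while `FailingStreak` carries the running total of consecutive failures.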

fyrchik commented

2024-08-19 11:04:16 +00:00

Owner

That is OK. We didn't make it faster, but now, when the process is killed on timeout, no zombie should remain.
We cannot tune timeouts for every environment; the defaults in this repo are aimed at developer machines.
