Dev env produces a lot of zombie processes #69

Closed
opened 2024-06-27 07:22:30 +00:00 by Phaseant · 11 comments

Dev env produces a lot of zombie frostfs-cli and grep processes

Expected Behavior

No zombie processes should be created

Current Behavior

A lot of zombie (defunct) processes accumulate, waiting to be reaped

Possible Solution

Steps to Reproduce (for bugs)

  1. `make up`
  2. list zombie processes: `ps -ef | grep defunct`
  3. count them: `ps -ef | grep defunct | wc -l`
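A small shell sketch for the same check (assumptions: Linux with a POSIX-ish `ps`; the `[d]efunct` pattern is just a trick to keep `grep` from counting its own command line):

```sh
# Count zombie (defunct) processes without counting the grep itself.
ps -ef | grep -c '[d]efunct'

# Or filter on the process state column directly (state "Z" = zombie).
ps -eo stat | grep -c '^Z'
```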

Context

This issue affects the performance of the whole server: average CPU usage is at 100% because of it

Phaseant added the bug label 2024-06-27 07:22:30 +00:00
Author

The parents of these processes are `frostfs-node --config /etc/frostfs/storage/config.yml`
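A quick way to double-check which parent is accumulating the zombies (a sketch; assumes Linux and a `ps` that supports `-eo`):

```sh
# Print the command line of every distinct parent of a zombie process.
for ppid in $(ps -eo stat,ppid | awk '$1 ~ /^Z/ {print $2}' | sort -u); do
  ps -o pid,cmd -p "$ppid"
done
```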

Owner

It could be the healthcheck: https://git.frostfs.info/TrueCloudLab/frostfs-dev-env/src/commit/2b6122192a4c26a84e6f8faa5533ab52fb2a88ef/services/storage/healthcheck.sh#L3

`/frostfs-cli control healthcheck -c /cli-cfg.yml \`
fyrchik self-assigned this 2024-06-27 10:51:11 +00:00
Author

So is it expected behaviour?

Owner

No, it is not; I was just trying to figure out where these processes are spawned.

It likely happens when the healthcheck fails to finish within the required timeout.
Here the timeout is 1s: https://git.frostfs.info/TrueCloudLab/frostfs-dev-env/src/commit/2b6122192a4c26a84e6f8faa5533ab52fb2a88ef/services/storage/docker-compose.yml#L41
And the default `frostfs-cli` timeout is 15 seconds.

So what happens is that we execute a shell, which spawns subprocesses; then the parent is killed (because of the timeout) and all the children are left behind.
The solution is to not spawn any subprocesses. For this to work we need support in `frostfs-cli` directly; I will create a task shortly.

Meanwhile, a workaround could be to increase this timeout (e.g. from 1s to 30s) while also passing a `--timeout 1s` argument to `frostfs-cli`, so that the internal `healthcheck.sh` timeout is always less than the timeout given to docker compose.
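A minimal sketch of that workaround, assuming the dev-env layout from the links above (the extra `frostfs-cli` arguments in the real `healthcheck.sh` are elided here; `--timeout` is the CLI's operation-timeout flag, and `30s` for the compose-level healthcheck timeout is just an example value):

```sh
# healthcheck.sh (sketch): make the CLI give up well before docker compose
# kills the healthcheck, so the script can exit cleanly and reap its children.
/frostfs-cli control healthcheck -c /cli-cfg.yml --timeout 1s  # ...other args as in the repo

# services/storage/docker-compose.yml would then use a larger healthcheck
# timeout, e.g.:
#   healthcheck:
#     test: ["CMD-SHELL", "/healthcheck.sh"]
#     timeout: 30s
```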

Owner
Depends on https://git.frostfs.info/TrueCloudLab/frostfs-node/issues/1209
Owner

@Phaseant given that 1s was not enough on your machine, you might want to have `--timeout 5s` in the `healthcheck.sh` and e.g. `10s` in `docker-compose.yml`

Author

ok, great! Awaiting your solution :)

fyrchik added the blocked label 2024-06-27 12:36:30 +00:00
fyrchik removed the blocked label 2024-08-19 06:27:55 +00:00
Owner

@Phaseant please check the master branch. The healthcheck has been reworked and should not leave zombies now.

Author

Ok, thanks!

Author

@fyrchik

{"Status":"unhealthy","FailingStreak":13,"Log":[{"Start":"2024-08-19T09:13:43.589619252Z","End":"2024-08-19T09:13:44.790881666Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"},{"Start":"2024-08-19T09:13:46.791441579Z","End":"2024-08-19T09:13:48.079515011Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"},{"Start":"2024-08-19T09:13:50.080090588Z","End":"2024-08-19T09:13:51.297141629Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"},{"Start":"2024-08-19T09:13:53.298563888Z","End":"2024-08-19T09:13:54.497956436Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"},{"Start":"2024-08-19T09:13:56.498446793Z","End":"2024-08-19T09:13:57.691365416Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"}]}

On 3 different environments, one of them fails with the 1s timeout. I increased the timeout to 30s and everything works great. If you need logs to debug, please write which logs to collect. (All metrics are disabled on this env.)
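For reference, the health-check history above can be pulled straight from docker (a sketch; the container name depends on the dev-env and is shown here as a placeholder):

```sh
# Dump the container's health-check state and recent probe results as JSON.
docker inspect --format '{{json .State.Health}}' <storage-container-name>
```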

Owner

That is ok; we didn't make it faster, but now when the process is killed on timeout, no zombies should remain.
We cannot choose timeouts that fit every environment; the defaults in this repo are aimed more at developer machines.
