Dev env produces a lot of zombie processes #69
Reference: TrueCloudLab/frostfs-dev-env#69
Dev env produces a lot of zombie frostfs-cli and grep processes
Expected Behavior
No zombie processes should be created
Current Behavior
A lot of zombie processes accumulate, waiting to be reaped
Possible Solution
Steps to Reproduce (for bugs)
make up
ps -ef | grep defunct
ps -ef | grep defunct | wc -l
Context
This issue affects the performance of the whole server: average CPU usage is 100% because of it.
The parents of these processes are
frostfs-node --config /etc/frostfs/storage/config.yml
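To see which parent is holding the zombies without adding extra grep processes to the list, something along these lines may help. It is only a sketch: it assumes a procps-style ps that accepts `-o stat= -o ppid=`, and the function name zombies_by_parent is made up for illustration.

```shell
# Sketch: count zombie (state Z) processes grouped by parent PID.
# Assumes a procps-style ps; zombies_by_parent is a hypothetical helper name.
zombies_by_parent() {
    # stat= and ppid= print the process state and parent PID without headers
    ps -e -o stat= -o ppid= |
        awk '$1 ~ /^Z/ { count[$2]++ }
             END { for (p in count) print count[p], p }'
}

# Each output line is "<number of zombies> <parent PID>".
zombies_by_parent
```

The parent PID can then be inspected with `ps -o pid,args -p <PID>` to confirm which command it is.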
It could be a healthcheck
/frostfs-cli control healthcheck -c /cli-cfg.yml \
So is it expected behaviour?
No, it is not, I was just trying to find where these processes were spawned.
It likely happens when the healthcheck hasn't been able to finish within the required timeout.
Here we have 1s
timeout: 1s
And the default frostfs-cli timeout is 15 seconds.
So what happens is that we execute a shell, which spawns subprocesses, then the parent is killed (because of the timeout), and all the children are retained.
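The effect described above can be reproduced outside of the dev env. The sketch below is not the real healthcheck.sh; it uses `sleep 30` as a stand-in for a slow frostfs-cli call, and the name orphan_demo is hypothetical. A shell starts a child, the shell is killed (as a timeout would do), and the child keeps running. Inside a container such an orphan is reparented, and it stays a zombie after it exits if the new parent never reaps it.

```shell
# Sketch: killing a parent shell does not kill the children it spawned.
# orphan_demo is a hypothetical name; sleep 30 stands in for a slow child.
orphan_demo() {
    pidfile=$(mktemp)
    # The "healthcheck" shell: start a child, record its PID, then wait for it.
    sh -c 'sleep 30 & echo $! > "$0"; wait' "$pidfile" >/dev/null 2>&1 &
    parent=$!
    sleep 1                          # give the shell time to spawn the child
    kill -KILL "$parent"             # simulate the timeout killing the parent
    wait "$parent" 2>/dev/null || true
    child=$(cat "$pidfile")
    rm -f "$pidfile"
    if kill -0 "$child" 2>/dev/null; then
        echo "child survived"
        kill "$child"                # clean up the orphan
    else
        echo "child gone"
    fi
}

orphan_demo
```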
The solution is to not spawn any subprocesses. For this to work we need to support this in frostfs-cli directly; I will create a task shortly.
Meanwhile, the solution could be increasing this timeout (e.g. from 1s to 30s), while also providing the --timeout 1s argument to frostfs-cli (so that the internal healthcheck.sh script timeout will always be less than the timeout provided to docker compose).
--quiet flag to the healthcheck command #1209
Depends on TrueCloudLab/frostfs-node#1209
@Phaseant given that 1s was not enough on your machine, you might want to have --timeout 5s in healthcheck.sh and e.g. 10s in docker-compose.yml.
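As a rough illustration of the two-level timeout, a healthcheck script could look something like the sketch below. This is an assumption, not the actual healthcheck.sh from the repo: the FROSTFS_CLI and HEALTHCHECK_TIMEOUT variables and the run_healthcheck function are hypothetical, and any arguments that follow the truncated command quoted earlier in this thread are left out. Using `exec` means the shell is replaced by frostfs-cli, so there is no child process left to orphan if the outer timeout fires.

```shell
# Hypothetical healthcheck.sh sketch; names and defaults are assumptions.
FROSTFS_CLI="${FROSTFS_CLI:-/frostfs-cli}"        # binary path as quoted in this thread
HEALTHCHECK_TIMEOUT="${HEALTHCHECK_TIMEOUT:-5s}"  # keep below the compose timeout (e.g. 10s)

run_healthcheck() {
    # exec replaces the current shell with frostfs-cli instead of forking a child
    exec "$FROSTFS_CLI" control healthcheck -c /cli-cfg.yml --timeout "$HEALTHCHECK_TIMEOUT"
}
```

A real script would end with a call to run_healthcheck; the inner 5s then expires before the 10s limit in docker-compose.yml, so frostfs-cli exits on its own rather than being killed.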
ok, great! Awaiting your solution :)
healthcheck #72
@Phaseant please check the master branch. The healthcheck has been reworked and should not leave zombies now.
Ok, thanks!
@fyrchik
{"Status":"unhealthy","FailingStreak":13,"Log":[{"Start":"2024-08-19T09:13:43.589619252Z","End":"2024-08-19T09:13:44.790881666Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"},{"Start":"2024-08-19T09:13:46.791441579Z","End":"2024-08-19T09:13:48.079515011Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"},{"Start":"2024-08-19T09:13:50.080090588Z","End":"2024-08-19T09:13:51.297141629Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"},{"Start":"2024-08-19T09:13:53.298563888Z","End":"2024-08-19T09:13:54.497956436Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"},{"Start":"2024-08-19T09:13:56.498446793Z","End":"2024-08-19T09:13:57.691365416Z","ExitCode":-1,"Output":"Health check exceeded timeout (1s)"}]}
Out of 3 different environments, one fails with the 1s timeout. I increased the timeout to 30s and everything works great. If you need logs to debug, please tell me which logs to collect. (All metrics are disabled on this env.)
That is OK. We didn't make it faster, but now, when the process is killed on timeout, no zombie should remain.
We cannot tune timeouts for every environment; the defaults in this repo are aimed at developer machines.