Dev env produces a lot of zombie processes #69
Reference: TrueCloudLab/frostfs-dev-env#69
Dev env produces a lot of zombie frostfs-cli and grep processes
Expected Behavior
No zombie processes should be created
Current Behavior
A lot of zombie (defunct) processes accumulate, waiting to be reaped
Possible Solution
Steps to Reproduce (for bugs)
make up
ps -ef | grep defunct
ps -ef | grep defunct | wc -l
Context
This issue degrades the performance of the whole server: average CPU usage is 100% because of it.
The parents of these processes are
frostfs-node --config /etc/frostfs/storage/config.yml
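One way to confirm which parent owns the defunct processes is to group them by parent PID. This is a minimal sketch, assuming a procps-style ps that accepts -eo with these column names. The function names are illustrative and not part of the dev env.

```shell
# Count zombie processes per parent PID.
# Reads lines of "PID PPID STAT COMMAND" on stdin and prints "PPID COUNT".
count_zombies_by_parent() {
    awk '$3 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }'
}

# Feed the live process table into the counter (assumes procps-style ps).
zombies_by_parent() {
    ps -eo pid=,ppid=,stat=,comm= | count_zombies_by_parent
}
```

Calling zombies_by_parent prints one line per parent. The command line of a reported parent can then be inspected with ps -p PPID -o args=.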
It could be a healthcheck
2b6122192a/services/storage/healthcheck.sh (L3)
So is this the expected behaviour?
No, it is not. I was just trying to find where these processes are spawned.
It likely happens when the healthcheck is unable to complete within the required timeout.
Here we have 1s
2b6122192a/services/storage/docker-compose.yml (L41)
And the default frostfs-cli timeout is 15 seconds.
So what happens is that we execute a shell, which spawns subprocesses, then the parent is killed (because of the timeout), and all the children are retained.
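The mechanism described here can be reproduced outside Docker. The sketch below is illustrative only: demo_orphan is a hypothetical helper, and sleep stands in for frostfs-cli. It kills a parent shell while its child is still running and then checks whether the child is still alive. Whether such an orphan later shows up as defunct depends on whether the process that adopts it reaps its children, which is an assumption about the container setup.

```shell
# Kill a parent shell mid-run and report whether its child outlives it.
demo_orphan() {
    pidfile=$(mktemp)
    # The inner shell starts a child, records the child's PID in the file
    # passed as $0, and waits for it, much like a script running a command.
    sh -c 'sleep 3 & echo $! > "$0"; wait' "$pidfile" >/dev/null 2>&1 &
    parent=$!
    sleep 1
    # Simulate the healthcheck timeout: only the parent shell is killed.
    kill -KILL "$parent"
    wait "$parent" 2>/dev/null || true
    child=$(cat "$pidfile")
    rm -f "$pidfile"
    # kill -0 sends no signal; it only tests that the process still exists.
    if kill -0 "$child" 2>/dev/null; then
        echo "child still running"
    else
        echo "child gone"
    fi
}
```

The child is not signalled when its parent dies, so it keeps running and is re-parented. If the adopting process never calls wait on it, it stays in the process table as a zombie once it exits.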
The solution is to not spawn any subprocesses. For this to work we need to support this in frostfs-cli directly, I will create a task shortly.
Meanwhile, the solution could be increasing this timeout (e.g. from 1s to 30s), while also providing the --timeout 1s argument to frostfs-cli (so that the internal healthcheck.sh script timeout will always be less than the timeout provided to docker compose).
Depends on TrueCloudLab/frostfs-node#1209
@Phaseant given that 1s was not enough on your machine, you might want to have --timeout 5s in healthcheck.sh and e.g. 10s in docker-compose.yml.
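A rough sketch of what such a healthcheck.sh could look like is shown below. Only the --timeout flag and its value come from this thread. The control healthcheck subcommand, the "Health status: READY" output string, and the names FROSTFS_CLI, CLI_TIMEOUT and check_health are assumptions made for illustration, so the real script may differ.

```shell
#!/bin/sh
# Hypothetical healthcheck sketch; the subcommand and output string are assumptions.

# Binary to call; overridable so the function can be exercised without a node.
: "${FROSTFS_CLI:=frostfs-cli}"
# Internal timeout; should stay below the docker compose healthcheck timeout (e.g. 10s).
: "${CLI_TIMEOUT:=5s}"

check_health() {
    # Capture the output rather than piping into grep, so the CLI is the
    # only extra process that could be left behind.
    out=$("$FROSTFS_CLI" control healthcheck --timeout "$CLI_TIMEOUT" "$@") || return 1
    case "$out" in
        *"Health status: READY"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

The script would end with a call such as check_health followed by whatever endpoint and config arguments the existing script already passes. Because the CLI gives up after 5s, it should exit on its own before docker compose kills the shell at 10s.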
ok, great! Awaiting your solution :)