Restarting all services on random nodes leads to a restart loop of the frostfs-ir service #241
Reference: TrueCloudLab/frostfs-node#241
Steps to Reproduce (for bugs)

Steps to reproduce:
1. Run the failover test suite testsuites.failover.test_failover_services.TestFailoverServices (test test_frostfs_node); a sample invocation is sketched after this section.
2. See the attached test log and allure report.

I saw error:

Versions

Your Environment

Virtual
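For reference, a run of the failing test might look like the following. This is a sketch only: the module path comes from the report above, while the pytest selector and allure flag are assumptions about the test setup, not taken from this issue.

```sh
# Hypothetical invocation of the failover suite; --alluredir is the
# standard allure-pytest flag for collecting the report referenced above.
pytest testsuites/failover/test_failover_services.py::TestFailoverServices \
    -k test_frostfs_node \
    --alluredir=./allure-results
```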
frostfs-ir restarts because it cannot connect to neo-go.
Not a bug.
While it's generally not ideal for a critical service like IR to die, in certain cases it can be considered acceptable. IR is designed to be stateless: it doesn't store data or maintain sessions, so it can be restarted without causing data loss or system instability. Given that, it may be safe to let systemd or another supervisor process automatically restart the service after it crashes. However, it's important to configure the supervisor appropriately, setting thresholds for the number of restart attempts and delays between them to avoid resource exhaustion or other issues.
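As a sketch of what that supervisor configuration could look like, the systemd drop-in below caps restart attempts and spaces them out. The unit name matches this issue, but the values and the Restart policy are illustrative assumptions, not the project's packaged defaults:

```ini
# /etc/systemd/system/frostfs-ir.service.d/restart.conf
# Hypothetical drop-in; values are illustrative, not shipped defaults.

[Unit]
# Stop retrying if the service fails 5 times within a 5-minute window.
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
# Restart only after failures, waiting 10 seconds between attempts.
Restart=on-failure
RestartSec=10s
```

After editing the drop-in, `systemctl daemon-reload` followed by `systemctl restart frostfs-ir` applies it; once the start limit is exhausted, the unit stays in the failed state instead of looping forever.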
However, while automating the restart of a service is convenient, it's still essential to investigate the root cause of the crashes — in this case, the lost connection to neo-go. Monitoring and logging provide visibility into the service's health and allow timely resolution of issues that make it fail repeatedly. So even where it is safe for a service to die and restart automatically, proactive monitoring and troubleshooting remain critical for a reliable and robust system.
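On a node, the restart loop itself can be observed with standard systemd tooling; these are generic systemd commands, not project-specific ones:

```sh
# Show the unit's current state and its most recent log lines.
systemctl status frostfs-ir

# Follow the unit's log to see why each restart happens
# (e.g., the neo-go connection errors mentioned above).
journalctl -u frostfs-ir -f
```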