Gate doesn't get tree from other node (round robin) #110
Reference: TrueCloudLab/frostfs-s3-gw#110
Expected Behavior
The gateway should try to fetch tree info from the endpoints listed in the config, and return an error only after all configured endpoints have been tried.
Current Behavior
The gateway tries to get tree info from a single endpoint (not the first one in the config list) and returns an error.
Steps to Reproduce (for bugs)
1. Start the s3 k6 load on nodes 1-3
2. Reboot node 4
3. See errors in the k6 log:
At the same time, the node log contains this error:
S3 gateway config for the tree service:
Context
This affects failover tests that reboot a node.
Regression
Yes
Version
Your Environment
HW
4 nodes
@anikeev-yadro Could you enable the `debug` log level? We should see endpoint-switching info when a request fails.

I got some logs related to that. By the way, it would be nice to pass the request ID down to every function that produces log records, so related entries would be easier to find.
This test may pass sometimes.
I see the following records on a failed run:
So it seems that the GW does iterate over endpoints, but gets `tree not found`. Imagine that the container and the tree are stored only on the .48 (localhost) and .51 nodes. In this case, if the storage node does not forward requests, we get this behaviour.
We see two issues here:
1. In some cases the S3 gateway expects `tree not found` errors and should not have to iterate over all available endpoints. In this issue, the S3 gateway iterates over all nodes and gets a `connection refused` error on the last endpoint, which breaks request execution. We fix it there.
2. However, the logs show that the S3 gateway may start before the storage node, which eliminates the localhost endpoint from the pool of tree service addresses. For now we want to update the unit file to enforce a strict startup order, but we should also consider something similar to the pool healthchecks for tree service endpoints.
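A healthcheck for tree service endpoints could look roughly like the sketch below (an illustration of the idea only; the type names and the probe function are hypothetical, and real code would probe with a dial/RPC on a ticker):

```go
package main

import (
	"fmt"
	"sync"
)

// healthChecker probes tree-service endpoints and exposes only the
// reachable ones for round-robin selection, so a dead endpoint (e.g.
// localhost before the storage node is up) can rejoin the pool later.
type healthChecker struct {
	mu      sync.Mutex
	alive   map[string]bool
	targets []string
	probe   func(string) bool // real code: TCP dial or ping RPC with timeout
}

// runOnce probes every target and records its current state.
func (h *healthChecker) runOnce() {
	for _, ep := range h.targets {
		ok := h.probe(ep)
		h.mu.Lock()
		h.alive[ep] = ok
		h.mu.Unlock()
	}
}

// healthy returns the endpoints that passed the last probe, in config order.
func (h *healthChecker) healthy() []string {
	h.mu.Lock()
	defer h.mu.Unlock()
	var out []string
	for _, ep := range h.targets {
		if h.alive[ep] {
			out = append(out, ep)
		}
	}
	return out
}

func main() {
	hc := &healthChecker{
		alive:   map[string]bool{},
		targets: []string{"127.0.0.1:8091", "192.0.2.51:8091"},
		// Fake probe: pretend only the remote endpoint is currently up,
		// as when the gateway starts before the local storage node.
		probe: func(ep string) bool { return ep != "127.0.0.1:8091" },
	}
	hc.runOnce()
	fmt.Println(hc.healthy())
}
```

Re-running the probes periodically would also restore the localhost endpoint once the storage node opens its socket, without restarting the gateway.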
I think that updating the unit file for a strict startup order won't help us, because the storage node reports the 'Active' state to systemd before it opens the localhost socket.
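That matches how systemd readiness works: with the default `Type=simple`, a unit is considered active as soon as the process is started, not when its sockets are open. Ordering on readiness would require something like the fragment below (a hypothetical unit sketch; the binary path and config path are assumptions, and the node would have to call `sd_notify(READY=1)` only after binding its endpoints):

```ini
# Hypothetical frostfs-node unit fragment.
[Service]
# Type=notify makes systemd wait for an explicit READY=1 signal
# instead of marking the unit active right after fork/exec.
Type=notify
ExecStart=/usr/bin/frostfs-node --config /etc/frostfs/node.yaml
```

Without that readiness signal in the storage node itself, `After=`/`Requires=` ordering in the gateway's unit would still race against the socket being opened.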
@fyrchik please confirm.