Morph client to neo-go died and couldn't reconnect #260

Closed
opened 2023-04-17 14:42:34 +00:00 by mmalygina · 1 comment
Member

Presumably the network blinked and the morph client's connection to neo-go died; the client could not reconnect and hung in a loop.

Expected Behavior

If the network blinks, the morph client should reconnect to neo-go.

Current Behavior

The morph client's connection to neo-go died because the network blinked, and the client could not reconnect; it hung in a loop.

Log extract from node az:

Apr 16 12:23:43 az frostfs-node[2598392]: 2023-04-16T12:23:43.620Z        error        frostfs-node/object.go:65        could not get max object size value        {"error": "(*netmap.Client) could not get epoch number: could not perform test invocation (config): connection lost before registering response channel"}
...

After that, the az logs show non-stop errors "could not perform test invocation (get): connection lost before registering response channel":

Apr 16 12:23:49 az frostfs-node[2598392]: 2023-04-16T12:23:49.799Z        error        policer/check.go:76        could not get container        {"component": "Object Policer", "cid": "3ssH8Qtieg8F9nxb5UzHKGtYhTidrP9iu4nBXDewjJ5B", "error": "could not perform test invocation (get): connection lost before registering response channel"}
Apr 16 12:23:49 az frostfs-node[2598392]: 2023-04-16T12:23:49.799Z        error        policer/check.go:76        could not get container        {"component": "Object Policer", "cid": "3ssH8Qtieg8F9nxb5UzHKGtYhTidrP9iu4nBXDewjJ5B", "error": "could not perform test invocation (get): connection lost before registering response channel"}
Apr 16 12:23:49 az frostfs-node[2598392]: 2023-04-16T12:23:49.799Z        error        policer/check.go:76        could not get container        {"component": "Object Policer", "cid": "3ssH8Qtieg8F9nxb5UzHKGtYhTidrP9iu4nBXDewjJ5B", "error": "could not perform test invocation (get): connection lost before registering response channel"}

Steps to Reproduce (for bugs)

An S3 write test with 512MB objects was run for 2 days 22 hours, from 14.04.2023 12:00 UTC to 17.04.2023 12:00.

The following k6 command was used:

nohup /home/service/k6 run -e WRITERS=50  \
-e DURATION=432000 -e WRITE_OBJ_SIZE=512000 -e READERS=0 \
-e DELETERS=0 -e S3_ENDPOINTS=$S3ENDPOINTS \
-e WRITE_RATE=3 \
-e PREGEN_JSON=/home/service/s3_4kb.json /home/service/scenarios/constant-s3.js 2>&1 | tee 512MB-write-50th-3ops-15m-`date +%m-%d-%Y_%H-%M`-t$1.log &

At 12:23:43 the error "could not get epoch number: could not perform test invocation (config): connection lost before registering response channel" occurred on the az node.

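See the S3 errors in Grafana: https://c.yadro.com/download/attachments/871221493/Screenshot%202023-04-17%20at%2013.42.32.png?version=1&modificationDate=1681736590611&api=v2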

Regression

No

Your Environment

MetalCore cluster: 4 nodes, 12 HDDs per node
Cluster prefilled up to 25%
Preset: 100 buckets

root@az:~# frostfs-node --version
FrostFS Storage node
Version: v0.0.1-384-gb689027d 
GoVersion: go1.18.4
root@az:~# cat /etc/to-release
VERSION="v1.2.0-nb-20230410.1"
mmalygina added the triage label 2023-04-17 14:42:34 +00:00
fyrchik added this to the v0.37.0 milestone 2023-04-17 16:52:30 +00:00
fyrchik added the P1 label 2023-04-17 16:54:17 +00:00
carpawell was assigned by fyrchik 2023-04-17 17:21:58 +00:00
snegurochka added the bug label 2023-05-03 17:14:42 +00:00
Owner

Supposedly closed via #322, let's test and reopen if needed.

Reference: TrueCloudLab/frostfs-node#260