Object not found in tree service after failover test with reboot one node #577

Closed
opened 2023-08-08 13:45:33 +00:00 by anikeev-yadro · 3 comments
Member

Expected Behavior

Rebooting one node under load should not lead to load errors.

Current Behavior

`Object not found` errors in the tree service after a failover test with a reboot of one node.

Steps to Reproduce (for bugs)

1. Init S3 creds:

```
HOST: 10.78.69.102
COMMAND:
 frostfs-s3-authmate  issue-secret --wallet '/tmp/1b88f9f1-b17e-4329-8804-697b67f11e45.json' --peer '10.78.70.118:8080' --gate-public-key '03f296058dc7ec43d1890d0ee224ac3b7efe919e273e566ac768a061909b485578' --gate-public-key '03f804d1e39a16e9c271c11d0655d0d4775e37563207c3deeae68b6416f7297186' --gate-public-key '03ba88a76e2960550a660f99a207e4672b97682415724798cf118e27dfedd21f12' --container-placement-policy 'REP 2 IN X CBF 2 SELECT 2 FROM * AS X' --container-policy '/etc/k6/scenarios/files/policy.json'
RC:
 0
STDOUT:

 Enter password for /tmp/1b88f9f1-b17e-4329-8804-697b67f11e45.json > 
 {
   "initial_access_key_id": "DKJTE9qsTC5FERzCeFLzgyfS4hnTUHXgbVsxqDr2qBW20DRz7i558oudmunBqsjaVERCXXmyaedvQYagFUoRnyKGs",
   "access_key_id": "DKJTE9qsTC5FERzCeFLzgyfS4hnTUHXgbVsxqDr2qBW20DRz7i558oudmunBqsjaVERCXXmyaedvQYagFUoRnyKGs",
   "secret_access_key": "afdc7a3e8a780961a14b40098b86ac587de7e64cec34ba7131e33e371b9a7e5c",
   "owner_private_key": "08b6a5955f8267bdc61c3a3ef3896017569a1f49f021633a992cc17e2a0c7db6",
   "wallet_public_key": "031c3f46aedc2ccab65d5e991e38eb4cd07de078478ee8fc6b0486b1b61dfe78f3",
   "container_id": "DKJTE9qsTC5FERzCeFLzgyfS4hnTUHXgbVsxqDr2qBW2"
 }

STDERR:

Start / End / Elapsed	 15:02:47.421259 / 15:02:50.458108 / 0:00:03.036849
```
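For reference, the issued `access_key_id`/`secret_access_key` pair can be used from any S3 client to inspect the buckets manually. A minimal sketch with the AWS CLI (the profile name is made up; the endpoint is one of the gateways from step 3):

```bash
# Hypothetical profile "frostfs-test"; substitute the keys printed by authmate above.
aws configure set aws_access_key_id     '<access_key_id from authmate>'     --profile frostfs-test
aws configure set aws_secret_access_key '<secret_access_key from authmate>' --profile frostfs-test

# List buckets through one of the S3 gateways (self-signed TLS in this setup).
aws --profile frostfs-test --endpoint-url https://10.78.70.119 --no-verify-ssl s3 ls
```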
2. Create k6 preset (see attachment).
3. Start load on 3 nodes:

```
/etc/k6/k6 run -e NO_VERIFY_SSL='True' -e DURATION='600' -e WRITE_OBJ_SIZE='1024' -e REGISTRY_FILE='/var/log/autotests/s3_3800a4bd-5b9e-4f92-950c-b47b5632ba6a_0_registry.bolt' -e K6_SETUP_TIMEOUT='5s' -e WRITERS='50' -e READERS='50' -e DELETERS='0' -e PREGEN_JSON='/var/log/autotests/s3_3800a4bd-5b9e-4f92-950c-b47b5632ba6a_0_prepare.json' -e S3_ENDPOINTS='https://10.78.70.119,https://10.78.71.118,https://10.78.70.118,https://10.78.71.119,https://10.78.70.120,https://10.78.71.120' -e SUMMARY_JSON='/var/log/autotests/s3_3800a4bd-5b9e-4f92-950c-b47b5632ba6a_0_s3_summary.json' /etc/k6/scenarios/s3.js
```

4. Reboot the 4th node:

```
COMMAND: ipmitool -I lanplus -H 10.78.68.121 -U admin -P admin chassis power reset
RETCODE: 0
STDOUT:
Chassis Power Control: Reset
STDERR:
Start / End / Elapsed	 15:03:44.659856 / 15:03:44.880841 / 0:00:00.220985
```
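Not part of the original run, but the power state can be polled with the same BMC credentials to confirm the chassis actually cycled and came back up:

```bash
# Should report "Chassis Power is on" once the node has powered back on.
ipmitool -I lanplus -H 10.78.68.121 -U admin -P admin chassis power status
```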

5. Wait for the 4th node to come back online (one way to check is sketched below) and verify the k6 out and err logs (see attachments).
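A sketch of such a check, assuming `frostfs-cli` and the storage wallet are available on a cluster node as in step 8:

```bash
# The rebooted node should be listed as ONLINE again in the netmap snapshot.
sudo frostfs-cli --rpc-endpoint 10.78.70.119:8080 \
  --wallet /etc/frostfs/storage/wallet.json netmap snapshot
```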
6. Observe errors, for example:

```
ERRO[06:16:36] operation error S3: GetObject, https response error StatusCode: 404, RequestID: e685480d-b07f-4812-8828-c8ea47fc3f51, HostID: 78737b33-ff8c-4fb3-a977-7ddff1980dff, NoSuchKey:   bucket=1fa76e67-7f67-43fe-bab2-ab9f659907df endpoint="https://10.78.70.119" key=f2e998e1-1036-47ed-a136-bfc85d68062e
```
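To gauge how widespread these failures are, the k6 error log can be filtered (the log file name below is illustrative; use the attached err log):

```bash
# Count NoSuchKey (404) read failures and list the most affected bucket/key pairs.
grep -c 'NoSuchKey' k6_err.log
grep -o 'bucket=[^ ]*.*key=[^ ]*' k6_err.log | sort | uniq -c | sort -rn | head
```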

7. Find the container ID (CID):

```
root@buky:~# curl -k --head https://10.78.70.119/1fa76e67-7f67-43fe-bab2-ab9f659907df
HTTP/2 200
server: nginx
date: Tue, 08 Aug 2023 08:31:48 GMT
content-length: 0
accept-ranges: bytes
x-amz-bucket-region: node-off
x-amz-request-id: 6b2f7c37-c221-40e4-ae64-02bcebbd651e
x-container-id: AvpLSGhmYcSywFdRgV5URn9DPnc81RfXrFsnHReqUHTn
x-container-name: 1fa76e67-7f67-43fe-bab2-ab9f659907df
x-container-zone: container
x-owner-id: NgDPLUdD6rkRWRk16jT2Twf6gVhr25tmTh
strict-transport-security: max-age=63072000
```
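Since the bucket HEAD above succeeds anonymously, the missing key can also be probed through every gateway to see whether the 404 is endpoint-specific (a sketch; the endpoints are taken from the k6 command in step 3, and a 403 instead of 404/200 would simply mean anonymous object reads are not allowed):

```bash
# HEAD the missing key (bucket and key from step 6) on each S3 gateway.
for ep in 10.78.70.118 10.78.70.119 10.78.70.120 10.78.71.118 10.78.71.119 10.78.71.120; do
  printf '%s: ' "$ep"
  curl -ksI -o /dev/null -w '%{http_code}\n' \
    "https://$ep/1fa76e67-7f67-43fe-bab2-ab9f659907df/f2e998e1-1036-47ed-a136-bfc85d68062e"
done
```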

8. See that all container nodes are online:

```
root@buky:~# sudo frostfs-cli --rpc-endpoint 10.78.70.119:8080 --wallet /etc/frostfs/storage/wallet.json container nodes --cid AvpLSGhmYcSywFdRgV5URn9DPnc81RfXrFsnHReqUHTn
Enter password >
Descriptor #1, REP 2:
        Node 1: 0248fa73b84fe7df7bf02a3d2a8811dbec388057cd5594aec8f7dc1670d7aa90ca ONLINE /ip4/192.168.201.20/tcp/8080 /ip4/192.168.201.120/tcp/8080
                Continent: Europe
                Country: Finland
                CountryCode: FI
                ExternalAddr: /ip4/10.78.70.120/tcp/8080,/ip4/10.78.71.120/tcp/8080
                Location: Helsingfors (Helsinki)
                Node: 10.78.69.120
                Price: 10
                SubDiv: Uusimaa
                SubDivCode: 18
                UN-LOCODE: FI HEL
                role: alphabet
        Node 2: 03c39e68254c9f935b499d25d9059e597ee60a7412805fda8d3f10d2aa5136dae9 ONLINE /ip4/192.168.201.21/tcp/8080 /ip4/192.168.201.121/tcp/8080
                Continent: Europe
                Country: Sweden
                CountryCode: SE
                ExternalAddr: /ip4/10.78.70.121/tcp/8080,/ip4/10.78.71.121/tcp/8080
                Location: Stockholm
                Node: 10.78.69.121
                Price: 10
                SubDiv: Stockholms län
                SubDivCode: AB
                UN-LOCODE: SE STO
                role: alphabet
        Node 3: 02546419d10f5efb9e7359d48421cb686b8098c3e4f76ea442dda6b3063cbc7c50 ONLINE /ip4/192.168.201.19/tcp/8080 /ip4/192.168.201.119/tcp/8080
                Continent: Europe
                Country: Russia
                CountryCode: RU
                ExternalAddr: /ip4/10.78.70.119/tcp/8080,/ip4/10.78.71.119/tcp/8080
                Location: Saint Petersburg (ex Leningrad)
                Node: 10.78.69.119
                Price: 10
                SubDiv: Sankt-Peterburg
                SubDivCode: SPE
                UN-LOCODE: RU LED
                role: alphabet
        Node 4: 03858e19395bf273b9b625d5d2e41c45b5723398a83b291b3a37d021614b0cace1 ONLINE /ip4/192.168.201.18/tcp/8080 /ip4/192.168.201.118/tcp/8080
                Continent: Europe
                Country: Russia
                CountryCode: RU
                ExternalAddr: /ip4/10.78.70.118/tcp/8080,/ip4/10.78.71.118/tcp/8080
                Location: Moskva
                Node: 10.78.69.118
                Price: 10
                SubDiv: Moskva
                SubDivCode: MOW
                UN-LOCODE: RU MOW
                role: alphabet
```
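A possible follow-up to separate "object is really missing" from "tree service entry is missing" is to list the objects stored in the affected container directly at the storage level (a sketch; the output is a set of OIDs that can then be inspected with `frostfs-cli object head`):

```bash
# List object IDs physically stored in the container behind the failing bucket.
sudo frostfs-cli --rpc-endpoint 10.78.70.119:8080 \
  --wallet /etc/frostfs/storage/wallet.json \
  object search --cid AvpLSGhmYcSywFdRgV5URn9DPnc81RfXrFsnHReqUHTn
```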

Context

Failover test under load with a reboot of one node.

Regression

Yes

Your Environment

HW
4 nodes

anikeev-yadro added the bug, triage labels 2023-08-08 13:45:33 +00:00
aarifullin self-assigned this 2023-08-11 07:11:21 +00:00
Member

I am going to describe how I tried to reproduce the bug.
The report says this bug occurred in the `HW` environment, so it seems that this is not a `SberCloud` env.

  1. I deployed pods with SberCloud and got 4 hosts:
  • 172.26.160.42 (data0: 172.26.161.24)
  • 172.26.160.155 (data0: 172.26.161.224)
  • 172.26.160.48 (data0: 172.26.161.169)
  • 172.26.160.227 (data0: 172.26.161.142)
  2. I generated creds:
```bash
frostfs-s3-authmate issue-secret \
--wallet $WS/xk6-frostfs/scenarios/files/wallet.json \
--peer '172.26.161.24:8080'  \
--gate-public-key 034dbbfc... \
--gate-public-key 0305d1cd... \
--gate-public-key 02b624d1... \
--gate-public-key 031c678... \
--bearer-rules $WS/xk6-frostfs/scenarios/files/rules.json --disable-impersonate \
--container-placement-policy "REP 2 IN X CBF 2 SELECT 2 FROM * AS X" \
--container-policy $WS/xk6-frostfs/scenarios/files/policy.json
```

Gate public keys for each node are obtained with:

```bash
sudo neo-go wallet dump-keys -w /etc/frostfs/s3/wallet.json | head -2 | tail -1
```
  3. Ran the preset:
```bash
scenarios/preset/preset_s3.py --no-verify-ssl --workers 50 --size 1024 --buckets 40 --out s3_64kb.json --endpoint 172.26.161.24,172.26.161.224,172.26.161.169,172.26.161.142 --preload_obj 10 --location node-off
```

Copied it (`cp s3_64kb.json scenarios/`).

  4. Ran k6:
```bash
./k6 run -e NO_VERIFY_SSL='True' -e K6_SETUP_TIMEOUT='5s' -e DURATION=6000 -e WRITERS=50 -e READERS=50 -e DELETERS=0 -e DELETE_AGE=10 -e WRITE_OBJ_SIZE=1000 -e S3_ENDPOINTS=https://172.26.161.24,https://172.26.161.224,https://172.26.161.169,https://172.26.161.142 -e PREGEN_JSON=s3_64kb.json scenarios/s3.js
```
  5. Tried to turn off the node via [the job](https://obj-jenkins.spb.yadro.com/job/sbercloud_power_manage/)

  6. I only got errors such as:

```
ERRO[16:58:21] operation error S3: GetObject, https response error StatusCode: 0, RequestID: , HostID: , request send failed, Get "https://172.26.161.142/b25d91b1-1c56-41d2-86a0-b590f5dc32e3/0788c2fa-2544-4040-ae46-f3a0823a8ebc?x-id=GetObject": dial tcp 172.26.161.142:443: connect: connection timed out  bucket=b25d91b1-1c56-41d2-86a0-b590f5dc32e3 endpoint="https://172.26.161.142" key=0788c2fa-2544-4040-ae46-f3a0823a8ebc
```
  7. After the node restart, everything is OK.

So, I could not reproduce the bug

@anikeev-yadro

  1. Why are there six endpoints (`https://10.78.70.119,https://10.78.71.118,https://10.78.70.118,https://10.78.71.119,https://10.78.70.120,https://10.78.71.120`)? I cannot tell which interfaces they stand for.
  2. The datacollect archive does not contain any journalctl logs.
  3. No deletion is performed (`DELETERS='0'`), and I suppose the 404 is not related to tree sync. Objects are re-read and re-written, so their versions change, but they should be available anyway.

I think I am missing something in this scenario.

fyrchik added this to the v0.37.0 milestone 2023-08-16 07:04:29 +00:00
Author
Member

Not reproduced on the latest build. I suggest closing this.
Member

Cannot reproduce: [see above](https://git.frostfs.info/TrueCloudLab/frostfs-node/issues/577#issuecomment-23413)