Validate advertised node addresses before adding to netmap #1497

Open · opened 2024-11-14 10:23:24 +00:00 by potyarkin · 3 comments
Member

## Is your feature request related to a problem? Please describe.

I've encountered all sorts of weird problems (OOMs, cryptic errors returned for PUT requests, etc.) due to a misconfiguration on my part: storage nodes were (mis)configured to advertise both their own and their neighbors' addresses in `node.addresses[]` ([config/example/node.yaml#L27-L31](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/d77a218f7c1a449369eb6d63e00ae1906984aed4/config/example/node.yaml#L27-L31)):

```yaml
addresses: # list of addresses announced by Storage node in the Network map
- s01.frostfs.devenv:8080
- /dns4/s02.frostfs.devenv/tcp/8081
- grpc://127.0.0.1:8082
- grpcs://localhost:8083
```

## Describe the solution you'd like

Let's discuss whether innerring should intervene and gracefully handle such scenarios. This is especially relevant for public FrostFS deployments where untrusted actors may intentionally add misconfigured storage nodes to the network.

The innerring node could check (a) whether the advertised address is responsive and (b) whether the node replying at that address is the one advertising it. I think that dropping unresponsive addresses from the netmap is a step too far (e.g. nodes may want to advertise their LAN address for local peering), but what about dropping addresses whose replies are signed with the wrong key? Theoretically there is a chance of false positives (a LAN address collision between different LANs), but is that risk significant?
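
For concreteness, here is a minimal sketch of check (b) in Go. `KeyFetcher` and everything else below are hypothetical stand-ins, not types from the frostfs-node codebase; a real implementation would reuse the node's gRPC client and signature verification.

```go
// Sketch of check (b): does the node answering at an advertised address
// sign with the same public key that is registered in the netmap?
package addrcheck

import (
	"bytes"
	"context"
	"fmt"
	"time"
)

// KeyFetcher is a hypothetical client: it dials addr and returns the
// compressed public key that the responding node signed its reply with.
type KeyFetcher interface {
	FetchPublicKey(ctx context.Context, addr string) ([]byte, error)
}

// CheckAdvertisedAddress returns (true, nil) when the responder signs with
// wantKey and (false, nil) when it signs with a different key. A dial or
// timeout error is returned as-is: an unresponsive address is inconclusive,
// not invalid, so LAN-only addresses are left alone.
func CheckAdvertisedAddress(kf KeyFetcher, addr string, wantKey []byte) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	gotKey, err := kf.FetchPublicKey(ctx, addr)
	if err != nil {
		return false, fmt.Errorf("address %s did not respond: %w", addr, err)
	}
	return bytes.Equal(gotKey, wantKey), nil
}
```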

## Describe alternatives you've considered

Fixing all the dysfunctional behaviors caused by nodes advertising wrong addresses in the netmap would be quite an effort, but I guess that's still an alternative to consider.

## Additional context

  • Example of a cryptic error caused by misconfigured storage nodes ([private chat](https://chat.yadro.com/yadro/pl/qou9kkb49iy49joqb6rop1mcfh)):

    ```
    status: code = 1024 message = incomplete object PUT by placement: internal/key.go: public key is different from the key in the network map: want 0383cafefa22109a9c1a0feac60d0ca464bcd5432adfad35b863d08e81e2790bbf, got 02604799e1413e07d2e749c05401abaaf15571856b20db488380dd59c8e5f2a79e
    ```
potyarkin added the enhancement, discussion, frostfs-ir, P3, triage labels 2024-11-14 10:23:24 +00:00
potyarkin changed title from "innerring: Validate advertised node addresses before adding to netmap" to "Validate advertised node addresses before adding to netmap" 2024-11-14 10:23:50 +00:00
Owner

I think this should be solved with the reputation system. The problems with a single validation event:

  1. IR cannot distinguish its own problems from the problems of a storage node (aka https://downforeveryoneorjustme.com).
  2. Addresses can be unavailable from time to time, so some history needs to be accumulated before kicking the node (see the sketch after this list).
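
A minimal sketch of what accumulating that history per address could look like; the window size and failure threshold are made-up illustration values, not anything from frostfs-ir:

```go
// Sketch: per-address availability history, so a node is only kicked
// after repeated failures rather than after a single failed probe.
package addrcheck

// History keeps the results of the last Window probes for one address.
type History struct {
	Window  int    // how many recent probes to keep (assumed value, e.g. 10)
	results []bool // true = probe succeeded
}

// Record appends one probe result, discarding the oldest when over capacity.
func (h *History) Record(ok bool) {
	h.results = append(h.results, ok)
	if len(h.results) > h.Window {
		h.results = h.results[1:]
	}
}

// ShouldKick reports whether a full window has accumulated and more than
// 80% of its probes failed; the threshold is an assumption for the sketch.
func (h *History) ShouldKick() bool {
	if len(h.results) < h.Window {
		return false // not enough history yet: never conclusive
	}
	failures := 0
	for _, ok := range h.results {
		if !ok {
			failures++
		}
	}
	return float64(failures) > 0.8*float64(len(h.results))
}
```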

To prevent misconfiguration, we may enable some validation on the node itself, like matching the `node.addresses` section with `grpc.endpoints`. However, this is not possible in the general case, and I am not sure any specific rule will be useful.
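
For illustration, a sketch of such a node-side startup check, assuming plain `host:port` entries (advertised addresses may also be multiaddrs like `/dns4/.../tcp/8081`, which is one reason no strict rule works in the general case):

```go
// Sketch of a best-effort node-side check: warn when an advertised address
// uses a port that no local gRPC endpoint listens on. It can only warn,
// because an advertised DNS name or public IP legitimately differs from
// the local bind address.
package addrcheck

import "net"

func WarnUnmatchedAddresses(advertised, endpoints []string, warnf func(format string, args ...any)) {
	localPorts := make(map[string]struct{})
	for _, e := range endpoints {
		if _, port, err := net.SplitHostPort(e); err == nil {
			localPorts[port] = struct{}{}
		}
	}
	for _, a := range advertised {
		_, port, err := net.SplitHostPort(a)
		if err != nil {
			warnf("cannot parse advertised address %q: %v", a, err)
			continue
		}
		if _, ok := localPorts[port]; !ok {
			warnf("advertised address %q does not match any local gRPC endpoint port", a)
		}
	}
}
```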

Owner

> Example of cryptic error caused by misconfigured storage nodes

@potyarkin could you elaborate on what specific misconfiguration caused such an error message?
Each node has only one key, and it should already be validated during bootstrap.

Author
Member

My misconfigured nodes advertised the addresses of all their neighbors as their own, i.e.:

```yaml
# storage-node-01
addresses:
- storage-node-01:8802
- storage-node-02:8802
- storage-node-03:8802
- storage-node-04:8802

# storage-node-02
addresses:
- storage-node-01:8802
- storage-node-02:8802
- storage-node-03:8802
- storage-node-04:8802
```

So when a client wanted to talk specifically to `storage-node-01`, it would in fact connect to a random node from the cluster, sometimes the correct one but often not. Cryptic error messages are not the worst that could happen: I think (but cannot prove) that the OOMs in my cluster were also caused by this.

This was an honest mistake and I should've read the docs better, but the same thing can also be done with malice. If untrusted actors are allowed to add storage nodes (as in the planet-wide FrostFS scenario), they can configure their nodes to knowingly advertise the addresses of other, well-behaved nodes, and this will wreak all sorts of havoc, not only on the misconfigured nodes but throughout the whole FrostFS network.

Reference: TrueCloudLab/frostfs-node#1497