Commit graph

56 commits

Author SHA1 Message Date
Roman Khimov
e80c60a3b9 network: rework broadcast logic
We have a number of queues for different purposes:
 * regular broadcast queue
 * direct p2p queue
 * high-priority queue

And two basic egress scenarios:
 * direct p2p messages (replies to requests in Server's handle* methods)
 * broadcasted messages

Low priority broadcasted messages:
 * transaction inventories
 * block inventories
 * notary inventories
 * non-consensus extensibles

High-priority broadcasted messages:
 * consensus extensibles
 * getdata transaction requests from consensus process
 * getaddr requests

P2P messages are a bit more complicated, most of the time they use p2p queue,
but extensible message requests/replies use HP queue.

Server's handle* code is run from Peer's handleIncoming, every peer has this
thread that handles incoming messages. When working with the peer it's
important to reply to requests and blocking this thread until we send (queue)
a reply is fine, if the peer is slow we just won't get anything new from
it. The queue used is irrelevant wrt this issue.

Broadcasted messages are radically different, we want them to be delivered to
many peers, but we don't care about specific ones. If it's delivered to 2/3 of
the peers we're fine, if it's delivered to more of them --- it's not an
issue. But doing this fairly is not an easy thing, current code tries performing
unblocked sends and if this doesn't yield enough results it then blocks (but
has a timeout, we can't wait indefinitely). But it does so in sequential
manner, once the peer is chosen the code will wait for it (and only it) until
timeout happens.

What can be done instead is an attempt to push the message to all of the peers
simultaneously (or close to that). If they all deliver --- OK, if some block
and wait then we can wait until _any_ of them pushes the message through (or
global timeout happens, we still can't wait forever). If we have enough
deliveries then we can cancel pending ones and it's again not an error if
these canceled threads still do their job.

This makes the system more dynamic and adds some substantial processing
overhead, but it's a networking code, any of this overhead is much lower than
the actual packet delivery time. It also allows to spread the load more
fairly, if there is any spare queue it'll get the packet and release the
broadcaster. On the next broadcast iteration another peer is more likely to be
chosen just because it didn't get a message previously (and had some time to
deliver already queued messages).

It works perfectly in tests, with optimal networking conditions we have much
better block times and TPS increases by 5-25%% depending on the scenario.

I'd go as far as to say that it fixes the original problem of #2678, because
in this particular scenario we have empty queues in ~100% of the cases and
this new logic will likely lead to 100% fan out in this case (cancelation just
won't happen fast enough). But when the load grows and there is some waiting
in the queue it will optimize out the slowest links.
2022-10-11 18:42:40 +03:00
Roman Khimov
dabdad20ad network: don't wait indefinitely for packet to be sent
Peers can be slow, very slow, slow enough to affect node's regular
operation. We can't wait for them indefinitely, there has to be a timeout for
send operations.

This patch uses TimePerBlock as a reference for its timeout. It's relatively
big and it doesn't affect tests much, 4+1 scenarios tend to perform a little
worse with while 7+2 scenarios work a little better. The difference is in some
percents, but all of these tests easily have 10-15% variations from run to
run.

It's an important step in making our gossip better because we can't have any
behavior where neighbors directly block the node forever, refs. #2678 and
2022-10-10 22:15:21 +03:00
Roman Khimov
4f3ffe7290 golangci: enable errorlint and fix everything it found 2022-09-02 18:36:23 +03:00
Elizaveta Chichindaeva
28908aa3cf [#2442] English Check
Signed-off-by: Elizaveta Chichindaeva <elizaveta@nspcc.ru>
2022-05-04 19:48:27 +03:00
Roman Khimov
60d6fa1125 network: keep a copy of the config inside of Server
Avoid copying the configuration again and again, make things a bit more
efficient.
2022-01-24 18:43:01 +03:00
Roman Khimov
774dee3cd4 network: fix disconnection race between handleConn() and handleIncoming()
handleIncoming() winning the race for p.Disconnect() call might lead to nil
error passed as the reason for peer unregistration.
2021-11-01 12:20:55 +03:00
Anna Shaleva
d67ff30704 core: implement statesync module
And support GetMPTData and MPTData P2P commands.
2021-09-07 19:43:27 +03:00
Roman Khimov
f78bd6474f network: handle incoming message in a separate goroutine
Network communication takes time. Handling some messages (like transaction)
also takes time. We can share this time by making handler a separate
goroutine. So while message is being handled receiver can already get and
parse the next one.

It doesn't improve metrics a lot, but still I think it makes sense and in some
scenarios this can be more beneficial than this.

e41fc2fd1b, 4 nodes, 10 workers

RPS    6732.979 6396.160 6759.624 6246.398 6589.841 ≈ 6545   ± 3.02%
TPS    6491.062 5984.190 6275.652 5867.477 6360.797 ≈ 6196   ± 3.77%
CPU %    42.053   43.515   44.768   40.344   44.112 ≈   43.0 ± 3.69%
Mem MB 2564.130 2744.236 2636.267 2589.505 2765.926 ≈ 2660   ± 3.06%

Patched:

RPS    6902.296 6465.662 6856.044 6785.515 6157.024 ≈ 6633   ± 4.26% ↑ 1.34%
TPS    6468.431 6218.867 6610.565 6288.596 5790.556 ≈ 6275   ± 4.44% ↑ 1.28%
CPU %    50.231   42.925   49.481   48.396   42.662 ≈   46.7 ± 7.01% ↑ 8.60%
Mem MB 2856.841 2684.103 2756.195 2733.485 2422.787 ≈ 2691   ± 5.40% ↑ 1.17%
2021-08-06 19:37:37 +03:00
Roman Khimov
601841ef35 *: drop unused structure fields
Found by structcheck:
 `good` is unused (structcheck)
and alike.
2021-05-12 19:41:23 +03:00
Roman Khimov
0888cf9ed2 network: drop Network from Message
It's not used any more.
2021-03-26 13:45:18 +03:00
Evgenii Stratonikov
84a3474fc5 network: set timeout on write
Fix a bug occuring under high load when node
hangs during this write.
2020-12-25 14:36:53 +03:00
Evgenii Stratonikov
0a5049658f network: support non-blocking broadcast
Right now a single slow peer can slow down whole network.
Do broadcast in 2 parts:
1. Perform non-blocking send to all peers if possible.
2. Perform blocking sends until message is sent to 2/3 of good peers.
2020-12-25 14:36:52 +03:00
Roman Khimov
2ce3c8b75f network: treat unsolicited addr commands as errors
See neo-project/neo#2097.
2020-11-25 13:34:38 +03:00
Evgenii Stratonikov
1869d6d460 core: allow to use state root in header 2020-11-20 17:16:32 +03:00
Roman Khimov
c8cc91eeee network: request blocks when there is a ping with bigger than ours height
Turns out, C# node no longer broadcasts an Inv when it's creating a block,
instead it sends a ping and if we're not paying attention to the height
specified there we're technically missing a new block. Of course we'll get it
later after ping timer expiration and regular ping/pong sequence, but that's
delaying it for no good reason.
2020-08-14 16:22:15 +03:00
Roman Khimov
0e2784cd2c always wrap errors when creating new ones with fmt.Errorf()
It doesn't really change anything in most of the cases, but it's a useful
habit anyway.

Fix #350.
2020-08-07 12:21:52 +03:00
Anna Shaleva
6c8accf18c core, network: request blocks instead of headers
Closes #1192

1. We now have CMDGetBlockByIndex, so there's no need to request headers
   first when we can just ask for blocks.
2. We don't ask for headers (i.e. we don't send CMDGetHeaders),
   consequently, we shouldn't react on CMDHeaders.
3. But we still keep on reacting on CMDGetHeaders command as
   there could be a node which needs headers.
2020-08-04 17:52:34 +03:00
Roman Khimov
b483c38593 block/transaction: add network magic into the hash
We make it explicit in the appropriate Block/Transaction structures, not via a
singleton as C# node does. I think this approach has a bit more potential and
allows better packages reuse for different purposes.
2020-06-18 12:39:50 +03:00
Anna Shaleva
8c5c248e79 protocol: add capabilities to address payload
Part of #871
2020-05-27 19:02:25 +03:00
Anna Shaleva
c590cc02f4 protocol: add capabilities to version payload
closes #871
2020-05-27 19:01:14 +03:00
Anna Shaleva
3bcc56bdcf protocol: switch to binary MessageCommand
closes #888
2020-05-21 13:57:49 +03:00
Roman Khimov
e41d434a49 *: move all packages from CityOfZion to nspcc-dev 2020-03-03 17:21:42 +03:00
Roman Khimov
d1a2296939 network: change the disconnect procedure
We can still lock the (*Server).run with dead peers:

Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: goroutine 40 [select, 871 minutes]:
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).putPacketIntoQueue(0xc030ab5320, 0xc02f251f20, 0xc00af0dcc0, 0x18, 0x40, 0x100000000000000, 0xffffffffffffffff)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:82 +0xf4
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).EnqueueHPPacket(0xc030ab5320, 0xc00af0dcc0, 0x18, 0x40, 0x1367240, 0xc03090ef98)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:124 +0x52
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*Server).iteratePeersWithSendMsg(0xc0000ca000, 0xc00af35800, 0xcb2a58, 0x0)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:720 +0x12a
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*Server).broadcastHPMessage(...)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:731
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*Server).run(0xc0000ca000)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:203 +0xee4
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*Server).Start(0xc0000ca000, 0xc000072ba0)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:173 +0x2ec
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: created by github.com/CityOfZion/neo-go/cli/server.startServer
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/cli/server/server.go:331 +0x476
...
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: goroutine 2199 [chan send, 870 minutes]:
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).Disconnect.func1()
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:366 +0x85
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: sync.(*Once).Do(0xc030ab403c, 0xc02f262788)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/usr/local/go/src/sync/once.go:44 +0xb3
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).Disconnect(0xc030ab4000, 0xd92440, 0xc000065a00)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:365 +0x6d
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).SendPing.func1()
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:394 +0x42
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: created by time.goFunc
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/usr/local/go/src/time/sleep.go:169 +0x44
...
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: goroutine 3448 [chan send, 854 minutes]:
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).handleConn(0xc01ed203f0)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:143 +0x6c
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: created by github.com/CityOfZion/neo-go/pkg/network.(*TCPTransport).Accept
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_transport.go:62 +0x44c
...

The problem is that the select in putPacketIntoQueue() only works the way it
was intended to after the `close(p.done)`, but that happens only after
successful unregistration request send. Thus, do disconnects the other way
around, first unblock queueing and exit goroutines, then destroy the
connection (if it wasn't previously destroyed) and only after that signal to
the Server.
2020-02-13 16:24:46 +03:00
Roman Khimov
7ee8f9c5d8 network: fix networking stalls caused by stale peers
We can leak sending goroutines and stall broadcasts because of already gone
peers that happened to be cached by some s.Peers() user (more than 800 of
these can be seen in nodoka log along with (*Server).run blocking on
CMDGetAddr send):

Feb 10 16:35:15 nodoka neo-go[1563]: goroutine 41 [chan send, 3320 minutes]:
Feb 10 16:35:15 nodoka neo-go[1563]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).putPacketIntoQueue(...)
Feb 10 16:35:15 nodoka neo-go[1563]:         /go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:81
Feb 10 16:35:15 nodoka neo-go[1563]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).EnqueueHPPacket(0xc0083d57a0, 0xc017206100, 0x18, 0x40, 0x136a240, 0xc018ef9720)
Feb 10 16:35:15 nodoka neo-go[1563]:         /go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:119 +0x98
Feb 10 16:35:15 nodoka neo-go[1563]: github.com/CityOfZion/neo-go/pkg/network.(*Server).iteratePeersWithSendMsg(0xc0000ca000, 0xc0001848a0, 0xcb4550, 0x0)
Feb 10 16:35:15 nodoka neo-go[1563]:         /go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:720 +0x12a
Feb 10 16:35:15 nodoka neo-go[1563]: github.com/CityOfZion/neo-go/pkg/network.(*Server).broadcastHPMessage(...)
Feb 10 16:35:15 nodoka neo-go[1563]:         /go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:731
Feb 10 16:35:15 nodoka neo-go[1563]: github.com/CityOfZion/neo-go/pkg/network.(*Server).run(0xc0000ca000)
Feb 10 16:35:15 nodoka neo-go[1563]:         /go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:203 +0xee4
Feb 10 16:35:15 nodoka neo-go[1563]: github.com/CityOfZion/neo-go/pkg/network.(*Server).Start(0xc0000ca000, 0xc000072c60)
Feb 10 16:35:15 nodoka neo-go[1563]:         /go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:173 +0x2ec
Feb 10 16:35:15 nodoka neo-go[1563]: created by github.com/CityOfZion/neo-go/cli/server.startServer
Feb 10 16:35:15 nodoka neo-go[1563]:         /go/src/github.com/CityOfZion/neo-go/cli/server/server.go:331 +0x476
2020-02-10 18:47:52 +03:00
Roman Khimov
fdbaac7a30 network: prevent broadcast queue starving, share time with p2p
Blocked broadcast queue of one peer may affect broadcasting capabilities of
the server, so prevent total blocking of it by p2p queue.
2020-01-30 14:03:52 +03:00
Roman Khimov
b2c4587dad network: fix PeerAddr() for not-yet-handshaked case
If we have already got Version message, we don't need the rest of handshake to
complete before being able to properly answer the PeerAddr() requests. Fixes
some duplicate connections between machines.
2020-01-30 14:03:52 +03:00
Roman Khimov
9eafec0d1d network: introduce peer-to-peer message queue
This one is designed to give more priority to direct nodes communication, that
is that their messaging would have more priority than generic broadcasts. It
should improve consensus process under TX pressure and allow to handle
pings in time (preventing disconnects).
2020-01-30 14:03:52 +03:00
Roman Khimov
1c28dd2567 network: add message type to disconnect error message
If it was caused by message processing, but only after the handshake to
preserve errIdenticalID and other handshaking errors.
2020-01-30 14:03:52 +03:00
Roman Khimov
06c3fbe455 network: rework ping sends, fix overpinging
Our node was too pingy because of wrong timer setups (that divided timeout
Duration by time.Second), it also was wrong in its time calculations (using
UTC time to calculate intervals). At the same time missing block is a
server-wide problem, so it's better solved with server-wide protocol loop.
2020-01-28 17:39:52 +03:00
Roman Khimov
99dfdc19e7 network: drop now useless addrReq queue from the server
Just broadcast a high-priority message to everyone.
2020-01-22 11:28:59 +03:00
Roman Khimov
34b863d645 network: introduce Server's MkMsg()
That wraps NewMessage() for a configured network.
2020-01-21 17:31:51 +03:00
Roman Khimov
1f672e0da7 network: move SendVersion() to the Peer
Only leave server-specific `getVersionMsg()` in the Server, all the other
logic is peer-related.
2020-01-21 17:26:08 +03:00
Roman Khimov
2c4ace022e network/config: redesign ping timeout handling a bit
1) Make timeout a timeout, don't do magic ping counts.
2) Drop additional timer from the main peer's protocol loop, create it
   dynamically and make it disconnect the peer.
3) Don't expose the ping counter to the outside, handle more logic inside the
   Peer.

Relates to #430.
2020-01-20 19:37:17 +03:00
Roman Khimov
62092c703d network: use local timestamp to decide when to ping
We don't and we won't have synchronized clocks in the network so the only
timestamp that we can compare our local time with is the one made
ourselves. What this ping mechanism is used for is to recover from missing the
block broadcast, thus it's appropriate for it to trigger after X seconds of
the local time since the last block received.

Relates to #430.
2020-01-20 19:37:17 +03:00
Roman Khimov
a8252ecc05 network: remove wrong ping condition
In reality it will never be true exactly in the case where we want this ping
mechanism to work --- when the node failed to get a block from the net. It
won't get the header either and thus its block height will be equal to header
height. The only moment when this condition is met is when the node does
initial synchronization and this synchronization works just fine without any
pings.

Relates to #430.
2020-01-20 19:37:17 +03:00
Roman Khimov
247cfa4165 network: either request blocks or ping a peer, but not both
It makes to sense to do both actions, pings are made for a different purpose.

Relates to #430.
2020-01-20 19:37:17 +03:00
Roman Khimov
0ba6b2a754 network: introduce peer sending queues
Two queues for high-priority and ordinary messages. Fixes #590. These queues
are deliberately made small to avoid buffer bloat problem, there is gonna be
another queueing layer above them to compensate for that. The queues are
designed to be synchronous in enqueueing, async capabilities are to be added
layer above later.
2020-01-20 17:23:26 +03:00
Roman Khimov
7f0882767c network: remove useless Done() method from the peer
It's internal state of the peer that no one should care about.
2020-01-20 17:23:26 +03:00
Roman Khimov
f39d5d5a10 network: fix unregistration on peer Disconnect
It should always signal to the server, not duplicating this send and not
missing it like it happened in the Server.run().
2020-01-20 17:23:26 +03:00
Roman Khimov
907a236285 network: move per-peer goroutines into the TCPPeer
As they're directly tied to it.
2020-01-20 17:23:26 +03:00
Vsevolod Brekelov
4e6ed9021c network: add ping pong processing
add pingInterval same as used in ref C# implementation with the same logic
add pingTimeout which is used to check whether pong received. If not -- drop the peer.
add pingLimit which is hardcoded to 4 in TCPPeer. It's limit for unsuccessful ping/pong calls (where pong wasn't received in pingTimeout interval)
2020-01-17 13:24:14 +03:00
Evgenii Stratonikov
e3098ed0f8 network: write messages atomically
Right now message can be written in several Write's so
concurrent calls of writeMsg() can in theory interleave.
This commit fixes it.

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2019-11-18 09:31:00 +03:00
Roman Khimov
d7f747fa9a network: wait for both Version messages before ACKing
Otherwise the node might crash in `startProtocol` because of missing Version
field in the peer. And it also keeps the sequence correct, Version MUST be
sent first and ACKs can only follow it.
2019-11-06 18:05:50 +03:00
Roman Khimov
ec76ed23a5 network: rework peer handshaking, fix #458
This allows to start handshaking from both client and server (mainnet/testnet
nodes were seen to not care about string ordering for it), but still maintains
some sane checks in the process. It also makes functions thread-safe because
we have two goroutines servicing read and write side of the Peer connection,
so they can clash on access to the struct fields.

Add a test for it also.
2019-11-06 15:29:58 +03:00
Roman Khimov
e859e03240 network: split Peer's NetAddr into RemoteAddr and PeerAddr
As they are different things used for different purposes.
2019-11-06 15:26:24 +03:00
Roman Khimov
5bf00db2c9 io: move BinReader/BinWriter there, redo Serializable with it
The logic here is that we'll have all binary encoding/decoding done via our io
package, which simplifies error handling. This functionality doesn't belong to
util, so it's moved.

This also expands BufBinWriter with Reset() method to fit the needs of core
package.
2019-09-16 23:39:51 +03:00
Roman Khimov
d3bb8ddf8f network: handle errors and connection close more correctly
This makes writer side handle errors properly and fixes communication between
reader and writer goroutine to always correctly unregister the peer. This is
especially important for the case where error occurs before handshake
completes as in this case we don't even have goroutine in startProtocol()
running.
2019-09-16 16:32:04 +03:00
Roman Khimov
76c7cff67f network: make node strictly follow handshake procedure
Don't accept other messages before handshake is completed, check handshake
message sequence.
2019-09-16 16:32:04 +03:00
Roman Khimov
c6487423ae network: close connection on disconnect
If it's already closed, this won't hurt, but in the case of logical error it
saves us from leaking this connection (and potentially, peer).
2019-09-16 16:26:30 +03:00
Roman Khimov
8d9bc83214 util: drop Endpoint structure, fix #321
I think it's useless, buggy and hides parsing errors for no good reason.
2019-09-09 17:54:38 +03:00