We can still lock the (*Server).run with dead peers:
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: goroutine 40 [select, 871 minutes]:
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).putPacketIntoQueue(0xc030ab5320, 0xc02f251f20, 0xc00af0dcc0, 0x18, 0x40, 0x100000000000000, 0xffffffffffffffff)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:82 +0xf4
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).EnqueueHPPacket(0xc030ab5320, 0xc00af0dcc0, 0x18, 0x40, 0x1367240, 0xc03090ef98)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:124 +0x52
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*Server).iteratePeersWithSendMsg(0xc0000ca000, 0xc00af35800, 0xcb2a58, 0x0)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:720 +0x12a
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*Server).broadcastHPMessage(...)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:731
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*Server).run(0xc0000ca000)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:203 +0xee4
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*Server).Start(0xc0000ca000, 0xc000072ba0)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:173 +0x2ec
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: created by github.com/CityOfZion/neo-go/cli/server.startServer
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/cli/server/server.go:331 +0x476
...
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: goroutine 2199 [chan send, 870 minutes]:
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).Disconnect.func1()
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:366 +0x85
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: sync.(*Once).Do(0xc030ab403c, 0xc02f262788)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/usr/local/go/src/sync/once.go:44 +0xb3
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).Disconnect(0xc030ab4000, 0xd92440, 0xc000065a00)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:365 +0x6d
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).SendPing.func1()
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:394 +0x42
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: created by time.goFunc
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/usr/local/go/src/time/sleep.go:169 +0x44
...
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: goroutine 3448 [chan send, 854 minutes]:
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).handleConn(0xc01ed203f0)
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:143 +0x6c
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: created by github.com/CityOfZion/neo-go/pkg/network.(*TCPTransport).Accept
Feb 13 16:14:50 neo-go-node-2 neo-go[9448]: #011/go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_transport.go:62 +0x44c
...
The problem is that the select in putPacketIntoQueue() only works the way it
was intended to after the `close(p.done)`, but that happens only after
successful unregistration request send. Thus, do disconnects the other way
around, first unblock queueing and exit goroutines, then destroy the
connection (if it wasn't previously destroyed) and only after that signal to
the Server.
We can leak sending goroutines and stall broadcasts because of already gone
peers that happened to be cached by some s.Peers() user (more than 800 of
these can be seen in nodoka log along with (*Server).run blocking on
CMDGetAddr send):
Feb 10 16:35:15 nodoka neo-go[1563]: goroutine 41 [chan send, 3320 minutes]:
Feb 10 16:35:15 nodoka neo-go[1563]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).putPacketIntoQueue(...)
Feb 10 16:35:15 nodoka neo-go[1563]: /go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:81
Feb 10 16:35:15 nodoka neo-go[1563]: github.com/CityOfZion/neo-go/pkg/network.(*TCPPeer).EnqueueHPPacket(0xc0083d57a0, 0xc017206100, 0x18, 0x40, 0x136a240, 0xc018ef9720)
Feb 10 16:35:15 nodoka neo-go[1563]: /go/src/github.com/CityOfZion/neo-go/pkg/network/tcp_peer.go:119 +0x98
Feb 10 16:35:15 nodoka neo-go[1563]: github.com/CityOfZion/neo-go/pkg/network.(*Server).iteratePeersWithSendMsg(0xc0000ca000, 0xc0001848a0, 0xcb4550, 0x0)
Feb 10 16:35:15 nodoka neo-go[1563]: /go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:720 +0x12a
Feb 10 16:35:15 nodoka neo-go[1563]: github.com/CityOfZion/neo-go/pkg/network.(*Server).broadcastHPMessage(...)
Feb 10 16:35:15 nodoka neo-go[1563]: /go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:731
Feb 10 16:35:15 nodoka neo-go[1563]: github.com/CityOfZion/neo-go/pkg/network.(*Server).run(0xc0000ca000)
Feb 10 16:35:15 nodoka neo-go[1563]: /go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:203 +0xee4
Feb 10 16:35:15 nodoka neo-go[1563]: github.com/CityOfZion/neo-go/pkg/network.(*Server).Start(0xc0000ca000, 0xc000072c60)
Feb 10 16:35:15 nodoka neo-go[1563]: /go/src/github.com/CityOfZion/neo-go/pkg/network/server.go:173 +0x2ec
Feb 10 16:35:15 nodoka neo-go[1563]: created by github.com/CityOfZion/neo-go/cli/server.startServer
Feb 10 16:35:15 nodoka neo-go[1563]: /go/src/github.com/CityOfZion/neo-go/cli/server/server.go:331 +0x476
We can only add one block of the given height and we have two competing
goroutines to do that --- consensus and block queue. Whomever adds the block
first shouldn't trigger an error in another one.
Fix block relaying for blocks added via the block queue also, previously one
consensus-generated blocks were broadcasted.
Eliminate races between tx checks and adding them to the mempool, ensure the
chain doesn't change while we're working with the new tx. Ensure only one
block addition attempt could be in progress.
It can lead to some goroutine explosion, but supposedly it's better than
stalling other processing and eventually all of these goroutines should finish
their sends. Note that this doesn't change the behavior for RPC-relayed
transactions that are still waiting for the broadcast to finish ensuring
proper transaction distribution before returning the result to the client.
If we have already got Version message, we don't need the rest of handshake to
complete before being able to properly answer the PeerAddr() requests. Fixes
some duplicate connections between machines.
This one is designed to give more priority to direct nodes communication, that
is that their messaging would have more priority than generic broadcasts. It
should improve consensus process under TX pressure and allow to handle
pings in time (preventing disconnects).
They have the opposite order, height first and nonce second. It was done wrong
in 4e6ed902 and never fixed since. Fixes sending wrong peer state leading to
useless getheaders messages (and disconnects when the other side is lagging
behind).
We can have more than one connection attempt in progress and not yet completed
the handshake, so if there is a Version already received we should look it.
Our node was too pingy because of wrong timer setups (that divided timeout
Duration by time.Second), it also was wrong in its time calculations (using
UTC time to calculate intervals). At the same time missing block is a
server-wide problem, so it's better solved with server-wide protocol loop.
1) Make timeout a timeout, don't do magic ping counts.
2) Drop additional timer from the main peer's protocol loop, create it
dynamically and make it disconnect the peer.
3) Don't expose the ping counter to the outside, handle more logic inside the
Peer.
Relates to #430.
We don't and we won't have synchronized clocks in the network so the only
timestamp that we can compare our local time with is the one made
ourselves. What this ping mechanism is used for is to recover from missing the
block broadcast, thus it's appropriate for it to trigger after X seconds of
the local time since the last block received.
Relates to #430.
In reality it will never be true exactly in the case where we want this ping
mechanism to work --- when the node failed to get a block from the net. It
won't get the header either and thus its block height will be equal to header
height. The only moment when this condition is met is when the node does
initial synchronization and this synchronization works just fine without any
pings.
Relates to #430.
Two queues for high-priority and ordinary messages. Fixes#590. These queues
are deliberately made small to avoid buffer bloat problem, there is gonna be
another queueing layer above them to compensate for that. The queues are
designed to be synchronous in enqueueing, async capabilities are to be added
layer above later.
add pingInterval same as used in ref C# implementation with the same logic
add pingTimeout which is used to check whether pong received. If not -- drop the peer.
add pingLimit which is hardcoded to 4 in TCPPeer. It's limit for unsuccessful ping/pong calls (where pong wasn't received in pingTimeout interval)
It wasn't actually requesting transactions but rather sending an inventory
message telling everyone that we have them which is completely wrong and
easily leads to ChangeView that could be avoided.