Small (especially dockerized/virtualized) networks often start all nodes at
once and then we see a lot of connection flapping in the log. This happens
because nodes try to connect to each other simultaneously, establish two
connections, and then each one finds a duplicate and drops it, but the two
sides may pick different duplicates to drop, so they retry and this goes on
for some time. Eventually everything settles, but we get a lot of garbage in
the log and a lot of useless attempts.
This random waiting timeout doesn't change the logic much and adds only a
minimal delay, but it increases the chances for both nodes to establish a
single proper connection on both sides and only then see another one and drop
it on both sides as well. This leads to almost no flapping in small networks
and doesn't affect bigger ones much. The delay is close to unnoticeable,
especially if there is something in the DB for the node to process during
startup.
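A minimal sketch of the idea; the helper, its parameters and the timeout value
are hypothetical, not the actual neo-go code:

```go
package network

import (
	"math/rand"
	"net"
	"time"
)

// dialWithJitter waits a small random time before dialing, so that two nodes
// starting at the same moment are unlikely to open connections to each other
// simultaneously (which would then be detected as duplicates and dropped on
// both sides).
func dialWithJitter(addr string, maxJitter time.Duration) (net.Conn, error) {
	if maxJitter > 0 {
		time.Sleep(time.Duration(rand.Int63n(int64(maxJitter))))
	}
	return net.DialTimeout("tcp", addr, 10*time.Second)
}
```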
Consider mainnet: it has an AttemptConnPeers of 20, so a node may already have
3 peers and request 20 more, then get a 4th connected and attempt 20 more
again, which easily leads to a huge number of connections.
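A hedged sketch of the kind of cap this implies; the function and its
parameters are made up for illustration:

```go
// addrsToRequest limits how many new addresses we ask the discoverer for:
// only as many as are needed to reach minPeers, never more than
// attemptConnPeers at a time.
func addrsToRequest(minPeers, attemptConnPeers, connected, pending int) int {
	need := minPeers - connected - pending
	if need <= 0 {
		return 0
	}
	if need > attemptConnPeers {
		return attemptConnPeers
	}
	return need
}
```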
* treat connected/handshaked peers separately in the discoverer, and save the
"original" address for connected ones; it can be a name instead of an IP and
it's important to keep it to avoid reconnections
* store a name->IP mapping for seeds if and when they're connected, to avoid
reconnections
* block a seed if it's detected to be our own node (which is often the case
for small private networks)
* add an event for handshaked peers in the server; connected but
non-handshaked ones are not really helpful for MinPeers or GetAddr logic
(see the sketch below)
Fixes #2796.
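A toy illustration of the handshaked-peers accounting; the type and its fields
are hypothetical:

```go
package network

// server is a toy stand-in: only peers that have finished the version/verack
// handshake are delivered on the handshaked channel and counted toward
// MinPeers; a bare TCP connection is not useful for MinPeers or GetAddr yet.
type server struct {
	handshaked chan string // peer address, sent once the handshake completes
	minPeers   int
	good       int
}

func (s *server) run(requestMore func(n int)) {
	for range s.handshaked {
		s.good++
		if s.good < s.minPeers {
			// Still short of useful peers, ask the discoverer for more.
			requestMore(s.minPeers - s.good)
		}
	}
}
```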
When the network is big enough, MinPeers may be suboptimal for good network
connectivity, but if we know the network size we can estimate how many peers
are sufficient.
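A hypothetical illustration of such an estimation; the square-root heuristic
here is just an example, not the formula actually used:

```go
package network

import "math"

// sufficientPeers grows the peer target slowly with the known network size
// (square root here, purely for illustration), but never drops below the
// configured MinPeers.
func sufficientPeers(minPeers, networkSize int) int {
	est := int(math.Sqrt(float64(networkSize)))
	if est < minPeers {
		return minPeers
	}
	return est
}
```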
1) It duplicates the registration done in the `version` message handler, and
no valid connection can work without version exchange.
2) On public networks we have seed nodes defined by names, so we register
connections to them using these names, but then if a connection is dropped we
delist them by IP:PORT combination, which can lead to zero PeerCount() with
all seeds still being registered as connected in the discovery subsystem and
thus no reconnection attempts being made (see the sketch below).
It happens from time to time in a four-node private network where there are
seeds (aka CNs) and not a lot of other nodes to connect to.
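A hedged sketch of the kind of bookkeeping that avoids this mismatch; the
names and types are invented for illustration:

```go
package network

import "sync"

// seedBook remembers which (possibly name-based) address a seed was
// registered under, so that a dropped connection is unregistered by that same
// key and the discoverer's view can't diverge from PeerCount().
type seedBook struct {
	mu     sync.Mutex
	byConn map[string]string // resolved IP:PORT -> registered address (name)
}

func (b *seedBook) onConnect(registered, resolved string) {
	b.mu.Lock()
	b.byConn[resolved] = registered
	b.mu.Unlock()
}

// onDisconnect returns the key that was originally registered so the caller
// can delist exactly that entry.
func (b *seedBook) onDisconnect(resolved string) string {
	b.mu.Lock()
	defer b.mu.Unlock()
	if reg, ok := b.byConn[resolved]; ok {
		delete(b.byConn, resolved)
		return reg
	}
	return resolved
}
```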
I don't know how to test for an infinite loop that has no side-effects, so no
test added here.
If the node starts with seeds unavailable, it will try connecting to each of
them three times, blacklist them and then sit forever waiting for something.
That's not good behavior; it should always try connecting to seeds if nothing
else works.
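A minimal sketch of that fallback, with made-up names:

```go
// pickAddrs chooses what to dial next: if the pool is empty and we have no
// peers at all, fall back to the seed list even if those seeds failed before,
// instead of sitting forever with everything blacklisted.
func pickAddrs(pool, seeds []string, peerCount, n int) []string {
	if len(pool) == 0 && peerCount == 0 {
		return seeds
	}
	if n > len(pool) {
		n = len(pool)
	}
	return pool[:n]
}
```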
Why a deadlock can occur:
1. (*DefaultDiscovery).run() has a for loop over the requestCh channel.
2. (*DefaultDiscovery).RequestRemote() sends to this channel while holding a
mutex.
3. (*DefaultDiscovery).RegisterBadAddr() tries to take the mutex for writing.
4. The second select case can't take the mutex for reading because of (3).
Keeping run() as the owner of all maps would mean adding at least three more
channels to keep the address getters thread-safe. But then there is also a
race between requestToWork() and run(), which is way harder to solve with
channels because there are lots of possibilities for deadlocks. So rework all
of this with good old mutexes.
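A rough sketch of the resulting shape; the method names follow the ones
mentioned above, but the bodies are illustrative only:

```go
package network

import "sync"

// pool shows the shape of the reworked discoverer state: the maps are guarded
// by a plain RWMutex, getters read them directly, and nothing ever sends on a
// channel while holding the lock, so the circular wait described above can't
// happen.
type pool struct {
	mu          sync.RWMutex
	unconnected map[string]int // address -> remaining connection attempts
	bad         map[string]bool
}

func (p *pool) UnconnectedPeers() []string {
	p.mu.RLock()
	defer p.mu.RUnlock()
	out := make([]string, 0, len(p.unconnected))
	for a := range p.unconnected {
		out = append(out, a)
	}
	return out
}

func (p *pool) RegisterBadAddr(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.unconnected, addr)
	p.bad[addr] = true
}
```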
While at it, fix `requestCh` handling in the inner select of run(): it wastes
one loop iteration to handle it, so we should add one to `requested`.
Fixes #445.
Goreport:
neo-go/pkg/core/contract_state_test.go
Line 21: warning: "Contracto" is a misspelling of "Contraction" (misspell)
Line 64: warning: "Contracto" is a misspelling of "Contraction" (misspell)
neo-go/pkg/core/interop_neo.go
Line 420: warning: "succeedes" is a misspelling of "succeeds" (misspell)
neo-go/pkg/network/discovery.go
Line 118: warning: "succeded" is a misspelling of "succeeded" (misspell)
Line 128: warning: "successfuly" is a misspelling of "successfully" (misspell)
...and don't try to connect to the nodes we're already connected to.
Before this change the discoverer had a problem of throwing away good valid
addresses just because they were already known, which led to pool draining
over time (as address reuse was basically forbidden and getaddr may not return
enough new nodes).
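A hedged sketch of the back-filling behavior, with hypothetical names:

```go
package network

import "sync"

// addrPool back-fills addresses from getaddr replies: addresses we already
// know or are already connected to are skipped for dialing but never thrown
// away, so the pool doesn't drain just because getaddr keeps returning the
// same nodes.
type addrPool struct {
	mu          sync.Mutex
	connected   map[string]bool
	unconnected map[string]bool
}

func (p *addrPool) backFill(addrs ...string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, a := range addrs {
		if p.connected[a] || p.unconnected[a] {
			continue // already known; keep existing entries instead of dropping them
		}
		p.unconnected[a] = true
	}
}
```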
* Adds basic RPC supporting files
* Adds interrupt handling and error chan
* Add getblock RPC method
* Update request structure
* Update names of nodes
* Allow bad addresses to be registered in discovery externally
* Small tidy up
* Few tweaks
* Check if error is close error in tcp transport
* Fix tests
* Fix priv port
* Small tweak to param name
* Comment fix
* Remove version from server
* Moves submitblock to TODO block
* Remove old field
* Bumps version and fix hex issues
* block partial persist
* replaced refactored files with the old ones.
* removed gokit/log from deps
* Tweaks to not overburden remote nodes with getheaders/getblocks
* Changed the Transporter interface to not take the server as an argument, because it caused a race warning from the compiler
* started server test suite
* more test + return errors from message handlers
* removed --race from build
* Little improvements.