Small (especially dockerized/virtualized) networks often start all nodes at
ones and then we see a lot of connection flapping in the log. This happens
because nodes try to connect to each other simultaneously, establish two
connections, then each one finds a duplicate and drops it, but this can be
different duplicate connections on other sides, so they retry and it all
happens for some time. Eventually everything settles, but we have a lot of
garbage in the log and a lot of useless attempts.
This random waiting timeout doesn't change the logic much, adds a minimal
delay, but increases chances for both nodes to establish a proper single
connection on both sides to only then see another one and drop it on both
sides as well. It leads to almost no flapping in small networks, doesn't
affect much bigger ones. The delay is close to unnoticeable especially if
there is something in the DB for node to process during startup.
* treat connected/handshaked peers separately in the discoverer, save
"original" address for connected ones, it can be a name instead of IP and
it's important to keep it to avoid reconnections
* store name->IP mapping for seeds if and when they're connected to avoid
reconnections
* block seed if it's detected to be our own node (which is often the case for
small private networks)
* add an event for handshaked peers in the server, connected but
non-handshaked ones are not really helpful for MinPeers or GetAddr logic
Fixes#2796.
Asynchronous tryAddress() routines may get dial result AFTER the switch to
another test, so we need to ensure that they'll get the result intended for
this particular call. Fixes:
2021-07-07T20:25:40.1624521Z === RUN TestDefaultDiscoverer
2021-07-07T20:25:40.1625316Z discovery_test.go:159: timeout expecting for transport dial; i: 2, j: 1
2021-07-07T20:25:40.1626319Z --- FAIL: TestDefaultDiscoverer (1.19s)
We don't care much about dialing, but the same constant is used in outer
discoverer loop in case no connections are established and we have no
connections established.
1) It duplicates registration in `version` message handler and no valid
connection can work without version exchange.
2) On public networks we have seed nodes defined by names, so we register
connections to them using these names, but then if connection is dropped we
delist them by IP:PORT combinations which can lead to zero PeerCount() with
all seeds still being registered as connected in the discovery subsystem
and thus no reconnection attempts being made.
If the node is to start with seeds unavailable it will try connecting to each
of them three times, blacklist them and then sit forever waiting for
something. It's not a good behavior, it should always try connecting to seeds
if nothing else works.
Keeping run() as the owner of all maps would mean adding at least three more
channels to keep address getters with thread-safety. But then there also is a
race between requestToWork() and run() which is way harder to solve with
channels because there are lots of possibilities for deadlocks. So rework all
of this with good old mutexes.
While at it, fix `requestCh` handling in the inner select of run, it will waste
one loop to handle it, so we should add one to the `requested`.
Fixes#445.