Some tests are failing on Windows due to slow runners, with errors like the following:
```
2022-02-09T17:11:20.3127016Z --- FAIL: TestGetData/transaction (1.82s)
2022-02-09T17:11:20.3127385Z server_test.go:500:
2022-02-09T17:11:20.3127878Z Error Trace: server_test.go:500
2022-02-09T17:11:20.3128533Z server_test.go:520
2022-02-09T17:11:20.3128978Z Error: Condition never satisfied
2022-02-09T17:11:20.3129479Z Test: TestGetData/transaction
```
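For context, "Condition never satisfied" is what testify's require.Eventually reports when its condition doesn't become true within the wait window, so a slow runner simply needs a more generous window. A minimal, illustrative sketch (not the actual TestGetData code) of widening that window:
```
package server_test

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// Illustrative only: simulate a slow runner by delaying the event the test
// waits for and give require.Eventually plenty of time to observe it.
func TestEventuallySlowRunner(t *testing.T) {
	done := make(chan struct{})
	go func() {
		time.Sleep(2 * time.Second) // the "slow runner" part
		close(done)
	}()
	require.Eventually(t, func() bool {
		select {
		case <-done:
			return true
		default:
			return false
		}
	}, 10*time.Second, 100*time.Millisecond) // wide timeout, short poll tick
}
```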
Consensus can require conflicting transactions, and it can require more
transactions than the mempool can fit; all of this should work. Transactions
will be checked anyway using the consensus module's secondary mempool. See the
scenario from #668.
Notice that this makes the node accept Extensible payloads with any category,
which is the way the C# node works. We're trusting Extensible senders:
improper payloads are harmless unless they're used to DoS the network, and we
have some protections against that too (besides, spamming with a proper
category doesn't differ much).
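A rough, self-contained sketch of that fallback, with purely illustrative placeholder types (the real code lives in the consensus service and uses neo-go's mempool):
```
package consensus

// tx and pool are placeholders standing in for the real transaction and
// mempool types; they only exist to show the shape of the lookup.
type tx struct{ hash string }

type pool interface {
	get(hash string) (*tx, bool)
}

// service keeps a secondary, consensus-owned pool so that transactions
// consensus needs, but the main mempool can't hold (conflicts, overflow),
// are still verified and available for block creation.
type service struct {
	mainPool, consensusPool pool
	verify                  func(*tx) error
}

// getVerifiedTx first consults the main mempool (its contents are already
// verified), then falls back to the secondary pool, verifying the
// transaction on that path.
func (s *service) getVerifiedTx(hash string) (*tx, bool) {
	if t, ok := s.mainPool.get(hash); ok {
		return t, true
	}
	if t, ok := s.consensusPool.get(hash); ok && s.verify(t) == nil {
		return t, true
	}
	return nil, false
}
```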
Use a circular buffer, which is a bit more appropriate here. The problem is
that the priority queue accepts and stores equal items, which wastes memory
even in the normal usage scenario, but it's especially dangerous if the node
gets stuck for some reason. In that case it will keep accepting the same
blocks from peers and putting them into the queue again and again, leaking
memory up to an OOM condition.
Notice that the queue length calculation might be wrong in case the circular
buffer wraps, but that's not very likely to happen (usually blocks that don't
come from the queue are added by consensus, and it's not very fast in doing
so).
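A simplified sketch of the idea (capacity and names are illustrative, not the actual queue code): slots are keyed by block height modulo the buffer length, so putting an already-queued block in again just overwrites the same slot instead of growing the structure.
```
package network

const queueCap = 2000 // illustrative capacity

// qBlock is a stand-in for the real block type; only the height matters here.
type qBlock struct {
	Index uint32
}

// blockQueue keeps at most one block per slot, so duplicates received from
// peers cost no extra memory even if the node is stuck.
type blockQueue struct {
	buf [queueCap]*qBlock
}

func (q *blockQueue) put(b *qBlock) {
	q.buf[b.Index%queueCap] = b // a duplicate lands in the same slot
}

func (q *blockQueue) get(height uint32) *qBlock {
	if b := q.buf[height%queueCap]; b != nil && b.Index == height {
		return b
	}
	return nil
}
```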
In this commit:
1. Request unknown MPT nodes from peers. Note that the StateSync module itself
shouldn't be responsible for node requests, that's a server duty.
2. Do not request the same node twice; check whether it is already in storage.
If so, the only thing left is to update the refcounter (see the sketch below).
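A rough illustration of point 2, with placeholder types and names (not the actual StateSync code): look the node up in storage before queueing a request, and if it's already there, only bump the refcounter.
```
package statesync

// nodeStore is a stand-in for the MPT node storage backend.
type nodeStore interface {
	Get(key []byte) ([]byte, error)
}

type module struct {
	store    nodeStore
	refCount map[string]int
}

// trackNode returns true when the node identified by hash still has to be
// requested from peers (which is the server's job), and false when it was
// found in storage, in which case only its reference counter is updated.
func (m *module) trackNode(hash []byte) bool {
	if _, err := m.store.Get(hash); err == nil {
		m.refCount[string(hash)]++
		return false
	}
	return true
}
```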
It requires only two methods from Blockchainer: AddBlock and BlockHeight. The
new interface will allow the block queue to be easily reused for state
exchange purposes.
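A minimal sketch of the reduced dependency (the interface name here is illustrative): the queue only needs these two methods instead of the whole Blockchainer, so a state-exchange module can implement it as well.
```
package network

import "github.com/nspcc-dev/neo-go/pkg/core/block"

// Blockqueuer is the small subset of Blockchainer the block queue needs;
// anything providing these two methods can sit behind the queue.
type Blockqueuer interface {
	AddBlock(block *block.Block) error
	BlockHeight() uint32
}
```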
Refactor the code and be fine with sending to just 2/3 of the proper peers.
Previously that was an edge case, but it can be a normal thing to do as well,
since broadcasting to everyone is obviously too expensive and excessive
(hi, #608).
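A small sketch of the relaxed success criterion (the helper is illustrative): a broadcast is considered delivered once roughly 2/3 of the proper peers have accepted the message, rather than all of them.
```
package network

// enoughPeers returns how many successful sends are required before a
// broadcast counts as done: about 2/3 of the proper peers, but at least
// one whenever any peers exist.
func enoughPeers(total int) int {
	if total == 0 {
		return 0
	}
	n := (total*2 + 2) / 3 // ceil(2*total/3)
	if n < 1 {
		n = 1
	}
	return n
}
```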
Baseline (four nodes, 10 workers):
RPS 8180.760 8137.822 7858.358 7820.011 8051.076 ≈ 8010 ± 2.04%
TPS 7819.831 7521.172 7519.023 7242.965 7426.000 ≈ 7506 ± 2.78%
CPU % 41.983 38.775 40.606 39.375 35.537 ≈ 39.3 ± 6.15%
Mem MB 2947.189 2743.658 2896.688 2813.276 2863.108 ≈ 2853 ± 2.74%
Patched:
RPS 9714.567 9676.102 9358.609 9371.408 9301.372 ≈ 9484 ± 2.05% ↑ 18.40%
TPS 8809.796 8796.854 8534.754 8661.158 8426.162 ≈ 8646 ± 1.92% ↑ 15.19%
CPU % 44.980 45.018 33.640 29.645 43.830 ≈ 39.4 ± 18.41% ↑ 0.25%
Mem MB 2989.078 2976.577 2306.185 2351.929 2910.479 ≈ 2707 ± 12.80% ↓ 5.12%
There is a nuance with this patch, however. While typically it works the way
outlined above, sometimes it works like this:
RPS ≈ 6734.368
TPS ≈ 6299.332
CPU ≈ 25.552%
Mem ≈ 2706.046MB
And that's because the log looks like this:
DeltaTime, TransactionsCount, TPS
5014, 44212, 8817.710
5163, 49690, 9624.249
5166, 49523, 9586.334
5189, 49693, 9576.604
5198, 49339, 9491.920
5147, 49559, 9628.716
5192, 49680, 9568.567
5163, 49750, 9635.871
5183, 49189, 9490.450
5159, 49653, 9624.540
5167, 47945, 9279.079
5179, 2051, 396.022
5015, 4, 0.798
5004, 0, 0.000
5003, 0, 0.000
5003, 0, 0.000
5003, 0, 0.000
5003, 0, 0.000
5004, 0, 0.000
5003, 2925, 584.649
5040, 49099, 9741.865
5161, 49718, 9633.404
5170, 49228, 9521.857
5179, 49773, 9610.543
5167, 47253, 9145.152
5202, 49788, 9570.934
5177, 47704, 9214.603
5209, 46610, 8947.975
5249, 49156, 9364.831
5163, 18284, 3541.352
5072, 174, 34.306
On a network with 4 CNs and 1 RPC node there is a 1/256 probability that a
block won't be broadcast to the RPC node (each CN relays to 3 of its 4 peers,
so a given CN skips the RPC node with probability 1/4, and all four skip it
with probability (1/4)^4 = 1/256), so the RPC node won't see the block until
the ping timeout kicks in. While it doesn't see the block it can't accept new
incoming transactions, so the benchmark basically gets stuck. To me that's an
acceptable trade-off, because normal networks are much larger than that and
the effect of this patch is way more important there, but still, that's what
we have and we need to take it into account.
send() can return errStateMismatch, errGone and errBusy. errGone means the
peer is dead and will never become active again, so it makes no sense to retry
sends to it. errStateMismatch technically means "not yet ready", but we can't
wait for it either, since no one knows how long the handshake will take to
complete. So only errBusy means we can retry.
So keep track of dead peers and adjust the tries counting appropriately. It
doesn't change much: we can't magically get more valid peers, and if some die
while we're iterating, we'll detect that by an error returned from send().
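A compact sketch of that classification (the error values and the helper are illustrative, not the actual server code):
```
package network

import "errors"

// Possible send() outcomes as described above (values illustrative).
var (
	errBusy          = errors.New("peer is busy")
	errGone          = errors.New("peer is gone")
	errStateMismatch = errors.New("mismatched peer state")
)

// classifySend reports whether a failed send is worth retrying and whether
// the peer should be counted as dead when adjusting the tries counter.
func classifySend(err error) (retry, dead bool) {
	switch {
	case err == nil:
		return false, false
	case errors.Is(err, errBusy):
		return true, false // transient, this peer can be tried again
	case errors.Is(err, errGone):
		return false, true // the peer will never become active again
	default:
		return false, false // errStateMismatch or anything else: skip it
	}
}
```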