network: merge two loops in iteratePeersWithSendMsg, send to 2/3

Refactor code and be fine with sending to just 2/3 of proper peers. Previously
it was an edge case, but it can be a normal thing to do also as broadcasting
to everyone is obviously too expensive and excessive (hi, #608).

Baseline (four node, 10 workers):

RPS    8180.760 8137.822 7858.358 7820.011 8051.076 ≈ 8010   ± 2.04%
TPS    7819.831 7521.172 7519.023 7242.965 7426.000 ≈ 7506   ± 2.78%
CPU %    41.983   38.775   40.606   39.375   35.537 ≈   39.3 ± 6.15%
Mem MB 2947.189 2743.658 2896.688 2813.276 2863.108 ≈ 2853   ± 2.74%

Patched:

RPS    9714.567 9676.102 9358.609 9371.408 9301.372 ≈ 9484   ±  2.05% ↑ 18.40%
TPS    8809.796 8796.854 8534.754 8661.158 8426.162 ≈ 8646   ±  1.92% ↑ 15.19%
CPU %    44.980   45.018   33.640   29.645   43.830 ≈   39.4 ± 18.41% ↑  0.25%
Mem MB 2989.078 2976.577 2306.185 2351.929 2910.479 ≈ 2707   ± 12.80% ↓  5.12%

There is a nuance with this patch however. While typically it works the way
outlined above, sometimes it works like this:

RPS ≈ 6734.368
TPS ≈ 6299.332
CPU ≈ 25.552%
Mem ≈ 2706.046MB

And that's because the log looks like this:

DeltaTime, TransactionsCount, TPS
5014, 44212, 8817.710
5163, 49690, 9624.249
5166, 49523, 9586.334
5189, 49693, 9576.604
5198, 49339, 9491.920
5147, 49559, 9628.716
5192, 49680, 9568.567
5163, 49750, 9635.871
5183, 49189, 9490.450
5159, 49653, 9624.540
5167, 47945, 9279.079
5179, 2051, 396.022
5015, 4, 0.798
5004, 0, 0.000
5003, 0, 0.000
5003, 0, 0.000
5003, 0, 0.000
5003, 0, 0.000
5004, 0, 0.000
5003, 2925, 584.649
5040, 49099, 9741.865
5161, 49718, 9633.404
5170, 49228, 9521.857
5179, 49773, 9610.543
5167, 47253, 9145.152
5202, 49788, 9570.934
5177, 47704, 9214.603
5209, 46610, 8947.975
5249, 49156, 9364.831
5163, 18284, 3541.352
5072, 174, 34.306

On a network with 4 CNs and 1 RPC node there is 1/256 probability that a block
won't be broadcasted to RPC node, so it won't see it until ping timeout kicks
in. While it doesn't see a block it can't accept new incoming transactions so
the bench gets stuck basically. To me that's an acceptable trade-off because
normal networks are much larger than that and the effect of this patch is way
more important there, but still that's what we have and we need to take into
account.
This commit is contained in:
Roman Khimov 2021-08-06 15:36:46 +03:00
parent 966a16e80e
commit 7bb82f1f99

View file

@ -1164,44 +1164,29 @@ func (s *Server) iteratePeersWithSendMsg(msg *Message, send func(Peer, bool, []b
// have already sent an Inv to it.
finished := make([]bool, peerN)
for i, peer := range peers {
err := send(peer, false, pkt)
switch err {
case nil:
if msg.Command == CMDGetAddr {
peer.AddGetAddrSent()
// Try non-blocking sends first and only block if have to.
for _, blocking := range []bool{false, true} {
for i, peer := range peers {
// Send to 2/3 of good peers.
if 3*sentN >= 2*(peerN-deadN) {
return
}
sentN++
case errBusy:
continue
default:
deadN++
}
finished[i] = true
}
// Send to at least 2/3 of good peers.
if 3*sentN >= 2*(peerN-deadN) {
return
}
// Perform blocking send now.
for i, peer := range peers {
if finished[i] {
continue
}
if err := send(peer, true, pkt); err != nil {
if err != errBusy {
if finished[i] {
continue
}
err := send(peer, blocking, pkt)
switch err {
case nil:
if msg.Command == CMDGetAddr {
peer.AddGetAddrSent()
}
sentN++
case errBusy: // Can be retried.
continue
default:
deadN++
}
continue
}
if msg.Command == CMDGetAddr {
peer.AddGetAddrSent()
}
sentN++
if 3*sentN >= 2*(peerN-deadN) {
return
finished[i] = true
}
}
}