send() can return errStateMismatch, errGone and errBusy. errGone means the
peer is dead and it won't ever be active again, it doesn't make sense retrying
sends to it. errStateMismatch is technically "not yet ready", but we can't
wait for it either, no one knows how much will it take to complete
handshake. So only errBusy means we can retry.
So keep track of dead peers and adjust tries counting appropriately.
It doesn't change much, we can't magically get more valid peers and if some
die while we're iterating we'd detect that by an error returned from send().
When transaction spreads through the network many nodes are likely to get it
in roughly the same time. They will rebroadcast it also in roughly the same
time. As we have a number of peers it's quite likely that we'd get an Inv with
the same transaction from multiple peers simultaneously. We will ask them for
this transaction (independently!) and again we're likely to get it in roughly
the same time. So we can easily end up with multiple threads processing the
same transaction. Only one will succeed, but we can actually easily avoid
doing it in the first place saving some CPU cycles for other things.
Notice that we can't do it _before_ receiving a transaction because nothing
guarantees that the peer will respond to our transaction request, so
communication overhead is unavoidable at the moment, but saving on processing
already gives quite interesting results.
Baseline, four nodes with 10 workers:
RPS 7176.784 7014.511 6139.663 7191.280 7080.852 ≈ 6921 ± 5.72%
TPS 6945.409 6562.756 5927.050 6681.187 6821.794 ≈ 6588 ± 5.38%
CPU % 44.400 43.842 40.418 49.211 49.370 ≈ 45.4 ± 7.53%
Mem MB 2693.414 2640.602 2472.007 2731.482 2707.879 ≈ 2649 ± 3.53%
Patched:
RPS ≈ 7791.675 7996.559 7834.504 7746.705 7891.614 ≈ 7852 ± 1.10% ↑ 13.45%
TPS ≈ 7241.497 7711.765 7520.211 7425.890 7334.443 ≈ 7447 ± 2.17% ↑ 13.04%
CPU % 29.853 39.936 39.945 36.371 39.999 ≈ 37.2 ± 10.57% ↓ 18.06%
Mem MB 2749.635 2791.609 2828.610 2910.431 2863.344 ≈ 2829 ± 1.97% ↑ 6.80%
Most of the time on healthy network we see new transactions appearing that are
not present in the mempool. Once they get into mempool we don't ask for them
again when some other peer sends an Inv with them. Then these transactions are
usually added into block, removed from mempool and no one actually sends them
again to us. Some stale nodes can do that, but it's not very likely to
happen.
At the receiving end at the same time it's quite expensive to do full chain
HasTransaction() query, so if we can avoid doing that it's always good. Here
it technically allows resending old transaction that will be re-requested and
an attempt to add it to mempool will be made. But it'll inevitably fail
because the same HasTransaction() check is done there too. One can try to
maliciously flood the node with stale transactions but it doesn't differ from
flooding it with any other invalid transactions, so there is no new attack
vector added.
Baseline, 4 nodes with 10 workers:
RPS 6902.296 6465.662 6856.044 6785.515 6157.024 ≈ 6633 ± 4.26%
TPS 6468.431 6218.867 6610.565 6288.596 5790.556 ≈ 6275 ± 4.44%
CPU % 50.231 42.925 49.481 48.396 42.662 ≈ 46.7 ± 7.01%
Mem MB 2856.841 2684.103 2756.195 2733.485 2422.787 ≈ 2691 ± 5.40%
Patched:
RPS 7176.784 7014.511 6139.663 7191.280 7080.852 ≈ 6921 ± 5.72% ↑ 4.34%
TPS 6945.409 6562.756 5927.050 6681.187 6821.794 ≈ 6588 ± 5.38% ↑ 4.99%
CPU % 44.400 43.842 40.418 49.211 49.370 ≈ 45.4 ± 7.53% ↓ 2.78%
Mem MB 2693.414 2640.602 2472.007 2731.482 2707.879 ≈ 2649 ± 3.53% ↓ 1.56%
There was a deadlock while trying to finalize transaction during
PostBlock:
1) (*Notary).PostBlock is called under the blockchain lock
2) (*Notary).onTransaction is called inside the PostBlock
3) (*Notary).onTransaction needs to RLock the blockchain to add
completed transaction to the memory pool (and the blockchain is Lock'ed
by this moment)
The problem is fixed by using notifications subsistem, because it's not
required to call (*Notary).PostBlock under the blockchain lock.
And stop dropping connections if we're to receive them. Proper handling is
subject of #1701, but we need at least some connection-level stability for
now.
Node receiving extensible payload from the future is confused and drops
connection. Note that this can still happen if the node is to loose its
synchrony.
Calling `IsInSync()` is quite expensive, so we stop doing that once synchrony
is reached (hence bool flag).
1. Initialization is performed via `Blockchain` methods.
2. Native Oracle contract updates list of oracle nodes
and in-fly requests in `PostPersist`.
3. RPC uses Oracle module directly.
It could be the case that checks are performed simultaneosly and
peers connections goes down from 2 to 0. We must take such case into
account and register address as good in discovery.
Right now a single slow peer can slow down whole network.
Do broadcast in 2 parts:
1. Perform non-blocking send to all peers if possible.
2. Perform blocking sends until message is sent to 2/3 of good peers.