1. Perform masking with 2 instructions instead of 3 (use arithmetic
shift).
2. Broadcast the data byte in one instruction at the start of byte
processing.
3. Reorder instructions to reduce data hazards and resource contention.
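
For illustration only, the masking idea from item 1 can be sketched in
scalar Go. This is not the AVX2 assembly touched by this commit; the
names maskNaive and maskArith are hypothetical. The sketch assumes the
mask in question expands one bit of the data byte into an all-ones or
all-zeros word: moving the bit into the sign position and shifting
right arithmetically replicates it across the word in two operations,
where shift, AND and negate take three.

```go
package main

// maskNaive expands bit i (0 = least significant) of b into a 64-bit
// mask using three operations: shift right, AND with 1, negate.
// Hypothetical helper, for comparison only.
func maskNaive(b byte, i uint) uint64 {
	return -((uint64(b) >> i) & 1)
}

// maskArith builds the same mask with two operations: shift bit i
// into the sign position, then shift right arithmetically by 63 so
// that the sign bit is copied into every bit of the result.
// Hypothetical helper; i must be in the range 0..7.
func maskArith(b byte, i uint) uint64 {
	return uint64(int64(uint64(b)<<(63-i)) >> 63)
}
```

Both helpers return the same value for every byte and every bit index
from 0 to 7.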
```
name               old time/op    new time/op    delta
Sum/AVX2_digest-8  1.39ms ± 0%    1.22ms ± 0%    -12.18%  (p=0.000 n=9+7)

name               old speed      new speed      delta
Sum/AVX2_digest-8  71.7MB/s ± 0%  81.7MB/s ± 0%  +13.87%  (p=0.000 n=9+7)
```
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>