This helps to get rid of a MOV and of generating constants on every iteration.
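For illustration, one way to avoid both the MOV and the per-iteration constant setup is to synthesize the constant in a register before the loop; this is a minimal Go-assembly sketch, and the registers and the 0x0001-per-lane value are illustrative assumptions, not the file's actual sequence:
```
// Sketch only: build the constant once, with no memory operand,
// and keep it live in registers across loop iterations.
VPCMPEQD Y15, Y15, Y15   // Y15 = all ones, generated in-register
VPSRLW   $15, Y15, Y14   // Y14 = 0x0001 in every 16-bit lane
```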
```
name old time/op new time/op delta
Sum/AVX2Inline_digest-8 1.57ms ± 2% 1.41ms ± 0% -10.52% (p=0.000 n=9+9)
name old speed new speed delta
Sum/AVX2Inline_digest-8 63.6MB/s ± 1% 71.1MB/s ± 0% +11.76% (p=0.000 n=9+9)
```
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
1. Do the same mask trick as with AVX2.
2. Get rid of the load; generate the constant on the fly.
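The same in-register constant generation works for the 128-bit AVX path; a hedged sketch with illustrative registers, not the file's actual code:
```
// Sketch only: 128-bit analogue of the AVX2 trick, no load required.
VPCMPEQD X3, X3, X3    // X3 = all ones, generated in-register
VPSRLW   $15, X3, X2   // X2 = 0x0001 in every 16-bit lane
```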
```
name old time/op new time/op delta
Sum/AVXInline_digest-8 2.26ms ± 4% 2.17ms ± 5% -4.05% (p=0.000 n=19+17)
name old speed new speed delta
Sum/AVXInline_digest-8 44.3MB/s ± 4% 46.2MB/s ± 5% +4.25% (p=0.000 n=19+17)
```
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
We were using 6 instructions just to calculate a mask from a single bit value.
Now only 3 are used, and multiple masks are calculated in parallel.
Also, `VPSUB*` is faster than `VPBROADCAST*`,
see https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html .
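As an illustration of the 3-instruction pattern, here is a hedged Go-assembly sketch; the register numbers, lane width, and bit position are assumptions, not the repository's exact code:
```
// Sketch only: turn one message bit into a full-lane mask, for all
// 16-bit lanes at once. Assumes Y14 holds 0x0001 per lane and Y13 is zero.
VPSRLW $7, Y1, Y2    // move the bit of interest to bit 0 of each lane
VPAND  Y14, Y2, Y2   // isolate that bit: each lane is now 0 or 1
VPSUBW Y2, Y13, Y2   // 0 - bit: 0 -> 0x0000, 1 -> 0xFFFF
```
Because the subtraction already yields the mask in every lane, no `VPBROADCAST*` is required.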
```
name old time/op new time/op delta
Sum/AVX2Inline_digest-8 1.83ms ± 0% 1.62ms ± 1% -11.23% (p=0.000 n=46+42)
name old speed new speed delta
Sum/AVX2Inline_digest-8 54.7MB/s ± 0% 61.6MB/s ± 1% +12.65% (p=0.000 n=46+42)
```
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Right now the AVX2 implementation loses to the C binding in speed.
This is probably because of 2 things:
1. Go does not inline `mulBitRightx2` in the loop body.
2. `minmax` is loaded from memory every time.
In this PR:
1. Unroll `mulBitRightx2` manually and use `mulByteRightx2` instead (see the sketch after this list).
2. Generate `minmax` in place without `LOAD/LEA` instructions.
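For context, here is a hedged Go-level sketch of what the unrolling changes at the call site; the names, signatures, and state layout below are assumptions rather than the library's API (the real routines are assembly, which the Go compiler never inlines):
```
package tzsketch

// Stand-ins for the assembly routines, only so the sketch compiles.
func mulBitRightx2(state *[8]uint64, bit byte) {}
func mulByteRightx2(state *[8]uint64, b byte)  {}

// Before: 8 non-inlinable calls per input byte.
func digestBitwise(state *[8]uint64, data []byte) {
	for _, b := range data {
		for i := 7; i >= 0; i-- {
			mulBitRightx2(state, (b>>uint(i))&1)
		}
	}
}

// After: one call per input byte; the per-bit work is unrolled inside
// the assembly implementation of mulByteRightx2.
func digestBytewise(state *[8]uint64, data []byte) {
	for _, b := range data {
		mulByteRightx2(state, b)
	}
}
```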