Commit graph

7 commits

Author SHA1 Message Date
Evgenii Stratonikov
73d978c31e Rewrite AVX2 loop in assembly
Helps to get rid of MOV instructions and of generating constants on each iteration.

```
name                     old time/op    new time/op    delta
Sum/AVX2Inline_digest-8    1.57ms ± 2%    1.41ms ± 0%  -10.52%  (p=0.000 n=9+9)

name                     old speed      new speed      delta
Sum/AVX2Inline_digest-8  63.6MB/s ± 1%  71.1MB/s ± 0%  +11.76%  (p=0.000 n=9+9)
```

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2022-01-17 17:18:36 +03:00
Evgenii Stratonikov
d7c96f5d2e Fix comments
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2022-01-17 17:18:36 +03:00
Evgenii Stratonikov
8dd24d0195 Interleave carry registers for successive bits
8 fewer instructions per byte.

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2022-01-17 17:18:36 +03:00
Evgenii Stratonikov
d4cb61e470 Replace two shifts with a single AND
We need to isolate the HSB in every quad-word; this can be done with a
simple mask.

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2022-01-17 17:18:36 +03:00
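
Note on d4cb61e470: the commit does not show the replaced instructions, so the following is only a scalar Go sketch of the idea, assuming that HSB means the highest significant bit of each 64-bit word and that the two shifts were a right shift followed by a left shift by 63. The names `hsbMask`, `isolateHSBShifts` and `isolateHSBAnd` are hypothetical and do not come from the repository.

```go
package main

import "fmt"

// hsbMask has only the highest significant bit of a 64-bit word set.
// Hypothetical name, used for illustration only.
const hsbMask uint64 = 1 << 63

// isolateHSBShifts keeps only the top bit by shifting it down and back up.
// Assumption: this mirrors the "two shifts" variant the commit replaced.
func isolateHSBShifts(x uint64) uint64 {
	return (x >> 63) << 63
}

// isolateHSBAnd keeps only the top bit with a single AND against a constant.
func isolateHSBAnd(x uint64) uint64 {
	return x & hsbMask
}

func main() {
	inputs := []uint64{0, 1, hsbMask, hsbMask | 0xff, ^uint64(0)}
	for _, x := range inputs {
		// Both columns should always agree.
		fmt.Printf("%016x -> %016x %016x\n", x, isolateHSBShifts(x), isolateHSBAnd(x))
	}
}
```

In a vectorized loop the mask constant can be kept in a register across iterations, so the AND variant costs one instruction per word instead of two.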
Evgenii Stratonikov
bbbcf3fa5c Use unaligned move in AVX2 implementation
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 13:23:05 +03:00
Evgenii Stratonikov
c8a32b25ec Optimize AVX2 implementation
Previously, 6 instructions were used just to calculate a mask from a single bit value.
Now only 3 are used, and multiple masks are calculated in parallel.

Also, `VPSUB*` is faster than `VPBROADCAST*`;
see https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html.

```
name                     old time/op    new time/op    delta
Sum/AVX2Inline_digest-8    1.83ms ± 0%    1.62ms ± 1%  -11.23%  (p=0.000 n=46+42)

name                     old speed      new speed      delta
Sum/AVX2Inline_digest-8  54.7MB/s ± 0%  61.6MB/s ± 1%  +12.65%  (p=0.000 n=46+42)
```

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 13:23:05 +03:00
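
Note on c8a32b25ec: the assembly itself is not shown here, so this is only a scalar Go sketch of how a subtraction can turn a single bit into a full-width mask, assuming that is the role `VPSUB*` plays in the commit. The helpers `maskFromBit` and `masksFromByte` are hypothetical names, not functions from the repository.

```go
package main

import "fmt"

// maskFromBit expands the lowest bit of b into a full 64-bit mask:
// bit 0 -> all zeros, bit 1 -> all ones.
// It relies on unsigned wrap-around: 0 - 1 is a word with every bit set.
func maskFromBit(b uint64) uint64 {
	return 0 - (b & 1)
}

// masksFromByte builds one mask per bit of c, most significant bit first.
// Assumption: this only illustrates "multiple masks" in scalar code;
// the AVX2 version would compute several lanes with one instruction.
func masksFromByte(c byte) [8]uint64 {
	var m [8]uint64
	for i := range m {
		m[i] = maskFromBit(uint64(c >> (7 - i)))
	}
	return m
}

func main() {
	for i, m := range masksFromByte(0xA5) {
		fmt.Printf("bit %d: %016x\n", 7-i, m)
	}
}
```

A mask built this way can be ANDed with a value to select it conditionally without a branch.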
Evgenii Stratonikov
4b7f39cd1d Move mulBitRightx2 to avx2 assembly file
2019-10-16 15:11:57 +03:00
Renamed from tz/avx2_inline_amd64.s