tzhash/tz
Evgenii Stratonikov c8a32b25ec Optimize AVX2 implementation
Previously, 6 instructions were needed just to calculate a mask from a single bit value.
Now only 3 are used, and multiple masks are calculated in parallel.

Also, `VPSUB*` is faster than `VPBROADCAST*`;
see https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html.

```
name                     old time/op    new time/op    delta
Sum/AVX2Inline_digest-8    1.83ms ± 0%    1.62ms ± 1%  -11.23%  (p=0.000 n=46+42)

name                     old speed      new speed      delta
Sum/AVX2Inline_digest-8  54.7MB/s ± 0%  61.6MB/s ± 1%  +12.65%  (p=0.000 n=46+42)
```

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 13:23:05 +03:00
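
The mask trick mentioned in the commit message above can be illustrated in plain Go. This is only a scalar sketch of the general idea (subtracting a bit from zero yields an all-zeros or all-ones mask), not the assembly from `avx2_amd64.s`; the names `maskFromBit`, `condXor` and `masksFromByte` are hypothetical and do not come from the repository.

```go
package main

import "fmt"

// maskFromBit expands the lowest bit of b into a full 64-bit mask:
// 0 -> 0x0000000000000000, 1 -> 0xffffffffffffffff.
// Subtracting the bit from zero relies on unsigned wrap-around.
// (Hypothetical helper, for illustration only.)
func maskFromBit(b uint64) uint64 {
	return 0 - (b & 1)
}

// condXor returns acc^v when bit is 1 and acc when bit is 0,
// without a branch on the bit value.
func condXor(acc, v, bit uint64) uint64 {
	return acc ^ (v & maskFromBit(bit))
}

// masksFromByte computes one mask per bit of x, most significant bit first.
// A vectorized version could compute several such masks at once;
// here they are computed one by one for clarity.
func masksFromByte(x byte) [8]uint64 {
	var m [8]uint64
	for i := range m {
		m[i] = maskFromBit(uint64(x >> (7 - i)))
	}
	return m
}

func main() {
	fmt.Printf("%#016x\n", maskFromBit(0))
	fmt.Printf("%#016x\n", maskFromBit(1))
	fmt.Printf("%#x\n", condXor(0xf0, 0x0f, 1))
	fmt.Println(masksFromByte(0xa0))
}
```

The subtraction form is used here because it maps to a single arithmetic operation per mask; whether this matches the exact instruction sequence chosen in the commit is an assumption.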
avx.go Alias gf127.GF127 2019-10-15 13:22:36 +03:00
avx2.go Alias gf127.GF127 2019-10-15 13:22:36 +03:00
avx2_amd64.s Optimize AVX2 implementation 2021-12-29 13:23:05 +03:00
avx2_inline.go Alias gf127.GF127 2019-10-15 13:22:36 +03:00
avx_amd64.s Replace all SSE instructions with AVX ones 2021-12-29 13:23:05 +03:00
avx_inline.go Add AVX implementation with inlined multiplication 2019-10-16 15:11:53 +03:00
hash.go Use golang.org/x/sys instead of self-implemented detector 2020-01-16 11:30:46 +03:00
hash_test.go Update benchmark result in README.md 2019-10-16 15:11:57 +03:00
pure.go Alias gf127.GF127 2019-10-15 13:22:36 +03:00
sl2.go Remove non-AVX parts from avx package 2019-10-15 13:22:36 +03:00
sl2_test.go Remove non-AVX parts from avx package 2019-10-15 13:22:36 +03:00