Commit graph

37 commits

Author SHA1 Message Date
Evgenii Stratonikov
921f8b0579 Optimize AVX implementation
1. Do the same mask trick as with AVX2.
2. Get rid of load, generate constant on the fly.

```
name                    old time/op    new time/op    delta
Sum/AVXInline_digest-8    2.26ms ± 4%    2.17ms ± 5%  -4.05%  (p=0.000 n=19+17)

name                    old speed      new speed      delta
Sum/AVXInline_digest-8  44.3MB/s ± 4%  46.2MB/s ± 5%  +4.25%  (p=0.000 n=19+17)
```

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2022-01-17 17:18:36 +03:00
Evgenii Stratonikov
d4cb61e470 Replace two shifts with a single AND
We need to isolate HSB in every quad-word, this can be done with a
simple mask.

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2022-01-17 17:18:36 +03:00
Evgenii Stratonikov
a7201418ab Fix linter issues
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 13:23:05 +03:00
Evgenii Stratonikov
bbbcf3fa5c Use unaligned move in AVX2 implementation
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 13:23:05 +03:00
Evgenii Stratonikov
c8a32b25ec Optimize AVX2 implementation
We use 6 instructions only to calculate mask based on single bit value.
Use only 3 now and calculate multiple masks in parallel.

Also `VPSUB*` is faster than VPBROADCAST*,
see https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html .

```
name                     old time/op    new time/op    delta
Sum/AVX2Inline_digest-8    1.83ms ± 0%    1.62ms ± 1%  -11.23%  (p=0.000 n=46+42)

name                     old speed      new speed      delta
Sum/AVX2Inline_digest-8  54.7MB/s ± 0%  61.6MB/s ± 1%  +12.65%  (p=0.000 n=46+42)
```

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 13:23:05 +03:00
Evgenii Stratonikov
a370c525ba Replace all SSE instructions with AVX ones
Also use integer MOV* variant instead of floating-point one.

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 13:23:05 +03:00
Evgeniy Kulikov
77b7d87549
Use golang.org/x/sys instead of self-implemented detector 2020-01-16 11:30:46 +03:00
Evgenii Stratonikov
a8357fda0e Change default AVX implementation 2019-10-16 15:11:57 +03:00
Evgenii Stratonikov
5f74bbc979 Update benchmark result in README.md
Also simplify test's and benchmark's names.
2019-10-16 15:11:57 +03:00
Evgenii Stratonikov
4b7f39cd1d Move mulBitRightx2 to avx2 assembly file 2019-10-16 15:11:57 +03:00
Evgenii Stratonikov
3191f1b3fd Add AVX implementation with inlined multiplication
Perform multiplication by-byte instead of by-bit as
in AVX2Inline implementation.
2019-10-16 15:11:53 +03:00
Evgenii Stratonikov
63834fe8c1 Remove non-AVX parts from avx package
Remove Inv(), Mul1(), And() because right now
they have no AVX optimizations.
2019-10-15 13:22:36 +03:00
Evgenii Stratonikov
0f8b498b58 Alias gf127.GF127 2019-10-15 13:22:36 +03:00
Evgenii Stratonikov
d891a9c591 Restructure code layout
Provide default implementations in gf127 package and
all optimizations in subpackages. This way it will be easier
to use from a client.
2019-10-15 13:22:31 +03:00
Evgenii Stratonikov
1d4e7550fc Use macros in AVX hash implementation 2019-10-10 11:29:40 +03:00
Evgenii Stratonikov
f296adb043 Remove usage of unsafe 2019-10-10 11:04:15 +03:00
Evgenii Stratonikov
782ed7554b Use macros in asm code 2019-10-09 18:11:53 +03:00
Evgenii Stratonikov
fc059cac87 Use AVX2 only if AVX is also present 2019-10-09 18:03:39 +03:00
Evgenii Stratonikov
648b1deca7 Move cpuid facility to separate package 2019-10-09 18:03:35 +03:00
Evgenii Stratonikov
f613ab2c25 Implement matrix multiplication with pure Go
Set suitable backend for GF127 arithmetic for Concat(), Sum() etc.
2019-10-09 12:31:47 +03:00
Evgenii Stratonikov
38df9b2c63 Detect CPU features in Sum() 2019-10-04 17:58:42 +03:00
Evgenii
63e8eeac86 Determine available features through CPUID 2019-09-04 11:47:44 +03:00
Evgenii
7c12188650 Perform allocation outside of mulBitRightPure 2019-07-19 19:04:44 +03:00
Evgenii
6c75cc0871 Add pure Go hash implementation 2019-07-19 18:59:43 +03:00
Evgenii
c3cfe63e64 Add possibility to use different implementations in cli
Also make API smaller and more consistent and fix typos in documentation.
2019-07-19 18:24:30 +03:00
Evgenii
c68e38b943 Inline asm function in loop for AVX2 implementation
Right now AVX2 implementation looses to C binding in speed.
This is probably, because of 2 things:
1. Go does not inline `mulBitRightx2` in loop iteration.
2. `minmax` is loaded every time from memory.

In this PR:
1. Unroll `mulBitRightx2` manually and use `mulByteRightx2` instead.
2. Generate `minmax` in place without `LOAD/LEA` instructions.
2019-07-19 16:11:06 +03:00
Evgenii
bd43de6056 Report benchmark results in MB/s 2019-07-10 12:07:54 +03:00
Evgenii
ad8c7bce1b Fix type assertions 2019-06-24 10:07:16 +03:00
Evgenii
e1d9fc8058 Use testify in tests 2019-06-21 23:18:16 +03:00
Evgenii
4b11f50264 Fix error in AVX2 implementation 2019-06-21 23:10:08 +03:00
Evgenii
eaeceead2f Add benchmarks 2019-06-21 22:40:17 +03:00
Evgenii
9485f49f3b Get rid of unsafe usage and add tests 2019-06-21 22:32:32 +03:00
Evgenii
a967cc9d3d Make use of AVX2 in Sum() by default 2019-06-21 18:47:01 +03:00
Evgeniy Kulikov
6b644651fa
Rewrite tests (#3)
- rewrite tests
- remove gomega from deps
2019-05-29 14:10:17 +03:00
Evgenii
d5efd8bdce add SubtractR/L operation on hashes
- add Inverse operation to sl2
- fix a bug in xN()
2019-01-29 16:11:50 +03:00
Evgeniy Kulikov
42499b9eb0
Fix formatting 2019-01-03 11:04:43 +03:00
Evgeniy Kulikov
5cf44c62ac
Initial 2018-12-29 16:04:17 +03:00