Commit graph

68 commits

Author SHA1 Message Date
Evgenii Stratonikov
921f8b0579 Optimize AVX implementation
1. Do the same mask trick as with AVX2.
2. Get rid of load, generate constant on the fly.

```
name                    old time/op    new time/op    delta
Sum/AVXInline_digest-8    2.26ms ± 4%    2.17ms ± 5%  -4.05%  (p=0.000 n=19+17)

name                    old speed      new speed      delta
Sum/AVXInline_digest-8  44.3MB/s ± 4%  46.2MB/s ± 5%  +4.25%  (p=0.000 n=19+17)
```

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2022-01-17 17:18:36 +03:00
Evgenii Stratonikov
d4cb61e470 Replace two shifts with a single AND
We need to isolate the highest significant bit (HSB) in every quad-word;
this can be done with a simple mask.

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2022-01-17 17:18:36 +03:00
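The single-AND replacement described in d4cb61e470 can be illustrated in scalar Go. This is only a sketch: the commit does not show which two shifts were replaced, so the shift pair below and the names `hsbMask`, `hsbByShifts` and `hsbByAnd` are assumptions made for illustration, applied to one 64-bit word rather than a vector register.

```go
package main

import "fmt"

// hsbMask has only the highest bit of a 64-bit word set.
const hsbMask = uint64(1) << 63

// hsbByShifts isolates the top bit with a right shift followed by a left shift
// (an assumed form of the "two shifts" variant).
func hsbByShifts(x uint64) uint64 {
	return (x >> 63) << 63
}

// hsbByAnd isolates the same bit with a single AND against a constant mask.
func hsbByAnd(x uint64) uint64 {
	return x & hsbMask
}

func main() {
	for _, x := range []uint64{0, 1, hsbMask, hsbMask | 0x1234, ^uint64(0)} {
		fmt.Printf("%016x: shifts=%016x and=%016x\n", x, hsbByShifts(x), hsbByAnd(x))
	}
}
```

Both helpers return the same value for every input; the AND form needs one operation and a constant instead of two dependent shifts.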
Evgenii Stratonikov
a7201418ab Fix linter issues
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 13:23:05 +03:00
Evgenii Stratonikov
2e922115d8 Replace CircleCI with GitHub Actions
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 13:23:05 +03:00
Evgenii Stratonikov
bbbcf3fa5c Use unaligned move in AVX2 implementation
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 13:23:05 +03:00
Evgenii Stratonikov
c8a32b25ec Optimize AVX2 implementation
We used 6 instructions just to calculate a mask from a single bit value.
Use only 3 now and calculate multiple masks in parallel.

Also, `VPSUB*` is faster than `VPBROADCAST*`;
see https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html .

```
name                     old time/op    new time/op    delta
Sum/AVX2Inline_digest-8    1.83ms ± 0%    1.62ms ± 1%  -11.23%  (p=0.000 n=46+42)

name                     old speed      new speed      delta
Sum/AVX2Inline_digest-8  54.7MB/s ± 0%  61.6MB/s ± 1%  +12.65%  (p=0.000 n=46+42)
```

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 13:23:05 +03:00
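The idea of deriving a full-width mask from a single bit by subtraction, mentioned in c8a32b25ec, can be sketched in scalar Go. The actual assembly is not shown in the commit message, so this is an assumed analogue: `masksFromBits` is a hypothetical helper that treats four array elements as the four 64-bit lanes of a 256-bit register and computes `0 - bit` in each lane.

```go
package main

import "fmt"

// masksFromBits expands bit i of bits (i = 0..3) into a full 64-bit lane:
// all ones when the bit is set, all zeros otherwise.
func masksFromBits(bits uint8) [4]uint64 {
	var m [4]uint64
	for i := range m {
		bit := uint64(bits>>uint(i)) & 1
		m[i] = 0 - bit // unsigned subtraction wraps: 0 - 1 sets every bit
	}
	return m
}

func main() {
	for i, m := range masksFromBits(0b0101) {
		fmt.Printf("lane %d: %016x\n", i, m)
	}
}
```

A mask built this way can be ANDed with a value to select it conditionally without a branch, which is how such masks are typically consumed.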
Evgenii Stratonikov
a370c525ba Replace all SSE instructions with AVX ones
Also use integer MOV* variant instead of floating-point one.

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 13:23:05 +03:00
Evgenii Stratonikov
9b3f45993f gf127: remove branch in pure Go operations
```
name                 old time/op    new time/op    delta
Sum/PureGo_digest-8    16.1ms ± 3%    10.4ms ± 3%  -35.53%  (p=0.000 n=10+10)

name                 old speed      new speed      delta
Sum/PureGo_digest-8  6.22MB/s ± 3%  9.65MB/s ± 3%  +55.12%  (p=0.000 n=10+10)
```

Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
2021-12-29 11:00:27 +03:00
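The commit message for 9b3f45993f does not say which branch was removed, so the following is only a sketch of the general technique, not the repository's code. It assumes the field modulus is x^127 + x^63 + 1 and uses a hypothetical two-limb type `gf127`; multiplication by x is shown once with a conditional reduction and once with a mask obtained from an arithmetic shift.

```go
package main

import "fmt"

// gf127 is a hypothetical layout: limb 0 holds bits 0..63, limb 1 holds bits 64..126.
type gf127 [2]uint64

// reduction holds x^63 + 1, assuming the modulus x^127 + x^63 + 1.
const reduction = uint64(1)<<63 | 1

// mulXBranch multiplies by x and reduces with an explicit branch.
func mulXBranch(a gf127) gf127 {
	carry := a[0] >> 63
	lo, hi := a[0]<<1, a[1]<<1|carry
	if hi>>63 == 1 { // bit 127 is set, so x^127 must be replaced by x^63 + 1
		hi &^= 1 << 63
		lo ^= reduction
	}
	return gf127{lo, hi}
}

// mulXBranchless does the same, turning bit 127 into an all-ones or all-zeros mask.
func mulXBranchless(a gf127) gf127 {
	carry := a[0] >> 63
	lo, hi := a[0]<<1, a[1]<<1|carry
	mask := uint64(int64(hi) >> 63) // arithmetic shift replicates the top bit
	lo ^= mask & reduction          // applies the reduction only when mask is all ones
	hi &^= 1 << 63                  // clearing bit 127 is a no-op when it is already zero
	return gf127{lo, hi}
}

func main() {
	a := gf127{0, 1 << 62} // x^126
	fmt.Println(mulXBranch(a) == mulXBranchless(a))
}
```

The branchless form always executes the same instructions, so its running time does not depend on how predictable the top bit is.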
fyrchik
33bf778066
Merge pull request #20 from nspcc-dev/use-x-sys-cpu-instead-of-self-implemented
Use golang.org/x/sys instead of self-implemented detector
2020-01-16 11:32:58 +03:00
Evgeniy Kulikov
77b7d87549
Use golang.org/x/sys instead of self-implemented detector 2020-01-16 11:30:46 +03:00
Evgeniy Kulikov
d4b45131cd
Update alpine image, fixup for Makefile, fixup for benchmark 2020-01-16 11:30:46 +03:00
Evgeniy Kulikov
9789dcb2b6
Ignore vendor and binary 2020-01-16 11:30:45 +03:00
fyrchik
3d96a71c03
Merge pull request #19 from nspcc-dev/feat/avx_inline
Speed up AVX implementation
2019-10-17 17:53:41 +03:00
Evgenii Stratonikov
a8357fda0e Change default AVX implementation 2019-10-16 15:11:57 +03:00
Evgenii Stratonikov
5f74bbc979 Update benchmark result in README.md
Also simplify test and benchmark names.
2019-10-16 15:11:57 +03:00
Evgenii Stratonikov
4b7f39cd1d Move mulBitRightx2 to avx2 assembly file 2019-10-16 15:11:57 +03:00
Evgenii Stratonikov
3191f1b3fd Add AVX implementation with inlined multiplication
Perform multiplication byte by byte, as in the AVX2Inline
implementation, instead of bit by bit.
2019-10-16 15:11:53 +03:00
fyrchik
702d2553ba
Merge pull request #18 from nspcc-dev/feat/interface
Restructure code layout in gf127/
2019-10-15 14:19:13 +03:00
Evgenii Stratonikov
63834fe8c1 Remove non-AVX parts from avx package
Remove Inv(), Mul1(), And() because right now
they have no AVX optimizations.
2019-10-15 13:22:36 +03:00
Evgenii Stratonikov
0f8b498b58 Alias gf127.GF127 2019-10-15 13:22:36 +03:00
Evgenii Stratonikov
d891a9c591 Restructure code layout
Provide default implementations in gf127 package and
all optimizations in subpackages. This way it will be easier
to use from a client.
2019-10-15 13:22:31 +03:00
Evgenii Stratonikov
c5fb08aece Speed up gogf127.Mul()
Cache results of the shift. Also add a test checking that the
implementation works when the result is one of the arguments.
2019-10-11 11:50:33 +03:00
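The two points in c5fb08aece (keeping the shifted operand in a local, and tolerating a result pointer that aliases an input) can be sketched with a much simpler operation than the real `gogf127.Mul()`. The helper `clmul32` below is hypothetical: it performs a 32-bit carry-less multiplication and is meant only to show the pattern, not the library's implementation.

```go
package main

import "fmt"

// clmul32 is a hypothetical helper: it multiplies the low 32 bits of *a and *b
// as polynomials over GF(2) and stores the product in *c.
// Operands are copied to locals first, so c may alias a or b.
func clmul32(a, b, c *uint64) {
	x := *a & 0xffffffff // running value of a << i, kept instead of recomputed
	y := *b & 0xffffffff
	var r uint64
	for y != 0 {
		r ^= x & (0 - (y & 1)) // add x when the current bit of y is set
		x <<= 1
		y >>= 1
	}
	*c = r
}

func main() {
	a, b := uint64(0b1011), uint64(0b110)
	var c uint64
	clmul32(&a, &b, &c)

	aliased := a
	clmul32(&aliased, &b, &aliased) // result overwrites the first argument
	fmt.Println(c == aliased)
}
```

A version that wrote partial results straight into `*c` would corrupt `*a` in the aliased call; accumulating in `r` and storing once at the end avoids that.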
fyrchik
b27c17ce19
Merge pull request #17 from nspcc-dev/fix/refactoring
Remove `unsafe` from code
2019-10-10 12:48:58 +03:00
Evgenii Stratonikov
1d4e7550fc Use macros in AVX hash implementation 2019-10-10 11:29:40 +03:00
Evgenii Stratonikov
f296adb043 Remove usage of unsafe 2019-10-10 11:04:15 +03:00
fyrchik
5142f695cf
Merge pull request #16 from nspcc-dev/feat/cpuid
Move cpu id to a separate package
2019-10-09 18:18:41 +03:00
Evgenii Stratonikov
782ed7554b Use macros in asm code 2019-10-09 18:11:53 +03:00
Evgenii Stratonikov
43033eedb1 Provide minimum go version in go.mod 2019-10-09 18:06:26 +03:00
Evgenii Stratonikov
fc059cac87 Use AVX2 only if AVX is also present 2019-10-09 18:03:39 +03:00
Evgenii Stratonikov
648b1deca7 Move cpuid facility to separate package 2019-10-09 18:03:35 +03:00
fyrchik
2470efda43
Merge pull request #15 from nspcc-dev/fix/cpu_features
Implement matrix multiplication with pure Go
2019-10-09 17:42:04 +03:00
Evgenii Stratonikov
f613ab2c25 Implement matrix multiplication with pure Go
Set suitable backend for GF127 arithmetic for Concat(), Sum() etc.
2019-10-09 12:31:47 +03:00
Evgeniy Kulikov
06362477ed
Merge pull request #13 from nspcc-dev/fix/cpuid
Detect CPU features in Sum()
2019-10-04 18:00:24 +03:00
Evgenii Stratonikov
38df9b2c63 Detect CPU features in Sum() 2019-10-04 17:58:42 +03:00
fyrchik
083d0ff054
Merge pull request #12 from nspcc-dev/feat/cpu_features
Determine available features through CPUID
2019-09-04 12:01:42 +03:00
Evgenii
63e8eeac86 Determine available features through CPUID 2019-09-04 11:47:44 +03:00
fyrchik
16d4da0a1d
Merge pull request #11 from nspcc-dev/feature/pure_go
Implement hashing in pure go
2019-09-04 10:52:02 +03:00
Evgenii
7c12188650 Perform allocation outside of mulBitRightPure 2019-07-19 19:04:44 +03:00
Evgenii
6c75cc0871 Add pure Go hash implementation 2019-07-19 18:59:43 +03:00
fyrchik
33f1403c28
Merge pull request #10 from nspcc-dev/feature/API_refactor
Add possibility to use different implementations in cli
2019-07-19 18:26:26 +03:00
Evgenii
c3cfe63e64 Add possibility to use different implementations in cli
Also make API smaller and more consistent and fix typos in documentation.
2019-07-19 18:24:30 +03:00
fyrchik
826ed77561
Merge pull request #9 from nspcc-dev/feature/AVX2_inline
Inline asm function in loop for AVX2 implementation
2019-07-19 17:54:25 +03:00
Evgenii
c68e38b943 Inline asm function in loop for AVX2 implementation
Right now the AVX2 implementation loses to the C binding in speed.
This is probably because of two things:
1. Go does not inline `mulBitRightx2` in loop iteration.
2. `minmax` is loaded every time from memory.

In this PR:
1. Unroll `mulBitRightx2` manually and use `mulByteRightx2` instead.
2. Generate `minmax` in place without `LOAD/LEA` instructions.
2019-07-19 16:11:06 +03:00
fyrchik
dd15c90530
Merge pull request #8 from nspcc-dev/pureGo
Add pure-go GF(2^127) implementation
2019-07-19 12:06:24 +03:00
Evgenii
5c2544cf3b Add pure-go GF(2^127) implementation 2019-07-19 12:04:16 +03:00
fyrchik
5c06a9fa8f
Merge pull request #7 from nspcc-dev/feat/mbpers
Report benchmark results in MB/s
2019-07-10 13:15:00 +03:00
Evgenii
bd43de6056 Report benchmark results in MB/s 2019-07-10 12:07:54 +03:00
fyrchik
62a3dafe71
Merge pull request #6 from nspcc-dev/fix/tests
Use testify/require for testing
2019-06-24 11:58:28 +03:00
Evgenii
9a258e8741 Add test for marshalling/unmarshalling 2019-06-24 11:02:42 +03:00
Evgenii
d9e26aa6de Use testify/require for testing 2019-06-24 10:56:15 +03:00