Determine whether an object is compressible #754

Closed
opened 2023-10-25 12:04:13 +00:00 by fyrchik · 4 comments

Currently we use the [Content-Type attribute](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/4239f1e81771affc467ff93376f5d94cf85512be/pkg/local_object_storage/blobstor/compression/compress.go#L43) and check whether the compressed slice is [smaller than the original](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/4239f1e81771affc467ff93376f5d94cf85512be/pkg/local_object_storage/blobstor/compression/compress.go#L87) (thus reducing future read overhead).

We do this because zstd does not have any "encodedLen"-like function.
However, this package may be useful: https://github.com/klauspost/compress/blob/master/compressible.go#L10

Let's benchmark and compare.
This check should be a parameter of the "compressor", and could perhaps be exposed in the config. The defaults will depend on the benchmark results.
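
To make the proposal concrete, here is a minimal sketch of a compressor that takes the compressibility check as a parameter and still falls back to the size comparison. The names `Compressor` and `IsCompressible` are hypothetical, not taken from frostfs-node, and the standard library's `compress/flate` stands in for zstd only so that the example needs no external modules.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// Compressor is a hypothetical type used only for this sketch.
type Compressor struct {
	// IsCompressible is an optional pre-check. When nil, compression is
	// always attempted and only the size comparison decides.
	IsCompressible func(data []byte) bool
}

// Compress returns the compressed payload and true when compression was
// attempted and the result is strictly smaller than the input. Otherwise it
// returns the input unchanged and false.
func (c Compressor) Compress(data []byte) ([]byte, bool, error) {
	if c.IsCompressible != nil && !c.IsCompressible(data) {
		return data, false, nil // skipped without spending CPU on compression
	}
	var buf bytes.Buffer
	// flate is a stand-in for zstd in this sketch.
	w, err := flate.NewWriter(&buf, flate.DefaultCompression)
	if err != nil {
		return nil, false, err
	}
	if _, err := w.Write(data); err != nil {
		return nil, false, err
	}
	if err := w.Close(); err != nil {
		return nil, false, err
	}
	if buf.Len() >= len(data) {
		return data, false, nil // no gain, keep the original to avoid read overhead
	}
	return buf.Bytes(), true, nil
}

func main() {
	payload := bytes.Repeat([]byte("frostfs object payload "), 1024)

	always := Compressor{}
	out, ok, err := always.Compress(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("no pre-check: compressed=%v, %d -> %d bytes\n", ok, len(payload), len(out))

	never := Compressor{IsCompressible: func([]byte) bool { return false }}
	out, ok, err = never.Compress(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("pre-check rejects: compressed=%v, %d -> %d bytes\n", ok, len(payload), len(out))
}
```

Keeping the check as a function field means the estimator and its default can be swapped once the benchmark results are in, without touching the compression path.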
fyrchik added the enhancement and frostfs-node labels 2023-10-25 12:04:26 +00:00
fyrchik added this to the v0.38.0 milestone 2023-10-25 12:04:33 +00:00
fyrchik changed title from Determine whether the output is compressible to Determine whether an object is compressible 2023-10-25 12:06:32 +00:00
dstepanov-yadro self-assigned this 2023-10-30 14:02:59 +00:00

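Tested on the https://github.com/MiloszKrajewski/SilesiaCorpus benchmark data.

```
// Estimated compressed/original ratio, derived from the estimator's score.
estimated := 1.0 - compress.Estimate(data)
// Measured compressed/original ratio after real compression.
actual := float64(len(compressed)) / float64(len(data))
```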

Results:

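| file    | estimated | actual |
|---------|-----------|--------|
| dickens | 0.625     | 0.365  |
| mozilla | 0.474     | 0.367  |
| mr      | 0.316     | 0.362  |
| nci     | 0.484     | 0.086  |
| ooffice | 0.581     | 0.519  |
| osdb    | 0.666     | 0.348  |
| reymont | 0.566     | 0.297  |
| samba   | 0.555     | 0.232  |
| sao     | 0.715     | 0.796  |
| webster | 0.620     | 0.295  |
| x-ray   | 0.689     | 0.745  |
| xml     | 0.539     | 0.121  |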

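`crypto/rand` results:

| size    | estimated | actual |
|---------|-----------|--------|
| 1KB     | 1.000     | 1.015  |
| 2KB     | 1.000     | 1.007  |
| 4KB     | 1.000     | 1.003  |
| 8KB     | 0.967     | 1.002  |
| 16KB    | 0.973     | 1.001  |
| 32KB    | 1.000     | 1.000  |
| 64KB    | 1.000     | 1.000  |
| 128KB   | 0.999     | 1.000  |
| 256KB   | 0.984     | 1.000  |
| 512KB   | 0.999     | 1.000  |
| 1024KB  | 0.999     | 1.000  |
| 2048KB  | 0.999     | 1.000  |
| 4096KB  | 0.987     | 1.000  |
| 8192KB  | 0.992     | 1.000  |
| 16384KB | 0.999     | 1.000  |
| 32768KB | 0.999     | 1.000  |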
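
For anyone who wants to reproduce this kind of estimated-versus-actual comparison without external modules, the following is a rough sketch under stated assumptions. `entropyRatio` is a simplified order-0 Shannon entropy estimate and is not the algorithm behind `compress.Estimate`; `compress/flate` stands in for zstd. Both function names are hypothetical, and the numbers it prints will differ from the tables above.

```go
package main

import (
	"bytes"
	"compress/flate"
	"crypto/rand"
	"fmt"
	"math"
)

// entropyRatio estimates compressed/original size as the order-0 Shannon
// entropy in bits per byte divided by 8. It is a stand-in estimator for this
// sketch only.
func entropyRatio(data []byte) float64 {
	if len(data) == 0 {
		return 1.0
	}
	var hist [256]int
	for _, b := range data {
		hist[b]++
	}
	n := float64(len(data))
	bits := 0.0
	for _, c := range hist {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		bits -= p * math.Log2(p)
	}
	return bits / 8
}

// actualRatio compresses data with flate (a stand-in for zstd) and returns
// compressed size divided by original size.
func actualRatio(data []byte) (float64, error) {
	if len(data) == 0 {
		return 1.0, nil
	}
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.BestSpeed)
	if err != nil {
		return 0, err
	}
	if _, err := w.Write(data); err != nil {
		return 0, err
	}
	if err := w.Close(); err != nil {
		return 0, err
	}
	return float64(buf.Len()) / float64(len(data)), nil
}

func main() {
	text := bytes.Repeat([]byte("the quick brown fox jumps over the lazy dog\n"), 2048)
	random := make([]byte, 64<<10)
	if _, err := rand.Read(random); err != nil {
		panic(err)
	}
	inputs := []struct {
		name string
		data []byte
	}{{"text", text}, {"crypto/rand", random}}

	for _, in := range inputs {
		actual, err := actualRatio(in.data)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-12s estimated=%.3f actual=%.3f\n", in.name, entropyRatio(in.data), actual)
	}
}
```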

Can you also add results for the `crypto/rand` data here? We would like to have some threshold; it seems something around 0.85 would be perfect.
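
As an illustration of how such a threshold would act on the numbers posted in this thread, here is a small sketch. `shouldCompress` is a hypothetical helper; 0.85 is the value suggested above, not a benchmarked default. The sample values are copied from the result tables: 0.715 (sao) is the highest estimate in the Silesia set and 0.967 (8KB) is the lowest estimate for `crypto/rand` data.

```go
package main

import "fmt"

// shouldCompress reports whether the estimated compressed/original ratio is
// low enough to make a compression attempt worthwhile.
func shouldCompress(estimatedRatio, threshold float64) bool {
	return estimatedRatio < threshold
}

func main() {
	const threshold = 0.85 // suggested in the thread, not a measured default

	samples := []struct {
		name      string
		estimated float64
	}{
		{"mr", 0.316},              // lowest Silesia estimate
		{"sao", 0.715},             // highest Silesia estimate
		{"crypto/rand 8KB", 0.967}, // lowest estimate for random data
		{"crypto/rand 1KB", 1.000},
	}
	for _, s := range samples {
		fmt.Printf("%-16s estimated=%.3f compress=%v\n",
			s.name, s.estimated, shouldCompress(s.estimated, threshold))
	}
}
```

With these inputs every Silesia file falls below 0.85 and every random sample falls above it, so the suggested threshold separates the two groups in the posted data.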

Performance benchmark:

goos: linux
goarch: amd64
pkg: git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/blobstor/compression
cpu: 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
BenchmarkCompressionRealVSEstimate/estimate-8 1000000000 0.4262 ns/op 0 B/op 0 allocs/op
BenchmarkCompressionRealVSEstimate/compress-8 1 1122757487 ns/op 378271392 B/op 507 allocs/op
PASS
ok git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/blobstor/compression 11.916s
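
For reference, a comparison like this can be reproduced with a sketch along the following lines. It is not the benchmark from the repository: `estimateRatio` is a simplified entropy-based stand-in for `compress.Estimate`, `compress/flate` stands in for zstd, and all names are hypothetical. Results are stored in a package-level variable because figures well below a nanosecond per operation can mean the measured call was optimized away or not executed once per iteration.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"math"
	"testing"
)

// sink keeps results alive so the compiler cannot discard the measured calls.
var sink float64

// estimateRatio is a stand-in estimator: order-0 entropy in bits per byte / 8.
func estimateRatio(data []byte) float64 {
	if len(data) == 0 {
		return 1.0
	}
	var hist [256]int
	for _, b := range data {
		hist[b]++
	}
	n := float64(len(data))
	bits := 0.0
	for _, c := range hist {
		if c > 0 {
			p := float64(c) / n
			bits -= p * math.Log2(p)
		}
	}
	return bits / 8
}

// compressRatio performs a real compression pass (flate as a zstd stand-in)
// and returns compressed size divided by original size.
func compressRatio(data []byte) float64 {
	if len(data) == 0 {
		return 1.0
	}
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.BestSpeed)
	if err != nil {
		panic(err) // only possible with an invalid compression level
	}
	if _, err := w.Write(data); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
	return float64(buf.Len()) / float64(len(data))
}

func main() {
	data := bytes.Repeat([]byte("object payload sample "), 1<<14)
	cases := []struct {
		name string
		fn   func([]byte) float64
	}{{"estimate", estimateRatio}, {"compress", compressRatio}}

	for _, bc := range cases {
		res := testing.Benchmark(func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				sink = bc.fn(data) // one full call per iteration
			}
		})
		fmt.Printf("%-8s %s %s\n", bc.name, res.String(), res.MemString())
	}
}
```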

> Can you also add results for the `crypto/rand` data here? We would like to have some threshold; it seems something around 0.85 would be perfect.

Done, in the original message.
Reference: TrueCloudLab/frostfs-node#754