WIP: Experimental bitcask-based writecache implementation #654

Closed
ale64bit wants to merge 1 commit from ale64bit/frostfs-node:feature/wc-experiment into master
Member

Signed-off-by: Alejandro Lopez <a.lopez@yadro.com>
ale64bit force-pushed feature/wc-experiment from 0f4b63a027 to 83a05579b3 2023-08-28 11:11:21 +00:00 Compare
ale64bit force-pushed feature/wc-experiment from 83a05579b3 to 9aa44a1790 2023-08-28 11:30:11 +00:00 Compare
ale64bit force-pushed feature/wc-experiment from 9aa44a1790 to 30fc6b56bd 2023-08-30 11:32:07 +00:00 Compare
ale64bit force-pushed feature/wc-experiment from 30fc6b56bd to cf50a6b9c6 2023-08-31 09:29:57 +00:00 Compare
ale64bit force-pushed feature/wc-experiment from cf50a6b9c6 to f8583060ec 2023-08-31 10:28:19 +00:00 Compare
ale64bit changed title from WIP: [#xx] Experimental lite writecache implementation to Experimental bitcask-based writecache implementation 2023-08-31 10:29:03 +00:00
ale64bit requested review from storage-core-committers 2023-08-31 10:29:18 +00:00
ale64bit requested review from storage-core-developers 2023-08-31 10:29:19 +00:00
Author
Member

Results from basic benchmarks:

$ go test -benchmem -run=^$ -benchtime 1m -cpu 1 -timeout 1h -bench ^BenchmarkWritecacheSeq$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecacheSeq/bitcask_seq         	   38026	   1864616 ns/op	   49900 B/op	      77 allocs/op
BenchmarkWritecacheSeq/bbolt_seq           	    6319	  11320495 ns/op	   93123 B/op	     167 allocs/op
BenchmarkWritecacheSeq/badger_seq          	   12699	   5128271 ns/op	17288137 B/op	    2175 allocs/op

$ go test -benchmem -run=^$ -benchtime 1m -cpu 4 -timeout 1h -bench ^BenchmarkWritecachePar$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecachePar/bitcask_par-4         	  135234	    553496 ns/op	   47069 B/op	      74 allocs/op
BenchmarkWritecachePar/bbolt_par-4           	   24415	   2947224 ns/op	   91298 B/op	     149 allocs/op
BenchmarkWritecachePar/badger_par-4          	   10000	   6544963 ns/op	32912089 B/op	    2112 allocs/op

$ go test -benchmem -run=^$ -benchtime 1m -cpu 8 -timeout 1h -bench ^BenchmarkWritecachePar$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecachePar/bitcask_par-8         	  281928	    266495 ns/op	   45196 B/op	      72 allocs/op
BenchmarkWritecachePar/bbolt_par-8           	   46563	   1524095 ns/op	   90591 B/op	     147 allocs/op
BenchmarkWritecachePar/badger_par-8          	   17053	   4623445 ns/op	29140321 B/op	    3067 allocs/op

$ go test -benchmem -run=^$ -benchtime 1m -cpu 32 -timeout 1h -bench ^BenchmarkWritecachePar$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecachePar/bitcask_par-32         	 1000000	     71073 ns/op	   46933 B/op	      72 allocs/op
BenchmarkWritecachePar/bbolt_par-32           	  153822	    506729 ns/op	   82818 B/op	     141 allocs/op
BenchmarkWritecachePar/badger_par-32          	   83230	   1073903 ns/op	 1724837 B/op	     768 allocs/op

Needs more testing, in particular for the recovery workflow.

ale64bit added the
discussion
label 2023-08-31 10:30:17 +00:00
aarifullin reviewed 2023-08-31 11:12:16 +00:00
@ -0,0 +180,4 @@
// Read payload size
var sizeBytes [4]byte
Member

Can you please define a constant for `4` and use it here?

var sizeBytes [4]byte
...
return addr, obj, keyLen + 4 + len(data), nil
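
A minimal sketch of what the requested change could look like (the constant name `sizeLen`, the helper, and the byte order are illustrative, not necessarily what the PR ends up using):

```go
package writecache

import (
	"encoding/binary"
	"io"
)

// sizeLen is the number of bytes used to encode the payload size prefix.
const sizeLen = 4

// decodeSize reads the payload size prefix from the start of a serialized record.
func decodeSize(b []byte) (uint32, error) {
	if len(b) < sizeLen {
		return 0, io.ErrUnexpectedEOF
	}
	return binary.LittleEndian.Uint32(b[:sizeLen]), nil
}
```

The returned record length would then read `keyLen + sizeLen + len(data)` instead of repeating the magic number.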
Author
Member

done

ale64bit force-pushed feature/wc-experiment from f8583060ec to 42e74d6aab 2023-08-31 11:17:26 +00:00 Compare
Owner

Speaking of benchmarks -- sequential benchmarks are affected by the batch delay and batch size. Have you made batch_size=1 for bbolt?
Parallel benchmarks need to be executed in similar settings -- is that true?

Author
Member

> Speaking of benchmarks -- sequential benchmarks are affected by the batch delay and batch size. Have you made batch_size=1 for bbolt?
> Parallel benchmarks need to be executed in similar settings -- is that true?

Well, I tried to use the default values as much as possible, since those should be the ones representative of actual performance. They should be executed in settings that are representative of how they will be used; we can't arbitrarily change parameters in a production deployment to match a specific workload.

Sequential and parallel results after setting MaxBatchDelay to 1ms for bbolt (this matches the delay used by bitcask):

go test -benchmem -run=^$ -benchtime 1m -cpu 1 -timeout 1h -bench ^BenchmarkWritecacheSeq$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecacheSeq/bitcask_seq         	   36579	   2001657 ns/op	   51086 B/op	      79 allocs/op
BenchmarkWritecacheSeq/bbolt_seq           	   35227	   2060927 ns/op	  103768 B/op	     186 allocs/op
BenchmarkWritecacheSeq/badger_seq          	   13095	   5262366 ns/op	24072132 B/op	    2529 allocs/op

go test -benchmem -run=^$ -benchtime 1m -cpu 32 -timeout 1h -bench ^BenchmarkWritecachePar$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecachePar/bitcask_par-32         	 1000000	     70398 ns/op	   46931 B/op	      72 allocs/op
BenchmarkWritecachePar/bbolt_par-32           	  353757	    274510 ns/op	   96260 B/op	     161 allocs/op
BenchmarkWritecachePar/badger_par-32          	   82377	   1102501 ns/op	 1751070 B/op	     756 allocs/op

Sequential and parallel results after setting MaxBatchSize to 1 for bbolt:

go test -benchmem -run=^$ -benchtime 1m -cpu 1 -timeout 1h -bench ^BenchmarkWritecacheSeq$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecacheSeq/bitcask_seq         	   35458	   2020180 ns/op	   52071 B/op	      80 allocs/op
BenchmarkWritecacheSeq/bbolt_seq           	   58494	   1326985 ns/op	  220290 B/op	     258 allocs/op
BenchmarkWritecacheSeq/badger_seq          	   13074	   5148205 ns/op	24314056 B/op	    2853 allocs/op

go test -benchmem -run=^$ -benchtime 1m -cpu 32 -timeout 1h -bench ^BenchmarkWritecachePar$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecachePar/bitcask_par-32         	 1000000	     72101 ns/op	   46910 B/op	      72 allocs/op
BenchmarkWritecachePar/bbolt_par-32           	   66728	   1159092 ns/op	  150257 B/op	     262 allocs/op
BenchmarkWritecachePar/badger_par-32          	   81734	   1065923 ns/op	 1767468 B/op	     766 allocs/op
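
For reference, a minimal sketch of how these two bbolt batch parameters can be set (the path and surrounding setup are hypothetical, not the writecache's actual wiring; MaxBatchDelay and MaxBatchSize are exported fields on *bbolt.DB):

```go
package main

import (
	"log"
	"time"

	bbolt "go.etcd.io/bbolt"
)

func main() {
	// Hypothetical database path; the real writecache wires this up via its config.
	db, err := bbolt.Open("/tmp/writecache.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Batch parameters are plain fields on *bbolt.DB and can be tuned after opening.
	db.MaxBatchDelay = time.Millisecond // 1ms, matching the delay used by bitcask above
	db.MaxBatchSize = 1                 // commit each batched call individually
}
```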

Of course, we can set the MaxBatchDelay to 0 in bitcask so that it flushes immediately during the sequential benchmark, just to say that it beats bbolt:

go test -benchmem -run=^$ -benchtime 1m -cpu 1 -timeout 1h -bench ^BenchmarkWritecacheSeq$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecacheSeq/bitcask_seq         	  130424	    583808 ns/op	   48586 B/op	      76 allocs/op
BenchmarkWritecacheSeq/bbolt_seq           	   53733	   1327371 ns/op	  166602 B/op	     255 allocs/op
BenchmarkWritecacheSeq/badger_seq          	   12619	   4932394 ns/op	21091385 B/op	    2584 allocs/op

but IMHO, the latency and throughput during parallel workloads are more important, as long as the difference during sequential workloads is not too large.

In any case, benchmarks on actual hardware and interacting with the other components are more meaningful.

Owner

> They should be executed in settings that are representative of how they will be used; we can't arbitrarily change parameters in a production deployment to match a specific workload.

This is not about a specific workload, this is about an apples-to-apples comparison. We may switch the implementation or we may just switch a parameter.

> but IMHO, the latency and throughput during parallel workloads are more important

I agree. Given that we consume less memory and no longer panic if the media is removed, the new implementation looks promising.

> In any case, benchmarks on actual hardware and interacting with the other components are more meaningful.

We have a local k6 loader -- it could help; we have flushes there which interfere with the PUTs.

Author
Member

> > They should be executed in settings that are representative of how they will be used; we can't arbitrarily change parameters in a production deployment to match a specific workload.
>
> This is not about a specific workload, this is about an apples-to-apples comparison. We may switch the implementation or we may just switch a parameter.

OK.

> > but IMHO, the latency and throughput during parallel workloads are more important
>
> I agree. Given that we consume less memory and no longer panic if the media is removed, the new implementation looks promising.

> > In any case, benchmarks on actual hardware and interacting with the other components are more meaningful.
>
> We have a local k6 loader -- it could help; we have flushes there which interfere with the PUTs.

Not sure I follow. Flushing is enabled for these benchmarks as well. The only difference is that the backing store and metabase are mocks, so that the benchmarks actually measure the writecache and not other things.


As far as I understand, the current implementation stores all the keys in memory.
Assume we have 12 shards and each writecache shard has a maximum size of 256GB. Then for an object size of 1KB we must store 256GB * 12 / 1KB = 3221225472 keys in memory. If each key takes 64 bytes, that means about 192GB of RAM just for the keys.
So I think this kind of writecache must have not only a size limit, but a key limit too.
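
Spelling out that estimate (a quick sketch; the shard size, object size, and per-key overhead are the assumptions from the comment above):

```go
package main

import "fmt"

func main() {
	const (
		shardCapacity = 256 << 30 // 256GB per writecache shard
		objectSize    = 1 << 10   // 1KB objects
		shards        = 12
		keySize       = 64 // bytes kept in memory per key
	)
	keys := shardCapacity / objectSize * shards
	fmt.Printf("keys: %d, RAM for keys: %dGB\n", keys, keys*keySize>>30)
	// Output: keys: 3221225472, RAM for keys: 192GB
}
```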

dstepanov-yadro requested changes 2023-09-01 12:35:16 +00:00
@ -0,0 +58,4 @@
type cache struct {
options
mode atomic.Uint32

We have to add at least metrics to compare the performance of the different writecache implementations.

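A hypothetical sketch of the kind of metric being asked for, using Prometheus client_golang; the real change would go through frostfs-node's own metrics package, so the namespace, name, and labels here are illustrative only:

```go
package writecache

import "github.com/prometheus/client_golang/prometheus"

// opDuration would let the different writecache implementations be compared
// on the same dashboards. Namespace, subsystem, and labels are hypothetical.
var opDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "frostfs_node",
		Subsystem: "writecache",
		Name:      "operation_duration_seconds",
		Help:      "Duration of writecache operations, by implementation and operation.",
	},
	[]string{"implementation", "operation"},
)
```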
fyrchik changed title from Experimental bitcask-based writecache implementation to WIP: Experimental bitcask-based writecache implementation 2023-09-11 09:40:10 +00:00
Owner

Making it WIP until @ale64bit is back from vacation.

fyrchik closed this pull request 2024-04-27 10:56:39 +00:00
Some checks failed
DCO action / DCO (pull_request) Successful in 3m8s
Vulncheck / Vulncheck (pull_request) Successful in 3m16s
Build / Build Components (1.20) (pull_request) Successful in 4m13s
Build / Build Components (1.21) (pull_request) Successful in 4m16s
Tests and linters / Staticcheck (pull_request) Successful in 5m11s
Tests and linters / Lint (pull_request) Successful in 5m58s
Tests and linters / Tests with -race (pull_request) Failing after 6m3s
Tests and linters / Tests (1.20) (pull_request) Successful in 7m29s
Tests and linters / Tests (1.21) (pull_request) Successful in 7m38s


Reference: TrueCloudLab/frostfs-node#654