WIP: Experimental bitcask-based writecache implementation #654

Closed
ale64bit wants to merge 1 commit from ale64bit/frostfs-node:feature/wc-experiment into master
Member

Signed-off-by: Alejandro Lopez <a.lopez@yadro.com>
ale64bit force-pushed feature/wc-experiment from 0f4b63a027 to 83a05579b3 2023-08-28 11:11:21 +00:00 Compare
ale64bit force-pushed feature/wc-experiment from 83a05579b3 to 9aa44a1790 2023-08-28 11:30:11 +00:00 Compare
ale64bit force-pushed feature/wc-experiment from 9aa44a1790 to 30fc6b56bd 2023-08-30 11:32:07 +00:00 Compare
ale64bit force-pushed feature/wc-experiment from 30fc6b56bd to cf50a6b9c6 2023-08-31 09:29:57 +00:00 Compare
ale64bit force-pushed feature/wc-experiment from cf50a6b9c6 to f8583060ec 2023-08-31 10:28:19 +00:00 Compare
ale64bit changed title from WIP: [#xx] Experimental lite writecache implementation to Experimental bitcask-based writecache implementation 2023-08-31 10:29:03 +00:00
ale64bit requested review from storage-core-committers 2023-08-31 10:29:18 +00:00
ale64bit requested review from storage-core-developers 2023-08-31 10:29:19 +00:00
Author
Member

Results from basic benchmarks:

$ go test -benchmem -run=^$ -benchtime 1m -cpu 1 -timeout 1h -bench ^BenchmarkWritecacheSeq$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecacheSeq/bitcask_seq         	   38026	   1864616 ns/op	   49900 B/op	      77 allocs/op
BenchmarkWritecacheSeq/bbolt_seq           	    6319	  11320495 ns/op	   93123 B/op	     167 allocs/op
BenchmarkWritecacheSeq/badger_seq          	   12699	   5128271 ns/op	17288137 B/op	    2175 allocs/op

$ go test -benchmem -run=^$ -benchtime 1m -cpu 4 -timeout 1h -bench ^BenchmarkWritecachePar$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecachePar/bitcask_par-4         	  135234	    553496 ns/op	   47069 B/op	      74 allocs/op
BenchmarkWritecachePar/bbolt_par-4           	   24415	   2947224 ns/op	   91298 B/op	     149 allocs/op
BenchmarkWritecachePar/badger_par-4          	   10000	   6544963 ns/op	32912089 B/op	    2112 allocs/op

$ go test -benchmem -run=^$ -benchtime 1m -cpu 8 -timeout 1h -bench ^BenchmarkWritecachePar$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecachePar/bitcask_par-8         	  281928	    266495 ns/op	   45196 B/op	      72 allocs/op
BenchmarkWritecachePar/bbolt_par-8           	   46563	   1524095 ns/op	   90591 B/op	     147 allocs/op
BenchmarkWritecachePar/badger_par-8          	   17053	   4623445 ns/op	29140321 B/op	    3067 allocs/op

$ go test -benchmem -run=^$ -benchtime 1m -cpu 32 -timeout 1h -bench ^BenchmarkWritecachePar$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecachePar/bitcask_par-32         	 1000000	     71073 ns/op	   46933 B/op	      72 allocs/op
BenchmarkWritecachePar/bbolt_par-32           	  153822	    506729 ns/op	   82818 B/op	     141 allocs/op
BenchmarkWritecachePar/badger_par-32          	   83230	   1073903 ns/op	 1724837 B/op	     768 allocs/op

Needs more testing, in particular for the recovery workflow.

ale64bit added the
discussion
label 2023-08-31 10:30:17 +00:00
aarifullin reviewed 2023-08-31 11:12:16 +00:00
@ -0,0 +180,4 @@
// Read payload size
var sizeBytes [4]byte
Member

Can you please define a constant for `4` and use it here?

var sizeBytes [4]byte
...
return addr, obj, keyLen + 4 + len(data), nil
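
A minimal sketch of what the requested change could look like (the constant name `sizeLen`, the helper, and the byte order are illustrative, not necessarily what the PR ends up using):

```go
package writecache

import (
	"encoding/binary"
	"io"
)

// sizeLen is the number of bytes used to encode the payload size prefix.
const sizeLen = 4

// decodeSize reads the payload size prefix from the start of a serialized record.
func decodeSize(b []byte) (uint32, error) {
	if len(b) < sizeLen {
		return 0, io.ErrUnexpectedEOF
	}
	return binary.LittleEndian.Uint32(b[:sizeLen]), nil
}
```

The returned record length would then read `keyLen + sizeLen + len(data)` instead of repeating the magic number.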
Author
Member

done

ale64bit force-pushed feature/wc-experiment from f8583060ec to 42e74d6aab 2023-08-31 11:17:26 +00:00 Compare
Owner

Speaking of benchmarks -- sequential benchmarks are affected by the batch delay and batch size. Have you made batch_size=1 for bbolt?
Parallel benchmarks need to be executed in similar settings -- is that true?

Author
Member

> Speaking of benchmarks -- sequential benchmarks are affected by the batch delay and batch size. Have you made batch_size=1 for bbolt?
> Parallel benchmarks need to be executed in similar settings -- is that true?

Well, I tried to use the default values as much as possible, since those should be the ones representative of actual performance. They should be executed in settings that are representative of how they will be used; we can't arbitrarily change parameters in a production deployment to match a specific workload.

Sequential and parallel results after setting MaxBatchDelay to 1ms for bbolt (this matches the delay used by bitcask):

go test -benchmem -run=^$ -benchtime 1m -cpu 1 -timeout 1h -bench ^BenchmarkWritecacheSeq$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecacheSeq/bitcask_seq         	   36579	   2001657 ns/op	   51086 B/op	      79 allocs/op
BenchmarkWritecacheSeq/bbolt_seq           	   35227	   2060927 ns/op	  103768 B/op	     186 allocs/op
BenchmarkWritecacheSeq/badger_seq          	   13095	   5262366 ns/op	24072132 B/op	    2529 allocs/op

go test -benchmem -run=^$ -benchtime 1m -cpu 32 -timeout 1h -bench ^BenchmarkWritecachePar$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecachePar/bitcask_par-32         	 1000000	     70398 ns/op	   46931 B/op	      72 allocs/op
BenchmarkWritecachePar/bbolt_par-32           	  353757	    274510 ns/op	   96260 B/op	     161 allocs/op
BenchmarkWritecachePar/badger_par-32          	   82377	   1102501 ns/op	 1751070 B/op	     756 allocs/op

Sequential and parallel results after setting MaxBatchSize to 1 for bbolt:

go test -benchmem -run=^$ -benchtime 1m -cpu 1 -timeout 1h -bench ^BenchmarkWritecacheSeq$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecacheSeq/bitcask_seq         	   35458	   2020180 ns/op	   52071 B/op	      80 allocs/op
BenchmarkWritecacheSeq/bbolt_seq           	   58494	   1326985 ns/op	  220290 B/op	     258 allocs/op
BenchmarkWritecacheSeq/badger_seq          	   13074	   5148205 ns/op	24314056 B/op	    2853 allocs/op

go test -benchmem -run=^$ -benchtime 1m -cpu 32 -timeout 1h -bench ^BenchmarkWritecachePar$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecachePar/bitcask_par-32         	 1000000	     72101 ns/op	   46910 B/op	      72 allocs/op
BenchmarkWritecachePar/bbolt_par-32           	   66728	   1159092 ns/op	  150257 B/op	     262 allocs/op
BenchmarkWritecachePar/badger_par-32          	   81734	   1065923 ns/op	 1767468 B/op	     766 allocs/op
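
For reference, a minimal sketch of how these two bbolt batch parameters can be set (the path and surrounding setup are hypothetical, not the writecache's actual wiring; MaxBatchDelay and MaxBatchSize are exported fields on *bbolt.DB):

```go
package main

import (
	"log"
	"time"

	bbolt "go.etcd.io/bbolt"
)

func main() {
	// Hypothetical database path; the real writecache wires this up via its config.
	db, err := bbolt.Open("/tmp/writecache.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Batch parameters are plain fields on *bbolt.DB and can be tuned after opening.
	db.MaxBatchDelay = time.Millisecond // 1ms, matching the delay used by bitcask above
	db.MaxBatchSize = 1                 // commit each batched call individually
}
```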

Of course, we can set the MaxBatchDelay to 0 in bitcask so that it flushes immediately during the sequential benchmark, just to say that it beats bbolt:

go test -benchmem -run=^$ -benchtime 1m -cpu 1 -timeout 1h -bench ^BenchmarkWritecacheSeq$ git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/writecache/benchmark
BenchmarkWritecacheSeq/bitcask_seq         	  130424	    583808 ns/op	   48586 B/op	      76 allocs/op
BenchmarkWritecacheSeq/bbolt_seq           	   53733	   1327371 ns/op	  166602 B/op	     255 allocs/op
BenchmarkWritecacheSeq/badger_seq          	   12619	   4932394 ns/op	21091385 B/op	    2584 allocs/op

but IMHO, the latency and throughput during parallel workloads are more important, as long as the difference during sequential workloads is not too large.

In any case, benchmarks on actual hardware and interacting with the other components are more meaningful.

Owner

> They should be executed in settings that are representative of how they will be used; we can't arbitrarily change parameters in a production deployment to match a specific workload.

This is not about a specific workload, this is about an apples-to-apples comparison. We may switch the implementation or we may just switch a parameter.

> but IMHO, the latency and throughput during parallel workloads are more important

I agree. Given that we consume less memory and no longer panic if the media is removed, the new implementation looks promising.

> In any case, benchmarks on actual hardware and interacting with the other components are more meaningful.

We have a local k6 loader -- it could help; we have flushes there which interfere with the PUTs.

Author
Member

> > They should be executed in settings that are representative of how they will be used; we can't arbitrarily change parameters in a production deployment to match a specific workload.
>
> This is not about a specific workload, this is about an apples-to-apples comparison. We may switch the implementation or we may just switch a parameter.

OK.

> > but IMHO, the latency and throughput during parallel workloads are more important
>
> I agree. Given that we consume less memory and no longer panic if the media is removed, the new implementation looks promising.

> > In any case, benchmarks on actual hardware and interacting with the other components are more meaningful.
>
> We have a local k6 loader -- it could help; we have flushes there which interfere with the PUTs.

Not sure I follow. Flushing is enabled for these benchmarks as well. The only difference is that the backing store and metabase are mocks, so that the benchmarks actually measure the writecache and not other things.


As far as I understand, the current implementation stores all the keys in memory.
Assume we have 12 shards and each writecache shard has a maximum size of 256GB. Then for an object size of 1KB we must store 256GB * 12 / 1KB = 3221225472 keys in memory. If each key takes 64 bytes, that means about 192GB of RAM just for the keys.
So I think this kind of writecache must have not only a size limit, but a key limit too.
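
Spelling out that estimate (a quick sketch; the shard size, object size, and per-key overhead are the assumptions from the comment above):

```go
package main

import "fmt"

func main() {
	const (
		shardCapacity = 256 << 30 // 256GB per writecache shard
		objectSize    = 1 << 10   // 1KB objects
		shards        = 12
		keySize       = 64 // bytes kept in memory per key
	)
	keys := shardCapacity / objectSize * shards
	fmt.Printf("keys: %d, RAM for keys: %dGB\n", keys, keys*keySize>>30)
	// Output: keys: 3221225472, RAM for keys: 192GB
}
```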

dstepanov-yadro requested changes 2023-09-01 12:35:16 +00:00
@ -0,0 +58,4 @@
type cache struct {
options
mode atomic.Uint32

We have to add at least metrics to compare the performance of the different writecache implementations.

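A hypothetical sketch of the kind of metric being asked for, using Prometheus client_golang; the real change would go through frostfs-node's own metrics package, so the namespace, name, and labels here are illustrative only:

```go
package writecache

import "github.com/prometheus/client_golang/prometheus"

// opDuration would let the different writecache implementations be compared
// on the same dashboards. Namespace, subsystem, and labels are hypothetical.
var opDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "frostfs_node",
		Subsystem: "writecache",
		Name:      "operation_duration_seconds",
		Help:      "Duration of writecache operations, by implementation and operation.",
	},
	[]string{"implementation", "operation"},
)
```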
fyrchik changed title from Experimental bitcask-based writecache implementation to WIP: Experimental bitcask-based writecache implementation 2023-09-11 09:40:10 +00:00
Owner

Making it WIP until @ale64bit is back from vacation.

fyrchik closed this pull request 2024-04-27 10:56:39 +00:00
Some checks failed
DCO action / DCO (pull_request) Successful in 3m8s
Vulncheck / Vulncheck (pull_request) Successful in 3m16s
Build / Build Components (1.20) (pull_request) Successful in 4m13s
Build / Build Components (1.21) (pull_request) Successful in 4m16s
Tests and linters / Staticcheck (pull_request) Successful in 5m11s
Tests and linters / Lint (pull_request) Successful in 5m58s
Tests and linters / Tests with -race (pull_request) Failing after 6m3s
Tests and linters / Tests (1.20) (pull_request) Successful in 7m29s
Tests and linters / Tests (1.21) (pull_request) Successful in 7m38s


Reference: TrueCloudLab/frostfs-node#654