Measure tombstone placement duration depending on tombstone size #1450

Closed
opened 2024-10-25 11:46:00 +00:00 by alexvanin · 7 comments
Owner

A tombstone object contains a set of object addresses that should be marked as removed. When a tombstone object is processed by the object service, it inhumes these addresses in the metabase. The more addresses a tombstone has, the longer it takes to process in the object service.

It would be nice to see how long it takes to perform an `object.Put` operation with different tombstone sizes, e.g. 1, 10, 1000 and 10 000 addresses. If it takes quite a while, maybe there is room for optimization, such as bulk inhumes or something else.
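For reference, a minimal sketch of such a measurement as a table-driven Go benchmark; `generateAddresses` and `putTombstone` are hypothetical helpers standing in for test setup and the actual `object.Put` call, not code from this repository:

```go
package engine_test

import (
	"fmt"
	"testing"
)

// BenchmarkPutTombstone measures tombstone placement for different numbers of
// member addresses. generateAddresses and putTombstone are hypothetical:
// the first builds n random object addresses, the second wraps them into a
// tombstone object and performs object.Put against a test node.
func BenchmarkPutTombstone(b *testing.B) {
	for _, n := range []int{1, 10, 1000, 10000} {
		b.Run(fmt.Sprintf("%d_addresses", n), func(b *testing.B) {
			addrs := generateAddresses(n)
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				if err := putTombstone(addrs); err != nil {
					b.Fatal(err)
				}
			}
		})
	}
}
```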

## Context

The S3 Gateway supports multipart uploads, where a single S3 object is represented as a set of FrostFS objects. The limit is up to 10 000 parts.

Right now the S3 Gateway removes a multipart object inefficiently: it calls `object.Delete` for every part sequentially.

The alternative is to collect one big tombstone object and store it. However, @mbiryukova noticed that the `object.Put` operation for such a tombstone may take quite some time. The measurements may be inaccurate, but in a virtual environment a tombstone of 1024 addresses took almost a minute (50 s). It would be nice to check this independently in this issue.
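To make the two strategies concrete, here is a hedged sketch; the `Client` interface, `Tombstone` type and method names are illustrative stand-ins, not the real FrostFS SDK API:

```go
package main

import "context"

// Address and Tombstone are simplified stand-ins for the real FrostFS types.
type Address string

type Tombstone struct{ Members []Address }

// Client is a hypothetical interface: DeleteObject issues one object.Delete
// (one tombstone) per call, PutTombstone stores a single tombstone object.
type Client interface {
	DeleteObject(ctx context.Context, addr Address) error
	PutTombstone(ctx context.Context, ts Tombstone) error
}

// removeSequentially is what the S3 Gateway does today: object.Delete per part.
func removeSequentially(ctx context.Context, c Client, parts []Address) error {
	for _, p := range parts {
		if err := c.DeleteObject(ctx, p); err != nil {
			return err
		}
	}
	return nil
}

// removeWithOneTombstone is the proposed alternative: one big tombstone for all parts.
func removeWithOneTombstone(ctx context.Context, c Client, parts []Address) error {
	return c.PutTombstone(ctx, Tombstone{Members: parts})
}
```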

alexvanin added the triage, frostfs-node labels 2024-10-25 11:46:15 +00:00
a-savchuk was assigned by fyrchik 2024-10-25 12:05:53 +00:00
fyrchik removed the triage label 2024-10-25 12:07:06 +00:00
Member

I've decided to write a benchmark for the `(*StorageEngine).Inhume` method. I ran it for 1, 10, 100, 1 000, and 10 000 addresses in a tombstone, and I collected CPU and memory profiles for a run with 10 000 addresses. I attached flame charts for each profile and highlighted the most interesting places. Feel free to criticize my methodology and share your thoughts, as those could help improve it.

Source code: [link](https://git.frostfs.info/a-savchuk/frostfs-node/src/commit/87e909343f1cf56cb6f4670a9b6383e0e979e674/pkg/local_object_storage/engine/inhume_test.go#L92-L137)

### Benchmark results

Each storage engine had 100 shards. I tried to investigate whether the number of shards affected performance, but the impact was insignificant.

```
goos: linux
goarch: amd64
pkg: git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/engine
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
BenchmarkInhumeMultipart/1_parts-12                  100            11233450 ns/op
BenchmarkInhumeMultipart/10_parts-12                  10           113034119 ns/op
BenchmarkInhumeMultipart/100_parts-12                 10          1124219185 ns/op
BenchmarkInhumeMultipart/1000_parts-12                10         11261645029 ns/op
BenchmarkInhumeMultipart/10000_parts-12                5        112014979599 ns/op
```
|num. addresses|ms|
|---:|---:|
|1|11|
|10|113|
|100|1 124|
|1 000|11 262|
|10 000|112 015|

### CPU profile (for 10 000 parts only)

Data: cpu_10000_parts.pb.gz

1. An entire chart for the benchmark

cpu_everything.png

2. Zoom on (*StorageEngine).Inhume

cpu_Engine_Inhume.png

3. Zoom on (*StorageEngine).IsLocked

cpu_Engine_IsLocked.png

4. Zoom on (*StorageEngine).inhumeAddr

cpu_Engine_inhumeAddr.png

### Memory profile (for 10 000 parts only)

Data: mem_10000_parts.pb.gz

1. An entire chart for the benchmark

mem_everything.png

2. Zoom on (*StorageEngine).Inhume

mem_Engine_Inhume.png

3. Zoom on (*StorageEngine).IsLocked

mem_Engine_IsLocked.png

4. Zoom on (*StorageEngine).inhumeAddr

mem_Engine_inhumeAddr.png

### What I can see right now

Most of the time is spent checking whether any of the objects to be inhumed are locked (see the chart for `(*StorageEngine).Inhume`). That's mostly because we convert each address to a string when starting a trace span (see the chart for `(*StorageEngine).IsLocked`), and we do it several times, in `(*Shard).IsLocked` and `(*DB).IsLocked`:

```go
ctx, span := tracing.StartSpanFromContext(ctx, "Shard.IsLocked",
	trace.WithAttributes(
		attribute.String("shard_id", s.ID().String()),
		attribute.String("address", addr.EncodeToString()),
	))
```

```go
_, span := tracing.StartSpanFromContext(ctx, "metabase.IsLocked",
	trace.WithAttributes(
		attribute.String("address", prm.addr.EncodeToString()),
	))
```


By contrast, we don't do that in `(*DB).Inhume`:

```go
_, span := tracing.StartSpanFromContext(ctx, "metabase.Inhume")
```
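One possible mitigation, sketched here (not the fix actually adopted in the repository): compute the expensive attribute values only when the span is recording, so `EncodeToString()` is skipped for unsampled spans.

```go
ctx, span := tracing.StartSpanFromContext(ctx, "Shard.IsLocked")
defer span.End()

// Only pay for ID().String() and EncodeToString() when the span is recorded.
if span.IsRecording() {
	span.SetAttributes(
		attribute.String("shard_id", s.ID().String()),
		attribute.String("address", addr.EncodeToString()),
	)
}
```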

Member

I collected an off-CPU profile, as advised by @dstepanov-yadro (thanks!). Here's what I found.

### Off-CPU profile (for 10 000 parts only)

Data: off_cpu_10000_parts.pb.gz

1. Zoom on (*StorageEngine).Inhume

off_cpu_Engine_Inhume.png

2. Zoom on (*DB).Inhume

off_cpu_Metabase_Inhume.png

### My observations

According to the chart for `(*StorageEngine).Inhume`, the time distribution has changed significantly compared to the previous (on-CPU) chart.

Let's take a look at the chart for `(*DB).Inhume`. We ran the benchmark 5 times with 10 000 addresses to inhume. If we call `bbolt.(*DB).Batch` in a loop with the default bbolt batch delay of 10 ms, the expected time would be about 5 * 10 000 * 0.01 s = 500 s. According to the chart, we spent about 0.15 h ≈ 540 s in `(*DB).Inhume`, which indicates that we're using `bbolt.(*DB).Batch` inefficiently in this case.
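For illustration, a hedged sketch of what a bulk inhume inside one bbolt transaction could look like; the `graveyard` bucket name, key encoding, and marker value are hypothetical placeholders, not the metabase's actual schema:

```go
package main

import (
	bolt "go.etcd.io/bbolt"
)

// Address is a stand-in for the real object address type.
type Address interface{ EncodeToString() string }

// tombstoneMark is a hypothetical placeholder value.
var tombstoneMark = []byte{1}

// inhumeBulk marks the whole group of addresses in a single Batch call, so the
// bbolt batch delay is paid once per group instead of once per address.
func inhumeBulk(db *bolt.DB, addrs []Address) error {
	return db.Batch(func(tx *bolt.Tx) error {
		bkt, err := tx.CreateBucketIfNotExists([]byte("graveyard"))
		if err != nil {
			return err
		}
		for _, addr := range addrs {
			if err := bkt.Put([]byte(addr.EncodeToString()), tombstoneMark); err != nil {
				return err
			}
		}
		return nil
	})
}
```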

Owner

> which indicates that we're using `bbolt.(*DB).Batch` inefficiently in this case.

Not necessarily. In isolation, yes, we do. In reality there are concurrent queries coming from users, so it is not obvious whether `Batch` is used inefficiently.

Owner

@a-savchuk Could you share benchmark setup and code?

Member

> @a-savchuk Could you share benchmark setup and code?

I added them in [the first comment](https://git.frostfs.info/TrueCloudLab/frostfs-node/issues/1450#issuecomment-56634).

Member

Two solutions were proposed:

1. Using a worker pool to process objects in parallel.
2. Grouping objects by shard and processing them in batches, since a shard can process multiple objects at once. Additionally, we can use a worker pool to parallelize grouping and processing (see the sketch below).

The main difference between the two solutions is performance and the amount of code that needs to change:

- Solution 1 is less effective than solution 2.
- Solution 1 uses a worker pool to call an already defined method, while solution 2 requires changing the entire method's code.

For implementation details, please refer to #1476 (OUTDATED).
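A hedged sketch of the grouping idea from solution 2 (not the code in #1476; `shardOf` and `InhumeBatch` are hypothetical names): group addresses by the shard that stores them, then inhume each group with a single per-shard call, running the per-shard calls in parallel.

```go
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Address and Shard are simplified stand-ins for the real frostfs-node types;
// shardOf and (*Shard).InhumeBatch are hypothetical helpers for illustration.
type Address string

type Shard struct{ id string }

func (s *Shard) InhumeBatch(ctx context.Context, addrs []Address) error { return nil }

var shards = []*Shard{{id: "shard-0"}, {id: "shard-1"}}

// shardOf is a hypothetical resolver that picks the shard storing an address.
func shardOf(addr Address) *Shard {
	return shards[len(addr)%len(shards)]
}

// inhumeGrouped groups addresses by shard and inhumes each group with one call
// per shard; the per-shard calls run concurrently via an errgroup.
func inhumeGrouped(ctx context.Context, addrs []Address) error {
	groups := make(map[*Shard][]Address)
	for _, addr := range addrs {
		sh := shardOf(addr)
		groups[sh] = append(groups[sh], addr)
	}

	g, ctx := errgroup.WithContext(ctx)
	for sh, group := range groups {
		sh, group := sh, group // capture loop variables (pre-Go 1.22)
		g.Go(func() error {
			return sh.InhumeBatch(ctx, group)
		})
	}
	return g.Wait()
}
```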


### Using a worker pool

#### Data

- benchmark results: old.txt, new1.txt
- on-CPU profile: on-cpu-new1.pb.gz
- memory profile: mem-new1.pb.gz
- off-CPU profile: off-cpu-new1.pb.gz

#### Results

```
goos: linux
goarch: amd64
pkg: git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/engine
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
                                 │    old.txt    │              new1.txt               │
                                 │    sec/op     │   sec/op     vs base                │
InhumeMultipart/objects=1-12         11.42m ± 1%   11.42m ± 1%        ~ (p=0.739 n=10)
InhumeMultipart/objects=10-12       113.51m ± 0%   11.62m ± 1%  -89.76% (p=0.000 n=10)
InhumeMultipart/objects=100-12     1135.41m ± 1%   28.30m ± 1%  -97.51% (p=0.000 n=10)
InhumeMultipart/objects=1000-12    11357.8m ± 0%   259.8m ± 1%  -97.71% (p=0.000 n=10)
InhumeMultipart/objects=10000-12    113.251 ± 0%    2.277 ± 1%  -97.99% (p=0.000 n=10)
geomean                               1.136        74.03m       -93.48%
```

### Grouping objects by shard and using a worker pool

#### Data

- benchmark results: old.txt, new2.txt
- on-CPU profile: on-cpu-new2.pb.gz
- memory profile: mem-new2.pb.gz
- off-CPU profile: off-cpu-new2.pb.gz

#### Results

```
goos: linux
goarch: amd64
pkg: git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/engine
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
                                 │    old.txt     │               new2.txt               │
                                 │     sec/op     │    sec/op     vs base                │
InhumeMultipart/objects=1-12          11.42m ± 1%   10.73m ±  0%   -6.12% (p=0.000 n=10)
InhumeMultipart/objects=10-12        113.51m ± 0%   11.00m ±  1%  -90.31% (p=0.000 n=10)
InhumeMultipart/objects=100-12      1135.41m ± 1%   22.36m ±  1%  -98.03% (p=0.000 n=10)
InhumeMultipart/objects=1000-12    11357.77m ± 0%   41.97m ± 24%  -99.63% (p=0.000 n=10)
InhumeMultipart/objects=10000-12   113250.7m ± 0%   273.9m ±  3%  -99.76% (p=0.000 n=10)
geomean                                1.136        31.36m        -97.24%
```
Member

According to [this comment](https://git.frostfs.info/TrueCloudLab/frostfs-node/pulls/1476#issuecomment-58396), one `Inhume` operation can block another when using a blocking worker pool. Since object grouping alone already speeds up `Inhume`, let's use only that.

For implementation details, please refer to #1476.


### Grouping objects by shard before `Inhume`

#### Data

- benchmark results: old.txt, new.txt
- on-CPU profile: on-cpu-new.pb.gz
- memory profile: mem-new.pb.gz
- off-CPU profile: off-cpu-new.pb.gz

#### Results

```
goos: linux
goarch: amd64
pkg: git.frostfs.info/TrueCloudLab/frostfs-node/pkg/local_object_storage/engine
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
                                 │   old.txt    │              new.txt                │
                                 │    sec/op    │   sec/op     vs base                │
InhumeMultipart/objects=1-12        11.42m ± 1%   10.71m ± 0%   -6.27% (p=0.000 n=10)
InhumeMultipart/objects=10-12       113.5m ± 0%   100.9m ± 3%  -11.08% (p=0.000 n=10)
InhumeMultipart/objects=100-12     1135.4m ± 1%   681.3m ± 2%  -40.00% (p=0.000 n=10)
InhumeMultipart/objects=1000-12     11.358 ± 0%    1.089 ± 1%  -90.41% (p=0.000 n=10)
InhumeMultipart/objects=10000-12   113.251 ± 0%    1.645 ± 1%  -98.55% (p=0.000 n=10)
geomean                              1.136        265.5m       -76.63%
```