Measure tombstone placement duration depending on tombstone size #1450
A tombstone object contains a set of object addresses that should be marked as removed. When a tombstone object is processed by the object service, it inhumes these addresses in the metabase. The more addresses a tombstone has, the longer it takes to process in the object service.

It would be nice to see how long it takes to perform the object.Put operation with different tombstone sizes, e.g. 1, 10, 1000 and 10 000 addresses. If it takes quite a while, maybe there is room for optimizations such as bulk inhumes or something else.

Context

The S3 Gateway supports multipart uploads: a single S3 object is represented as a bunch of FrostFS objects. The limit is up to 10 000 parts.
Right now the S3 Gateway removes a multipart object inefficiently: it calls object.Delete for every part sequentially. The alternative is to collect one big tombstone object and store it. However, @mbiryukova noticed that the object.Put operation for such a tombstone may take quite some time. The calculations may be inaccurate, but in a virtual environment a tombstone with 1024 addresses took almost a minute (50 s). It would be nice to check it independently in this issue.

I've decided to write a benchmark for the (*StorageEngine).Inhume method. I ran it for 1, 10, 100, 1000, and 10 000 addresses in a tombstone, and I collected a CPU profile and a memory profile for the run with 10 000 addresses. I attached flame charts for each profile and highlighted the most interesting places. Feel free to criticize my methodology and share your thoughts, as that could help improve it.

Source code link
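For illustration, here is a minimal, self-contained sketch of the benchmark shape: one sub-benchmark per tombstone size, from 1 to 10 000 addresses. The fake engine and its inhume method are trivial stand-ins, not the frostfs-node API; the real benchmark targets (*StorageEngine).Inhume and lives behind the source code link above.

// engine_bench_test.go — illustrative only; fakeEngine stands in for the
// real storage engine used in the results below.
package engine

import (
	"fmt"
	"testing"
)

type fakeEngine struct{ graveyard map[string]struct{} }

func newFakeEngine() *fakeEngine {
	return &fakeEngine{graveyard: make(map[string]struct{})}
}

// inhume mimics marking every address of a tombstone as removed in the metabase.
func (e *fakeEngine) inhume(addrs []string) {
	for _, a := range addrs {
		e.graveyard[a] = struct{}{}
	}
}

func BenchmarkInhume(b *testing.B) {
	for _, n := range []int{1, 10, 100, 1000, 10000} {
		b.Run(fmt.Sprintf("%d addresses", n), func(b *testing.B) {
			addrs := make([]string, n)
			for i := range addrs {
				addrs[i] = fmt.Sprintf("container/object-%d", i)
			}
			e := newFakeEngine()

			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				e.inhume(addrs)
			}
		})
	}
}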
Benchmark results
Each storage engine had 100 shards. I tried to investigate whether the number of shards affected performance, but the impact was insignificant.
CPU profile (for 10 000 parts only)
Data: cpu_10000_parts.pb.gz
1. An entire chart for the benchmark
2. Zoom on (*StorageEngine).Inhume
3. Zoom on (*StorageEngine).IsLocked
4. Zoom on (*StorageEngine).inhumeAddr
Memory profile (for 10 000 parts only)
Data: mem_10000_parts.pb.gz
1. An entire chart for the benchmark
2. Zoom on (*StorageEngine).Inhume
3. Zoom on (*StorageEngine).IsLocked
4. Zoom on (*StorageEngine).inhumeAddr
What I can see right now
Most of the time we're checking whether some of the objects that will be inhumed are locked (that chart). It's mostly because we convert each address to a string when starting a trace span (that chart), and we do it several times, in (*Shard).IsLocked and (*DB).IsLocked:

ctx, span := tracing.StartSpanFromContext(ctx, "Shard.IsLocked",
trace.WithAttributes(
attribute.String("shard_id", s.ID().String()),
attribute.String("address", addr.EncodeToString()),
))
_, span := tracing.StartSpanFromContext(ctx, "metabase.IsLocked",
trace.WithAttributes(
attribute.String("address", prm.addr.EncodeToString()),
))
Meanwhile, we don't do that in (*DB).Inhume:

_, span := tracing.StartSpanFromContext(ctx, "metabase.Inhume")
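A possible mitigation, sketched here with plain OpenTelemetry calls rather than the frostfs-node tracing wrapper, and with a made-up expensiveEncode standing in for addr.EncodeToString(): encode the address once and reuse the string for every span that needs it as an attribute. This assumes the encoded string can be computed once and passed to wherever the span is started; it is not the actual frostfs-node code.

package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// expensiveEncode stands in for addr.EncodeToString(), which the profile
// shows being called repeatedly for the same address.
func expensiveEncode(id int) string { return fmt.Sprintf("addr-%020d", id) }

func main() {
	tracer := otel.Tracer("example")
	ctx := context.Background()

	for addrID := 0; addrID < 3; addrID++ {
		addrStr := expensiveEncode(addrID) // convert once per address

		for shardID := 0; shardID < 3; shardID++ { // reuse it on every shard visit
			_, span := tracer.Start(ctx, "Shard.IsLocked",
				trace.WithAttributes(
					attribute.Int("shard_id", shardID),
					attribute.String("address", addrStr),
				))
			span.End()
		}
	}
}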
I collected an off-CPU profile, as advised by @dstepanov-yadro (thanks!). Here's what I found.
Off-CPU profile (for 10 000 parts only)
Data: off_cpu_10000_parts.pb.gz
1. Zoom on (*StorageEngine).Inhume
2. Zoom on (*DB).Inhume
My observations
According to this chart, the time distribution in (*StorageEngine).Inhume has changed significantly compared to the previous chart.

Let's take a look at this chart. We ran the benchmark 5 times with 10 000 addresses to inhume. If we call bbolt.(*DB).Batch in a loop with the default bbolt batch delay of 10 ms, the expected time would be about 5 * 10 000 * 0.01 s = 500 s. According to the chart, we spent about 0.15 h = 540 s in (*DB).Inhume, which indicates that we're using bbolt.(*DB).Batch inefficiently in this case.
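To make the arithmetic above concrete, here is a small, self-contained sketch (not the metabase code; the bucket and keys are made up) contrasting one bbolt Batch call per address, which in a sequential loop waits out MaxBatchDelay on every call, with a single write transaction covering the whole group.

package main

import (
	"log"
	"os"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	path := "inhume-demo.db"
	defer os.Remove(path)

	db, err := bolt.Open(path, 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	db.MaxBatchDelay = 10 * time.Millisecond // the default delay discussed above

	bucket := []byte("graveyard")
	if err := db.Update(func(tx *bolt.Tx) error {
		_, err := tx.CreateBucketIfNotExists(bucket)
		return err
	}); err != nil {
		log.Fatal(err)
	}

	addrs := make([][]byte, 100) // stand-ins for encoded object addresses
	for i := range addrs {
		addrs[i] = []byte{byte(i), byte(i >> 8)}
	}

	// Pattern 1: one Batch call per address, issued sequentially. Each call
	// sits alone in its batch and waits up to MaxBatchDelay before committing,
	// so N addresses cost roughly N * 10 ms even though the writes are tiny.
	start := time.Now()
	for _, a := range addrs {
		if err := db.Batch(func(tx *bolt.Tx) error {
			return tx.Bucket(bucket).Put(a, []byte{1})
		}); err != nil {
			log.Fatal(err)
		}
	}
	log.Println("per-address Batch:", time.Since(start))

	// Pattern 2: inhume the whole group in a single write transaction.
	start = time.Now()
	if err := db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucket)
		for _, a := range addrs {
			if err := b.Put(a, []byte{1}); err != nil {
				return err
			}
		}
		return nil
	}); err != nil {
		log.Fatal(err)
	}
	log.Println("single Update:", time.Since(start))
}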
Not necessarily. In isolation, yes, we do. In reality there are concurrent queries coming from users, so it is not obvious whether Batch is used inefficiently.

@a-savchuk Could you share the benchmark setup and code?
I added them in the first comment
Referenced: Inhume operation to improve speed with large object sets #1476
Referenced: StorageEngine's test utils #1491

Two solutions were proposed: using a worker pool, and grouping objects by shard combined with a worker pool. The main difference between the two solutions is performance and the amount of code that needs to change. For implementation details, please refer to #1476 (OUTDATED).
Using a worker pool
Data
Results
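For reference, a rough self-contained sketch of what the worker-pool variant could look like; this is only a guess at the shape of the change (a bounded pool of workers visiting shards concurrently), with stand-in types, not the code from #1476.

package main

import (
	"fmt"
	"sync"
)

type shard struct{ id int }

// inhume pretends to mark the given addresses as removed in this shard's metabase.
func (s *shard) inhume(addrs []string) {}

func main() {
	shards := make([]*shard, 100)
	for i := range shards {
		shards[i] = &shard{id: i}
	}
	addrs := []string{"addr-1", "addr-2", "addr-3"}

	const poolSize = 8
	jobs := make(chan *shard)
	var wg sync.WaitGroup

	// A bounded pool of workers processes the shards concurrently instead of
	// checking and inhuming on them one by one.
	for w := 0; w < poolSize; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for s := range jobs {
				s.inhume(addrs)
			}
		}()
	}

	for _, s := range shards {
		jobs <- s
	}
	close(jobs)
	wg.Wait()

	fmt.Printf("visited %d shards with %d workers\n", len(shards), poolSize)
}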
Grouping objects by shard and using a worker pool
Data
Results
According to this comment, one Inhume operation can block another when using a blocking worker pool. Since object grouping by itself already increases Inhume speed, let's use only that.

For implementation details, please refer to #1476.
Grouping objects by shard before Inhume
Data
Results
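To close, a minimal sketch of the "group by shard first" idea with hypothetical names (shardOf, inhumeOnShard), not the implementation from #1476: resolve the target shard once per address, then issue one bulk inhume per shard instead of one call per address.

package main

import "fmt"

type Address string
type ShardID int

// shardOf stands in for whatever placement logic the engine uses to pick
// the shard an address lives on (e.g. hashing against the shard list).
func shardOf(a Address, shardCount int) ShardID {
	sum := 0
	for _, c := range a {
		sum += int(c)
	}
	return ShardID(sum % shardCount)
}

// inhumeOnShard stands in for a single bulk metabase update on one shard.
func inhumeOnShard(id ShardID, addrs []Address) {
	fmt.Printf("shard %d: inhume %d addresses in one batch\n", id, len(addrs))
}

func main() {
	const shardCount = 4
	addrs := []Address{"obj-1", "obj-2", "obj-3", "obj-4", "obj-5"}

	// Group the tombstone's addresses by their target shard.
	groups := make(map[ShardID][]Address)
	for _, a := range addrs {
		id := shardOf(a, shardCount)
		groups[id] = append(groups[id], a)
	}

	// One bulk inhume per shard instead of one per-address call that may
	// visit every shard.
	for id, group := range groups {
		inhumeOnShard(id, group)
	}
}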