writecache: Fix race condition when reporting cache size metrics #1648

Merged
dstepanov-yadro merged 1 commit from a-savchuk/frostfs-node:fix-write-cache-object-count-metric into master 2025-02-20 06:15:39 +00:00
Member

There is a race condition when multiple cache operations try to report the cache size metrics simultaneously. Consider the following example:

  • the initial total number of objects stored in the cache is 2
  • flush worker X deletes an object and reads the cache size, which is 1
  • flush worker Y deletes an object and reads the cache size, which is 0
  • flush worker Y reports the cache size it learnt, which is 0
  • flush worker X reports the cache size it learnt, which is 1

As a result, the observed cache size is 1 (i.e. one object remains in the cache), which is incorrect because the actual cache size is 0.
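
A minimal standalone sketch of this read-then-report pattern (the names below are illustrative, not the actual writecache code); with two concurrent workers the reported value can end up at 1 while the actual count is 0:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var count int64 = 2 // actual number of objects in the cache
	var reported int64  // value last pushed to the metric

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v := atomic.AddInt64(&count, -1) // delete an object and read the new size
			// Another worker may report a smaller (newer) value between the read
			// above and the store below; the stale value then overwrites it.
			atomic.StoreInt64(&reported, v)
		}()
	}
	wg.Wait()
	fmt.Println("actual:", atomic.LoadInt64(&count), "reported:", atomic.LoadInt64(&reported))
}
```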

The behavior described above has been observed as shown in the screenshot below:
wc_count_bug_1.png

Also, this behavior is easily reproducible:
wc_count_bug_2.png

I also added a debug log to track the reported cache object count values; the problem is visible here:

{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 15}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 14}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 11}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 11}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 11}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 10}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 9}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 8}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 7}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 6}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 5}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 4}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 3}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 1}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 0}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 2}

To fix this, let's report the metrics periodically in the flush loop.
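
A rough sketch of the idea (the loop, interval, and stand-in types are illustrative; in the actual change the reporting would happen in the existing flush loop):

```go
package writecache

import (
	"context"
	"time"
)

type cache struct{} // minimal stand-in for the real cache type

// estimateCacheSize stands in for the existing method that reads the
// counters and updates the size and object count metrics.
func (c *cache) estimateCacheSize() {}

// Only this goroutine writes the metric, so a stale value can no longer
// overwrite a newer one.
func (c *cache) reportMetricsLoop(ctx context.Context) {
	ticker := time.NewTicker(time.Second) // illustrative interval
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			c.estimateCacheSize()
		}
	}
}
```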

a-savchuk added the
bug
label 2025-02-18 10:39:01 +00:00
a-savchuk added 1 commit 2025-02-18 10:39:02 +00:00
[#xx] writecache: Fix race condition when reporting cache size metrics
Some checks failed
DCO action / DCO (pull_request) Failing after 34s
Vulncheck / Vulncheck (pull_request) Successful in 1m0s
Build / Build Components (pull_request) Successful in 1m30s
Tests and linters / Run gofumpt (pull_request) Successful in 1m22s
Pre-commit hooks / Pre-commit (pull_request) Successful in 1m40s
Tests and linters / Staticcheck (pull_request) Successful in 2m30s
Tests and linters / Tests (pull_request) Successful in 2m33s
Tests and linters / gopls check (pull_request) Successful in 2m54s
Tests and linters / Tests with -race (pull_request) Successful in 3m9s
Tests and linters / Lint (pull_request) Successful in 3m26s
f3c141de41
There is a race condition when multiple cache operations try to report
the cache size metrics simultaneously. Consider the following example:
- the initial total number of objects stored in the cache is 2
- worker X deletes an object and reads the cache size, which is 1
- worker Y deletes an object and reads the cache size, which is 0
- worker Y reports the cache size it learnt, which is 0
- worker X reports the cache size it learnt, which is 1

As a result, the observed cache size is 1 (i. e. one object remains
in the cache), which is incorrect because the actual cache size is 0.

To fix this, a separate worker for reporting the cache size metric has
been created. All operations should use a queue (a buffered channel) to
request the reporter worker to report the metrics. Currently, all queue
writes are non-blocking.

Signed-off-by: Aleksey Savchuk <a.savchuk@yadro.com>
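
A self-contained sketch of the approach described above; initSizeMetricsReporter and sizeMetricsReporterQueue appear in the diff below, while the other names, the queue size value, and the stand-in types are illustrative:

```go
package writecache

import "context"

const sizeMetricsReporterQueueSize = 1000 // illustrative value

// Minimal stand-in for the real cache type; only what the sketch needs.
type cache struct {
	sizeMetricsReporterQueue chan struct{}
}

// estimateCacheSize stands in for the existing method that reads the
// counters and updates the metrics.
func (c *cache) estimateCacheSize() {}

func (c *cache) initSizeMetricsReporter() {
	c.sizeMetricsReporterQueue = make(chan struct{}, sizeMetricsReporterQueueSize)
}

// Operations request a report without blocking: if the queue is full,
// a report is already pending, so dropping the signal loses nothing.
func (c *cache) requestSizeMetricsReport() {
	select {
	case c.sizeMetricsReporterQueue <- struct{}{}:
	default:
	}
}

// A single reporter goroutine drains the queue, so the metric is written
// by exactly one goroutine and stale reads cannot win the race.
func (c *cache) runSizeMetricsReporter(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-c.sizeMetricsReporterQueue:
			c.estimateCacheSize()
		}
	}
}
```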
a-savchuk force-pushed fix-write-cache-object-count-metric from f3c141de41 to b28e45f5a4 2025-02-18 10:43:08 +00:00 Compare
a-savchuk reviewed 2025-02-18 12:19:04 +00:00
@@ -99,2 +105,4 @@
}

func (c *cache) initSizeMetricsReporter() {
	c.sizeMetricsReporterQueue = make(chan struct{}, sizeMetricsReporterQueueSize)
Author
Member

I suggest using a buffered channel rather than an atomic/mutex. If you think differently, feel free to argue. Also, let me know if you don't agree on the channel size. I think it can be calculated from the cache parameters, for example like this:

queueSize = min(maxQueueSize, max(maxCacheSize/maxObjectSize, maxCacheCount))
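For illustration, that suggestion as a hypothetical helper (the parameter names are placeholders, not actual writecache config fields):

```go
// queueSize derives a bound for the reporter queue from the cache limits.
// Assumes maxObjectSize > 0; requires Go 1.21+ for the built-in min/max.
func queueSize(maxQueueSize, maxCacheSize, maxObjectSize, maxCacheCount uint64) uint64 {
	return min(maxQueueSize, max(maxCacheSize/maxObjectSize, maxCacheCount))
}
```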
a-savchuk marked this conversation as resolved
a-savchuk changed title from WIP: writecache: Fix race condition when reporting cache size metrics to writecache: Fix race condition when reporting cache size metrics 2025-02-18 12:31:42 +00:00
requested reviews from storage-core-committers, storage-core-developers 2025-02-18 12:31:43 +00:00
a-savchuk changed title from writecache: Fix race condition when reporting cache size metrics to WIP: writecache: Fix race condition when reporting cache size metrics 2025-02-18 13:11:55 +00:00
a-savchuk force-pushed fix-write-cache-object-count-metric from b28e45f5a4 to 840fc3a31b 2025-02-19 11:45:06 +00:00 Compare
a-savchuk force-pushed fix-write-cache-object-count-metric from 840fc3a31b to 8f776b2f41 2025-02-19 12:10:01 +00:00 Compare
a-savchuk removed the
bug
label 2025-02-19 12:32:33 +00:00
a-savchuk changed title from WIP: writecache: Fix race condition when reporting cache size metrics to writecache: Fix race condition when reporting cache size metrics 2025-02-19 13:49:08 +00:00
dstepanov-yadro requested changes 2025-02-19 14:03:36 +00:00
Dismissed
@@ -52,4 +52,2 @@
	)
	c.metrics.Evict(StorageTypeFSTree)
	// counter changed by fstree
	c.estimateCacheSize()

These c.estimateCacheSize() calls do not need to be deleted.
If there are many objects in the cache, the counters may not be updated for a very long time, because iteration over all objects can take tens of minutes.

Author
Member

Fixed

dstepanov-yadro marked this conversation as resolved
a-savchuk force-pushed fix-write-cache-object-count-metric from 8f776b2f41 to 02f3a7f65c 2025-02-19 14:06:59 +00:00 Compare
dstepanov-yadro approved these changes 2025-02-19 14:57:51 +00:00
requested reviews from storage-core-committers, storage-core-developers 2025-02-19 18:36:21 +00:00
aarifullin approved these changes 2025-02-20 05:54:07 +00:00
dstepanov-yadro merged commit 02f3a7f65c into master 2025-02-20 06:15:39 +00:00
Reference: TrueCloudLab/frostfs-node#1648