writecache: Fix race condition when reporting cache size metrics #1648

Merged
dstepanov-yadro merged 1 commit from a-savchuk/frostfs-node:fix-write-cache-object-count-metric into master 2025-02-20 06:15:39 +00:00
Member

There is a race condition when multiple cache operations try to report the cache size metrics simultaneously. Consider the following example:

  • the initial total number of objects stored in the cache is 2
  • flush worker X deletes an object and reads the cache size, which is 1
  • flush worker Y deletes an object and reads the cache size, which is 0
  • flush worker Y reports the cache size it learnt, which is 0
  • flush worker X reports the cache size it learnt, which is 1

As a result, the observed cache size is 1 (i.e. one object remains in the cache), which is incorrect because the actual cache size is 0.
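
A minimal standalone sketch of this read-then-report pattern (the names below are illustrative, not the actual writecache code); with two concurrent workers the reported value can end up at 1 while the actual count is 0:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var count int64 = 2 // actual number of objects in the cache
	var reported int64  // value last pushed to the metric

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v := atomic.AddInt64(&count, -1) // delete an object and read the new size
			// Another worker may report a smaller (newer) value between the read
			// above and the store below; the stale value then overwrites it.
			atomic.StoreInt64(&reported, v)
		}()
	}
	wg.Wait()
	fmt.Println("actual:", atomic.LoadInt64(&count), "reported:", atomic.LoadInt64(&reported))
}
```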

The behavior described above has been observed as shown in the screenshot below:
wc_count_bug_1.png

Also, this behavior is easily reproducible:
wc_count_bug_2.png

I also added a debug log to track the reported cache object count values; the problem is visible here:

{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 15}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 14}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 11}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 11}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 11}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 10}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 9}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 8}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 7}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 6}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 5}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 4}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 3}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 1}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 0}
{"shard_id": "UBvw54XR4MR99EAMqRTwvu", "count": 2}

To fix this, let's report the metrics periodically in the flush loop.
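
A rough sketch of the idea (the loop, interval, and stand-in types are illustrative; in the actual change the reporting would happen in the existing flush loop):

```go
package writecache

import (
	"context"
	"time"
)

type cache struct{} // minimal stand-in for the real cache type

// estimateCacheSize stands in for the existing method that reads the
// counters and updates the size and object count metrics.
func (c *cache) estimateCacheSize() {}

// Only this goroutine writes the metric, so a stale value can no longer
// overwrite a newer one.
func (c *cache) reportMetricsLoop(ctx context.Context) {
	ticker := time.NewTicker(time.Second) // illustrative interval
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			c.estimateCacheSize()
		}
	}
}
```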

a-savchuk added the
bug
label 2025-02-18 10:39:01 +00:00
a-savchuk added 1 commit 2025-02-18 10:39:02 +00:00
[#xx] writecache: Fix race condition when reporting cache size metrics
Some checks failed
DCO action / DCO (pull_request) Failing after 34s
Vulncheck / Vulncheck (pull_request) Successful in 1m0s
Build / Build Components (pull_request) Successful in 1m30s
Tests and linters / Run gofumpt (pull_request) Successful in 1m22s
Pre-commit hooks / Pre-commit (pull_request) Successful in 1m40s
Tests and linters / Staticcheck (pull_request) Successful in 2m30s
Tests and linters / Tests (pull_request) Successful in 2m33s
Tests and linters / gopls check (pull_request) Successful in 2m54s
Tests and linters / Tests with -race (pull_request) Successful in 3m9s
Tests and linters / Lint (pull_request) Successful in 3m26s
f3c141de41
There is a race condition when multiple cache operations try to report
the cache size metrics simultaneously. Consider the following example:
- the initial total number of objects stored in the cache is 2
- worker X deletes an object and reads the cache size, which is 1
- worker Y deletes an object and reads the cache size, which is 0
- worker Y reports the cache size it learnt, which is 0
- worker X reports the cache size it learnt, which is 1

As a result, the observed cache size is 1 (i. e. one object remains
in the cache), which is incorrect because the actual cache size is 0.

To fix this, a separate worker for reporting the cache size metric has
been created. All operations should use a queue (a buffered channel) to
request the reporter worker to report the metrics. Currently, all queue
writes are non-blocking.

Signed-off-by: Aleksey Savchuk <a.savchuk@yadro.com>
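
A self-contained sketch of the approach described above; initSizeMetricsReporter and sizeMetricsReporterQueue appear in the diff below, while the other names, the queue size value, and the stand-in types are illustrative:

```go
package writecache

import "context"

const sizeMetricsReporterQueueSize = 1000 // illustrative value

// Minimal stand-in for the real cache type; only what the sketch needs.
type cache struct {
	sizeMetricsReporterQueue chan struct{}
}

// estimateCacheSize stands in for the existing method that reads the
// counters and updates the metrics.
func (c *cache) estimateCacheSize() {}

func (c *cache) initSizeMetricsReporter() {
	c.sizeMetricsReporterQueue = make(chan struct{}, sizeMetricsReporterQueueSize)
}

// Operations request a report without blocking: if the queue is full,
// a report is already pending, so dropping the signal loses nothing.
func (c *cache) requestSizeMetricsReport() {
	select {
	case c.sizeMetricsReporterQueue <- struct{}{}:
	default:
	}
}

// A single reporter goroutine drains the queue, so the metric is written
// by exactly one goroutine and stale reads cannot win the race.
func (c *cache) runSizeMetricsReporter(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-c.sizeMetricsReporterQueue:
			c.estimateCacheSize()
		}
	}
}
```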
a-savchuk force-pushed fix-write-cache-object-count-metric from f3c141de41 to b28e45f5a4 2025-02-18 10:43:08 +00:00 Compare
a-savchuk reviewed 2025-02-18 12:19:04 +00:00
@@ -99,2 +105,4 @@
}

func (c *cache) initSizeMetricsReporter() {
	c.sizeMetricsReporterQueue = make(chan struct{}, sizeMetricsReporterQueueSize)
Author
Member

I suggest using a buffered channel rather than an atomic/mutex. If you think differently, feel free to argue. Also, let me know if you don't agree on the channel size. I think it can be calculated from the cache parameters, for example like this:

queueSize = min(maxQueueSize, max(maxCacheSize/maxObjectSize, maxCacheCount))
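For illustration, that suggestion as a hypothetical helper (the parameter names are placeholders, not actual writecache config fields):

```go
// queueSize derives a bound for the reporter queue from the cache limits.
// Assumes maxObjectSize > 0; requires Go 1.21+ for the built-in min/max.
func queueSize(maxQueueSize, maxCacheSize, maxObjectSize, maxCacheCount uint64) uint64 {
	return min(maxQueueSize, max(maxCacheSize/maxObjectSize, maxCacheCount))
}
```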
a-savchuk marked this conversation as resolved
a-savchuk changed title from WIP: writecache: Fix race condition when reporting cache size metrics to writecache: Fix race condition when reporting cache size metrics 2025-02-18 12:31:42 +00:00
requested reviews from storage-core-committers, storage-core-developers 2025-02-18 12:31:43 +00:00
a-savchuk changed title from writecache: Fix race condition when reporting cache size metrics to WIP: writecache: Fix race condition when reporting cache size metrics 2025-02-18 13:11:55 +00:00
a-savchuk force-pushed fix-write-cache-object-count-metric from b28e45f5a4 to 840fc3a31b 2025-02-19 11:45:06 +00:00 Compare
a-savchuk force-pushed fix-write-cache-object-count-metric from 840fc3a31b to 8f776b2f41 2025-02-19 12:10:01 +00:00 Compare
a-savchuk removed the
bug
label 2025-02-19 12:32:33 +00:00
a-savchuk changed title from WIP: writecache: Fix race condition when reporting cache size metrics to writecache: Fix race condition when reporting cache size metrics 2025-02-19 13:49:08 +00:00
dstepanov-yadro requested changes 2025-02-19 14:03:36 +00:00
Dismissed
@@ -52,4 +52,2 @@
	)
	c.metrics.Evict(StorageTypeFSTree)
	// counter changed by fstree
	c.estimateCacheSize()

These c.estimateCacheSize() calls do not need to be deleted.
If there are many objects in the cache, the counters may not be updated for a very long time, because iteration over all objects can take tens of minutes.

Author
Member

Fixed

dstepanov-yadro marked this conversation as resolved
a-savchuk force-pushed fix-write-cache-object-count-metric from 8f776b2f41 to 02f3a7f65c 2025-02-19 14:06:59 +00:00 Compare
dstepanov-yadro approved these changes 2025-02-19 14:57:51 +00:00
requested reviews from storage-core-committers, storage-core-developers 2025-02-19 18:36:21 +00:00
aarifullin approved these changes 2025-02-20 05:54:07 +00:00
dstepanov-yadro merged commit 02f3a7f65c into master 2025-02-20 06:15:39 +00:00
Reference: TrueCloudLab/frostfs-node#1648