writecache: Fix race condition when reporting cache size metrics #1648

Merged
dstepanov-yadro merged 1 commit from a-savchuk/frostfs-node:fix-write-cache-object-count-metric into master 2025-02-20 06:15:39 +00:00

1 commit

Author SHA1 Message Date
02f3a7f65c
[#1648] writecache: Fix race condition when reporting cache size metrics
All checks were successful
There is a race condition when multiple cache operations try to report
the cache size metrics simultaneously. Consider the following example:
- the cache initially holds 2 objects, so the cache size is 2
- worker X deletes an object and reads the cache size, which is 1
- worker Y deletes an object and reads the cache size, which is 0
- worker Y reports the cache size it learnt, which is 0
- worker X reports the cache size it learnt, which is 1

As a result, the observed cache size is 1 (i.e. one object remains
in the cache), which is incorrect because the actual cache size is 0.

To fix this, let's report the metrics periodically in the flush loop.

Signed-off-by: Aleksey Savchuk <a.savchuk@yadro.com>
2025-02-19 17:05:40 +03:00