Try using badger for the write-cache #421
No description provided.
Last time we touched it, it was losing data on hard resets. The new release may be better, though.
It would be good to clarify how exactly it was losing data. I suppose you mean that when the process was stopped, some writes were lost because they had not been flushed to disk yet? If that's the case, it's kinda expected; LevelDB has a sync mode for exactly that purpose.
If not, was there an open issue about it where we can track whether they did something about it?
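For reference, badger exposes a similar knob. A minimal sketch of forcing synchronous writes (assuming the `github.com/dgraph-io/badger/v4` import path; the data directory and keys are made up for the example):

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// WithSyncWrites(true) makes every commit fsync the value log, trading
	// throughput for durability across hard resets (the analogue of the
	// LevelDB sync mode mentioned above).
	opts := badger.DefaultOptions("/tmp/writecache-badger").WithSyncWrites(true)

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Once Update returns without error, the entry should survive a power loss.
	err = db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("object-key"), []byte("object-payload"))
	})
	if err != nil {
		log.Fatal(err)
	}
}
```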
Here are the write load test results for the bbolt-backed writecache (from the master branch) and the badger-backed writecache (#462):
Badger seems to be worse for the 64KiB case.
Can you explain this? The median is lower, but p(90) is significantly higher, and I am wondering why: for a 10-minute test that seems significant.
I think it's not due to the object size specifically, but to VU=1: it basically causes writes to happen serially, which is not that fast by design (more info here). Tests with more VUs, however, run out of disk space on the test hardware.
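To illustrate the VU=1 effect outside of k6: with a single writer every commit waits for the previous fsync to finish, while concurrent writers let commits overlap. A rough Go sketch (not from the issue; the path, key layout and object counts are made up):

```go
package main

import (
	"bytes"
	"log"
	"sync"
	"time"

	badger "github.com/dgraph-io/badger/v4"
)

// bench writes perWriter 64KiB objects from each of `writers` goroutines
// and returns the wall-clock time taken.
func bench(db *badger.DB, writers, perWriter int) time.Duration {
	payload := bytes.Repeat([]byte{0xAB}, 64<<10) // 64 KiB, as in the test

	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < writers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWriter; i++ {
				key := []byte{byte(w), byte(i >> 8), byte(i)}
				if err := db.Update(func(txn *badger.Txn) error {
					return txn.Set(key, payload)
				}); err != nil {
					log.Println("write:", err)
				}
			}
		}(w)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/vu-test").WithSyncWrites(true))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	log.Println("1 writer: ", bench(db, 1, 200)) // serial: one commit at a time
	log.Println("8 writers:", bench(db, 8, 25))  // concurrent: commits overlap
}
```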
I reran the tests using the correct volumes:
One of the downsides of rewriting the writecache this way, rather than just replacing the underlying store as in #454, is that it is now harder to gauge the concrete cause of improvements or regressions: it could be badger itself, but the key encoding, the removal of fstree, and so on could also have an effect.
Were metrics collected during the tests? It is unclear whether this bbolt degradation is permanent, or whether there was some kind of stop-the-world pause.
(I assume you mean badger degradation.)
Unfortunately I didn't collect the metrics, but I would also like to gather some data with time information. I don't think badger has a stop-the-world GC strategy, but still (it runs GC every minute).
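For context, badger's value-log GC is an explicit call rather than a background collector, so "runs GC every minute" usually means a loop like the one below. This is only a sketch; the interval, discard ratio and data directory are illustrative:

```go
package main

import (
	"log"
	"time"

	badger "github.com/dgraph-io/badger/v4"
)

// runGC periodically compacts badger's value log until stop is closed.
func runGC(db *badger.DB, stop <-chan struct{}) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			// RunValueLogGC rewrites at most one value-log file per call and
			// returns ErrNoRewrite when nothing was reclaimed, so retry in a
			// small inner loop.
			for {
				if err := db.RunValueLogGC(0.5); err != nil {
					if err != badger.ErrNoRewrite {
						log.Println("value log GC:", err)
					}
					break
				}
			}
		}
	}
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/writecache-badger"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stop := make(chan struct{})
	go runGC(db, stop)

	// ... serve writecache traffic here; close(stop) on shutdown.
	time.Sleep(2 * time.Minute)
	close(stop)
}
```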
We tried, but the work wasn't finished and the support cost was non-negligible, so it was dropped in #887.
A Bitcask implementation seems like a better choice (refs #654); it also doesn't suffer from all the "panic when disk is removed" problems, because it doesn't use mmap.