Blobovnicza tree rebuild (support) #703
Reference: TrueCloudLab/frostfs-node#703
No description provided.
Delete branch "dstepanov-yadro/frostfs-node:feat/shard_migrator_support_v0.37"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Relates #661
Relates #698
Force-pushed from 1105bec70a to 9bf021ea1f
Force-pushed from 9bf021ea1f to 74f2ee699a
Force-pushed from 839e08c489 to 87548e3a4d
Force-pushed from 87548e3a4d to 8f9f1de48c
Force-pushed from 8f9f1de48c to 79ad3f767f
Force-pushed from 79ad3f767f to a323c175e7
Force-pushed from dfc3db24ab to 9b6c5004fa
Changed title from "WIP: Blobovnicza tree rebuild (support)" to "Blobovnicza tree rebuild (support)"
Force-pushed from 9b6c5004fa to c623b13a3a
Tests with -race time out, could there be a deadlock?
@ -0,0 +72,4 @@
}
}
func TestObjectsAvailableAfterDepthAndWithEdit(t *testing.T) {
Typo: With -> Width.
fixed
Force-pushed from c623b13a3a to 61e8a0cf31
Force-pushed from 60a68e1b4b to 16d157154c
Force-pushed from 16d157154c to 1edcbe9865
Force-pushed from 1edcbe9865 to f999a36880
Checked locally, it's ok.
@ -0,0 +59,4 @@
return nil
}
if err := bucket.Delete(key); err != nil {
Maybe also delete the bucket if it is empty?
fixed
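For context, a minimal sketch of what the suggested cleanup could look like with bbolt; deleteKeyAndMaybeBucket and the flat bucket layout are hypothetical, not the PR's actual code:

```go
package example

import "go.etcd.io/bbolt"

// deleteKeyAndMaybeBucket deletes a key and, if the bucket is left empty,
// drops the bucket itself, as the review comment suggests.
func deleteKeyAndMaybeBucket(db *bbolt.DB, bucketName, key []byte) error {
	return db.Update(func(tx *bbolt.Tx) error {
		bucket := tx.Bucket(bucketName)
		if bucket == nil {
			return nil // nothing to delete
		}
		if err := bucket.Delete(key); err != nil {
			return err
		}
		// An empty first cursor position means the bucket holds no keys
		// (nested buckets would still appear here with a nil value).
		if k, _ := bucket.Cursor().First(); k == nil {
			return tx.DeleteBucket(bucketName)
		}
		return nil
	})
}
```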
@ -0,0 +83,4 @@
return bucket.ForEach(func(k, v []byte) error {
var addr oid.Address
storageID := make([]byte, len(v))
if e := addressFromKey(&addr, k); e != nil {
Why not err?
fixed
@ -23,3 +23,2 @@
func (db *activeDB) Path() string {
return db.shDB.Path()
func (db *activeDB) SystemPath() string {
Why this change?
blobovniczatree uses path as an internal path inside the tree (like 0/0/0.db), but activeDB returns the system path (like /storage/blobovnicza/0/0/0.db). So I think SystemPath() is more understandable.
@ -60,2 +62,4 @@
}
func (c *dbCache) EvictAndMarkNonCached(path string) {
c.pathLock.Lock(path) //order important
Can we move ordering comments to the field definitions? Inline comments could easily be missed when writing new code.
fixed
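A sketch of the agreed convention, assuming a cache with two plain mutexes (names are illustrative): the ordering rule lives on the field definitions, so call sites only reference it.

```go
package example

import "sync"

// dbCache illustrates documenting lock order at the field definitions.
type dbCache struct {
	// Lock order: pathMtx is always acquired BEFORE cacheMtx.
	// Acquiring them in the opposite order may deadlock.
	pathMtx  sync.Mutex // taken first
	cacheMtx sync.Mutex // taken second
}

func (c *dbCache) evictAndMarkNonCached(path string) {
	c.pathMtx.Lock() // order per the field docs above
	defer c.pathMtx.Unlock()
	c.cacheMtx.Lock()
	defer c.cacheMtx.Unlock()
	_ = path // ... evict the entry for path ...
}
```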
@ -38,0 +53,4 @@
eg, egCtx := errgroup.WithContext(ctx)
eg.SetLimit(b.blzInitWorkersCount)
visited := make(map[string]bool)
Why not map[string]struct{}?
fixed
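For context, the idiom behind the nit: the empty struct occupies zero bytes, so map[string]struct{} states "membership only" where map[string]bool suggests the value matters. A runnable sketch:

```go
package main

import "fmt"

func main() {
	// map[string]struct{} as a set: struct{}{} carries no data,
	// so only membership is stored.
	visited := make(map[string]struct{})
	visited["0/0/0.db"] = struct{}{}

	if _, ok := visited["0/0/0.db"]; ok {
		fmt.Println("already visited")
	}
}
```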
@ -40,0 +94,4 @@
}
defer shBlz.Close()
b.log.Debug(logs.BlobovniczatreeBlobovniczaSuccessfullyInitializedClosing, zap.String("id", p))
SuccessfullyInitialized, but we only call Open without Init, is this expected?
shBlz is a wrapper over blobovnicza, so shBlz.Open() calls blobovnicza's Open() and Init().
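A sketch of the wrapper behaviour described in the reply (types and names are illustrative, not the actual sharedDB code): Open on the wrapper covers both steps, which is why the "successfully initialized" log line is accurate.

```go
package example

// blobovniczaLike stands in for the real blobovnicza's open/init surface.
type blobovniczaLike interface {
	Open() error
	Init() error
}

// sharedWrapper models shBlz: its Open both opens and initializes
// the underlying database.
type sharedWrapper struct {
	blz blobovniczaLike
}

func (w *sharedWrapper) Open() error {
	if err := w.blz.Open(); err != nil {
		return err
	}
	return w.blz.Init()
}
```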
@ -67,0 +145,4 @@
if err := b.addDBExtensionToDBs(filepath.Join(path, entry.Name()), depth+1); err != nil {
return err
}
} else {
I like continue + smaller indentation better than else, but I don't insist.
fixed
@ -42,12 +45,22 @@ func (b *Blobovniczas) Delete(ctx context.Context, prm common.DeletePrm) (res co
return common.DeleteRes{}, common.ErrReadOnly
}
if b.rebuildGuard.TryRLock() {
TryRLock is not recommended, I would like to avoid it if possible. Can we do the same with atomics?
I don't know how to do the same correctly with atomics. Delete calls can run concurrently with each other, but not with Rebuild. So I believe this is exactly the case where the use of TryLock is justified.
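A sketch of the locking scheme defended above, assuming an RWMutex guard (names are illustrative): Delete calls share the read lock, Rebuild takes the write lock, and TryRLock lets Delete fail fast instead of queueing behind a long rebuild.

```go
package example

import (
	"errors"
	"sync"
)

var errRebuildInProgress = errors.New("rebuild is in progress")

type store struct {
	rebuildGuard sync.RWMutex
}

func (s *store) Delete() error {
	// Fails only while Rebuild holds the write lock, so concurrent
	// Delete calls still proceed in parallel.
	if !s.rebuildGuard.TryRLock() {
		return errRebuildInProgress
	}
	defer s.rebuildGuard.RUnlock()
	// ... perform the delete ...
	return nil
}

func (s *store) Rebuild() {
	s.rebuildGuard.Lock() // excludes all Delete calls
	defer s.rebuildGuard.Unlock()
	// ... perform the rebuild ...
}
```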
@ -159,3 +169,1 @@
// iterator over the paths of Blobovniczas in random order.
func (b *Blobovniczas) iterateLeaves(ctx context.Context, f func(string) (bool, error)) error {
return b.iterateSortedLeaves(ctx, nil, f)
// iterateExistedDBPaths iterates over the paths of Blobovniczas without any order.
Existed or Existing?
fixed
@ -162,0 +175,4 @@
}
func (b *Blobovniczas) iterateExistedDBPathsDFS(ctx context.Context, path string, f func(string) (bool, error)) (bool, error) {
if path == "" {
Why not move this locking to the caller?
Fixed
@ -162,0 +182,4 @@
sysPath := filepath.Join(b.rootPath, path)
entries, err := os.ReadDir(sysPath)
if os.IsNotExist(err) && b.readOnly && path == "" { //non initialized tree in read only mode
Why do we NOT return an error in this case? Was it the case in the previous implementation?
DB files and dirs may be missing in case of read-only mode (hence the b.readOnly check).
@ -162,0 +215,4 @@
return false, nil
}
func (b *Blobovniczas) iterateSortedDBPaths(ctx context.Context, addr oid.Address, f func(string) (bool, error)) error {
iterateSortedDBPaths iterates over ALL databases, iterateSortedLeaves iterates over databases according to config. Am I correct?
Yes
@ -162,0 +220,4 @@
return err
}
func (b *Blobovniczas) iterateSorderDBPathsInternal(ctx context.Context, path string, addr oid.Address, f func(string) (bool, error)) (bool, error) {
s/Sorder/Sorted/ ?
Actually, Sorder looks like a nice abbreviation for "sorted order" :)
fixed
@ -162,0 +237,4 @@
var dbIdxs []uint64
var dirIdxs []uint64
for _, entry := range entries {
If we iterate during rebuild, the object could be iterated twice.
It doesn't seem like a big problem, but IMO it is worth mentioning somewhere in the docs.
@ -98,32 +106,108 @@ func (b *sharedDB) Close() {
}
b.refCount--
if b.refCount == 1 {
What scenario does this branch correspond to?
See: this is where the caller waits on the condition variable until there is only one reference left.
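A sketch of the scheme the reply describes, assuming a condition variable paired with the refcount (illustrative, not the actual sharedDB code): Close signals when only one reference remains, waking a waiter that wants to be the last holder.

```go
package example

import "sync"

type sharedDB struct {
	mtx      sync.Mutex
	cond     *sync.Cond
	refCount int
}

func newSharedDB() *sharedDB {
	db := &sharedDB{}
	db.cond = sync.NewCond(&db.mtx)
	return db
}

func (db *sharedDB) Close() {
	db.mtx.Lock()
	defer db.mtx.Unlock()
	db.refCount--
	if db.refCount == 1 {
		db.cond.Broadcast() // wake the waiter holding the last reference
	}
}

// closeAndWait blocks until the caller's reference is the only one left;
// only then is it safe to actually close the underlying database.
func (db *sharedDB) closeAndWait() {
	db.mtx.Lock()
	for db.refCount > 1 {
		db.cond.Wait()
	}
	db.mtx.Unlock()
}
```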
@ -127,0 +202,4 @@
}
func (m *levelDbManager) hasAnyDB() bool {
m.dbMtx.Lock()
Why is it Lock and not RLock?
fixed
@ -0,0 +417,4 @@
}
type addressMap struct {
data map[oid.Address]bool
Again, if we only store true here, what about map[oid.Address]struct{}?
fixed. It just shortens the code.
@ -0,0 +6,4 @@
semaphore chan struct{}
}
func newRebuildLimiter(workersCount uint32) *rebuildLimiter {
We already have errgroup and ants.Pool which allow limiting concurrent routines, why do we create yet another limiter here?
In general blobovniczatree.Rebuild requires some limiter from the caller, not a pool to run tasks. errgroup or ants.Pool don't look suitable for this case: it would be difficult to cancel operations, and rebuilds from different shards should not depend on each other's errors.
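A sketch of such a channel-based semaphore (method names are assumptions, not necessarily the PR's exact API): it only limits concurrency, is cancellable per caller, and shares no error state between independent rebuilds.

```go
package example

import "context"

type rebuildLimiter struct {
	semaphore chan struct{}
}

func newRebuildLimiter(workersCount uint32) *rebuildLimiter {
	return &rebuildLimiter{semaphore: make(chan struct{}, workersCount)}
}

// acquire blocks until a worker slot is free or the context is cancelled.
func (l *rebuildLimiter) acquire(ctx context.Context) error {
	select {
	case l.semaphore <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// release returns a worker slot to the semaphore.
func (l *rebuildLimiter) release() {
	<-l.semaphore
}
```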
@ -67,3 +64,1 @@
s.deleteObjectFromWriteCacheSafe(ctx, addr)
s.deleteFromBlobstorSafe(ctx, addr)
if err := s.deleteFromBlobstor(ctx, addr); err != nil {
Why do we no longer delete from the write-cache? It is not obvious from the commit message.
Updated the commit message with an explanation: objects will be deleted from the writecache after they are flushed to the blobstore.
@ -0,0 +30,4 @@
}
func (r *rebuilder) Start(ctx context.Context, bs *blobstor.BlobStor, mb *meta.DB, log *logger.Logger) {
r.mtx.Lock()
Is it possible to call this concurrently?
Anyway, why do we block here and not return immediately with "already started"?
This could (?) lead to Stop() stopping only the first started instance.
Actually no, the mutex is just foolproofing. I don't know what to do with such an error; the idea was just to maintain the correct state. Stop() stores nil to cancel.
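A sketch of the state-keeping described in the reply (illustrative; the real Start also takes the blobstor, metabase, and logger): the mutex only guards the cancel field, and Stop resets it to nil.

```go
package example

import (
	"context"
	"sync"
)

type rebuilder struct {
	mtx    sync.Mutex
	cancel context.CancelFunc
}

func (r *rebuilder) Start(ctx context.Context) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.cancel != nil {
		return // already started; the mutex is only foolproofing
	}
	ctx, cancel := context.WithCancel(ctx)
	r.cancel = cancel
	go func() {
		// ... rebuild work driven by ctx ...
		<-ctx.Done()
	}()
}

func (r *rebuilder) Stop() {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.cancel != nil {
		r.cancel()
		r.cancel = nil // "Stop() stores nil to cancel"
	}
}
```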
Force-pushed from f999a36880 to c9fc6ec535
Force-pushed from c9fc6ec535 to e0349eb50b
Force-pushed from e0349eb50b to db1e129883
Force-pushed from db1e129883 to d6b3af013d
Force-pushed from d6b3af013d to 0179b6ebe8
Force-pushed from 0179b6ebe8 to 5c3084163b
Force-pushed from 5c3084163b to 7641f1cacc
Force-pushed from 7641f1cacc to 425a6141ef
Force-pushed from 425a6141ef to 9f77e8311f
Force-pushed from 9f77e8311f to 9c3c4e2cb7
Force-pushed from 9c3c4e2cb7 to 18041273e0
Force-pushed from 21d6669161 to 5943df1efe
Force-pushed from e72dc0b286 to dc56e70587
Force-pushed from 5c1b09153a to fd4c91d96e
Force-pushed from fd4c91d96e to 589dddb6d4
Force-pushed from 32334a52cd to b2215fe640
Force-pushed from 07909edb8d to 2552db1eb4
Force-pushed from 2552db1eb4 to 660e8a8002
Force-pushed from 660e8a8002 to 9de673c37a
Force-pushed from 9de673c37a to d875842aeb
Force-pushed from d875842aeb to 7569671c66
Force-pushed from 7569671c66 to 9769b2e4ec
Force-pushed from 9769b2e4ec to 60df32379f
Force-pushed from c056bbb8e5 to 13a2b3b8b5
Force-pushed from 13a2b3b8b5 to 02450a9a16