LRU `Peek`/`Contains` take LRU mutex _inside_ of a `View` transaction.
`View` transaction itself takes `mmapLock` [1], which is lifted after tx
finishes (in `tx.Commit()` -> `tx.close()` -> `tx.db.removeTx`)
When we evict items from LRU cache mutex order is different:
first we take LRU mutex and then execute `Batch` which _does_ take
`mmapLock` in case we need to remap. Thus the deadlock.
[1] 8f4a7e1f92/db.go (L708)
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
To achieve high performance we must choose proper values for both
batch size and delay. For user operations we want to set low delay.
However it would prevent tree synchronization operations to form big
enough batches. For these operations, batching gives the most benefit
not only in terms of on-CPU execution cost, but also by speeding up
transaction persist (`fsync`).
In this commit we try merging batches that are already
_triggered_, but not yet _started to execute_. This way we can still
query batches for execution after the provided delay while also allowing
multiple formed batches to execute faster.
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
"Object is expired" means that object is presented in `meta` but it is not
`ObjectNotFound` error. Previous implementation made `shard` search for an
object without `meta` which was an error.
Signed-off-by: Pavel Karpy <p.karpy@yadro.com>
```
name old time/op new time/op delta
_addressFromString-8 1.25µs ±30% 1.02µs ± 6% -18.49% (p=0.000 n=9+9)
name old alloc/op new alloc/op delta
_addressFromString-8 352B ± 0% 256B ± 0% -27.27% (p=0.000 n=9+10)
name old allocs/op new allocs/op delta
_addressFromString-8 6.00 ± 0% 4.00 ± 0% -33.33% (p=0.000
n=10+10)
```
Also, assure compiler that `s` doesn't escape:
Before this commit:
```
./fstree.go:74:24: leaking param: s
./fstree.go:90:6: moved to heap: addr
```
After this commit:
```
./fstree.go:74:24: s does not escape
```
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Currently we track based on `PayloadSize`, because it is already stored
in the metabase and it is easier to calculate without slowing down the
whole system.
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Signed-off-by: Pavel Karpy <p.karpy@yadro.com>
Currently the only way to tell whether `evacuate/set-mode` is finished
is to set a very big timeout and _hope_ that the operation will finish.
In this commit we add INFO logs for such operations which should
simplify the life of an administrator.
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
1. Reduce allocations inside transactions.
2. Do not encode container ID to string: it allocates a lot and takes more
space.
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Under high load we are limited by the _amount_ of keys we need to update
in a single transaction. In this commit we try storing all state
with a single key.
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
In the previous implementation any non-nil error that preceded object
fetching from blobstor led to iterating over every storage (in other words,
no storage ID information was taken into account). Now storage ID is
skipped only if metabase (storage ID source) returns any error.
Signed-off-by: Pavel Karpy <p.karpy@yadro.com>
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
It should be similar to a `TreeAddByPath`. `applyOperation` is used for
`Apply` when the operation can be inserted in the middle of a log.
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Under load changing shard mode can lead to it being removed from the
list during some other PUT.
```
Dec 28 07:01:26 az neofs-node[364505]: panic: runtime error: invalid memory address or nil pointer dereference
Dec 28 07:01:26 az neofs-node[364505]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc9fbb1]
Dec 28 07:01:26 az neofs-node[364505]: goroutine 11791912 [running]:
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).putToShard(0xc000435490, {0xc0003f7a28?, 0xc0001192c0?}, 0x2, {0x0, 0x>
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/put.go:91 +0x1b1
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).put.func1(0xc000435490?, {0xc0003f7a28?, 0xc0001192c0?})
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/put.go:71 +0x19c
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).iterateOverSortedShards(0x1?, {{0x62, 0x23, 0xfe, 0x60, 0x67, 0xd5, 0x>
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/shards.go:225 +0xc8
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).put(0xc000435490, {0x1?})
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/put.go:66 +0x2a9
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).Put.func1()
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/put.go:43 +0x2a
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).execIfNotBlocked(0x8?, 0x38?)
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/control.go:147 +0xcf
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).Put(0xc4df775a80?, {0x0?})
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/put.go:42 +0x65
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.Put(0xc06d928b80?, 0xc06b1b8dc8?)
Dec 28 07:01:26 az neofs-node[364505]: github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/put.go:158 +0x19
Dec 28 07:01:26 az neofs-node[364505]: main.engineWithoutNotifications.Put({0x20301b?}, 0x20301b?)
```
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Because synchronization _most likely_ will have apply already existing
operations, it is much faster to check their presence in a read
transaction. However, always doing this will degrade the perfomance
for normal `Apply`. And, let's be honest, it is already not good.
Thus we add a separate parameter which specifies whether this logic is
enabled.
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Both `meta` and `write-cache` are expected to have a fast underlying disk,
so it does not seem like an optimisation. Moreover, `write-cache`'s `Head`
is a `Get` with payload cutting, it _must_ use more memory for no reason
(`meta` was created for such requests). Also, `write-cache` does not allow
performing any "meta" relations checks (such as locking, tombstoning).
Signed-off-by: Pavel Karpy <p.karpy@yadro.com>
Node children are not sorted and could occur in any order.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
Includes:
1. mode change read lock operation in every exported method that r/w the
underlying database;
2. returning `ErrDegradedMode` logical error if any exported method is
called in degraded (without a metabase) mode.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Includes extending listing methods in the Storage Engine with object types.
It allows tuning replication/policer algorithms: container nodes do
not remove `LOCK` objects as redundant and try to fulfill `LOCK` placement
on the ohter container nodes.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
It allows keeping all the locked objects safe after metabase
resynchronization. Currently, all `LOCK` objects are broadcast to all nodes
in a container, it guarantees `LOCK` object presence in a regular situation.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
It was needed before we started to flush during transition to
`degraded` mode. Now it is confusing.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
It is not an error: removing virtual object is expected and should be just
skipped. Getting a virtual object with `raw` flag is considered as an
impossible action, all the virtual objects removals will be handled via
their children's removals implicitly.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Currently there is a possibility for modifying operations to fail
because of I/O errors and a new tree to be created on another shard.
This commit adds existence check for modifying operations.
Read operations remain as they are, not to slow things.
`TreeDrop` is an exception, because this is a tree removal and trying
multiple shards is not an unwanted behaviour.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
By default writecache puts the whole object to update storage ID.
This logic comes from the times when we needed to put objects
in the metabase by the writecache itself. Now this is done by the
blobstor at unmarshaling objects during flush only to update storage ID
is an overkill.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
All logic errors are wrapped in `logicerr.Logical` type and do not
affect shard error counter.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
In previous implementation `Shard.Delete` logged writecache's removal
failures in `error` level. There is a need to decrease severity of these
log records since they aren't critical and don't require individual
review.
Change level of the message to `info`.
Signed-off-by: Leonard Lyubich <ctulhurider@gmail.com>
From the `Bucket.ForEach` doc:
```
The provided function must not modify the bucket; this will result in undefined behavior.
```
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
After the refactor there are new storage characteristics: a type and
a general storage id (that could be stringified).
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Iterate over every shard and search for the container's trees. Final result
is a concatenation of shards' results. It is considered that one fixed tree
is placed on one fixed shard but the different trees of a fixed container
could be placed on different shards.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
In the 2nd version, there was a database format change: buckets have changed
their keys, so it becomes impossible to check the version in the 1 -> 2+
migrations because of different buckets that store info about the version.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
If shard ID is stored in metabase (it is not the first time boot), read it,
set it, use it (not a generated one) in the metrics writer.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Make it store its internal `zap.Logger`'s level. Also, make all the
components to accept internal `logger.Logger` instead of `zap.Logger`; it
will simplify future refactor.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Currently, when removing shard special care must be taken with respect
to shard numbering. `mode: disabled` allows to leave shard configuration
in place while also ignoring it during initialization. This makes
disk replacement much more convenient.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
Degraded mode allows us to operate without an SSD,
thus writecache should be unavailable in this mode.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
Add a common error for this case because it is not an error
which should increase error counter. Single error simplifies checks on
the call-site.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
Negative values have no sense. On the other hand it differs from the
blobovnicza's configuration and prevents unification.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Allow user to initiate flushing objects from a writecache.
We need this in 2 cases:
1. During writecache storage schema update, it should be flushed with
the old version of node and started clean with a new one.
2. During SSD replacement, to avoid data loss.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
Use a simple loop at the callsite. This way we remove as much as we can.
Also, `Delete` metrics is more meaningful now.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
- Meta now supports (and requires) inc/dec labeled counters
- The new logic counter is incremented on `Put` operations and is
decremented on `Inhume` and `Delete` operations that are performed on
_stored_ objects only
- Allow force counters sync. "Force" mode should be used on metabase resync
and should not be used on a regular meta start
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Re-configuration of Blobovnicza's object size limit must not affect
already stored objects. In previous implementation `Blobovnicza` didn't
see objects stored in buckets which became too big after size limit
re-configuration. For example, lets consider 1st configuration with 64KB
size limit for stored objects. Lets assume that we stored object of 64KB
size, and re-configured `Blobovnicza` with 32KB limit. After reboot
object should be still available, but actually it isn't. This is caused
by `Get` operation algorithm which iterates over configured size ranges
only, and doesn't process any other existing size bucket. By the way,
increasing of the object size limit didn't lead to the problem even in
previous implementation.
Make `Blobovnicza.Get` method to iterate over all size buckets
regardless of current configuration. This covers the described scenario.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Set flush mark in the inside the flush worker because writing to the blobstor
can fail. Because each evicted object must be deleted, it is reasonable
to do this in the evict callback.
The evict callback is protected by LRU mutex and thus potentially interferes
with `Get` and `Iterate` methods. This problem will be addressed in the
future.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
Maintain an invariant that any blobovnicza is present either
in `opened` or in `active` map. Otherwise, the logic becomes too
complicate because it is not obvious when we should close the blobovnicza.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
This check should occur on the shard level, but because
blobstor components expose `Open(readOnly bool)` interface,
it is reasonable to expect an error here.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
If the file doesn't exist, return `apistatus.ObjectNotFound`.
First check is still there as a shortcut.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
This tests check that each blobstor component behaves similarly when
same methods are being used. It is intended to serve as a specification
for all future components.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
Includes:
1. Renaming counter key to distinguish logical and physical objects
2. Version update dropping since changes could be done in a compatible way
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
1. Move compression parameters to the `shard` section.
2. Allow to use multiple sub-storage components in the blobstor.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
1. Remove in-memory cache. It doesn't persist objects and if we want
more speed, `NoSync` option can be used for the bolt DB.
2. Put to the metabase in a synchronous fashion. This considerably
simplifies overall logic and plays nicely with the metabase bolt DB
batch settings.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Allow to extend blobstor with more storage sub-systems. Currently
objects stored in the FSTree have empty byte slice descriptor and object
from blobovnicza tree have the same id as earlier. Each such change in
the identifier formation should be accompanied with metabase version
increase.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Close#1647.
Initially, `Sync: false` was provided because we can already lose
objects cached in memory. However, future changes in writecache will
remove inmemory cache and speed up it via other means.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
If an object is found in the Write-cache and is placed at the end of
the in-memory cache, the memory counter update operation tries to
dereference the index that is out of the sliced array. Moreover, even if
panic does not appear, the counter is updated with the wrong value.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Return all the objects on the empty common prefix search without search
optimizations that breaks boltDB logic.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
If an object has not been marked for removal by the GC in the current epoch
yet but has already expired, respond with `ErrObjectNotFound` api status.
Also, optimize shard iteration: a node must stop any iteration if the object
is found but gonna be removed soon.
All the checks are performed by the Metabase.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
After a4adb79db new logical error could be returned. Do not increase
error counter in this case.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
If metabase can't be opened in the default mode, try opening shard
first in `ReadOnly` mode and then in `DegradedReadOnly`.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
`Degraded` mode can be set by the administrator if needed.
Modifying operations in this mode can lead node into an inconsistent state
because metabase checks such as lock checking are not performed.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
There is a need to support working w/o shard if it has problems with
blobovnicza tree.
Make `BlobStor.Init` to return new `ErrInitBlobovniczas` error. Remove
shard from storage engine's shard set if it returned this error from
`Init` call. So if some of the shards (but not all) return this error,
the node will be able to continue working without them.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Reduce public interface of this package. Later each result will contain
an additional status, so it makes more sense to use the same functions
and result processing everywhere.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
1. Modifying operations are not expected to fail, unless the shard is
read-only.
2. `Get*` operations should increase error counter too, unless the
error is `ErrTreeNotFound`.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Do not return backend type from the service for now, because memory
backend is expected to vanish.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
The tricky part here is the engine itself: we stop iteration on
`ErrReadOnly` because it is better to synchronize the shard later than
to have partial trees stored in 2 shards.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Current implementation prevents invalid operations to become valid at
some later point (consider adding a child to the non-existent parent and
then adding the parent). This seems to diverge from the paper algorithm
and complicates implementation. Make it simpler.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Currently to find a node by path we iterate over all the children on
each level. This is far from optimal and scales badly with the number of
nodes on a single level. Thus we introduce "indexed attributes" for
which an additional information is stored and which can be use in
`*ByPath` operations. Currently this set only includes `FileName`
attribute but this may change in future.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Consider a node `{FileName: "dir", Attribute: "xxx"}`. In case we add
a new node by path `["dir", "file.txt"]`, create a new intermediate node
with a single attribute.
`GetByPath` now also considers only nodes with a single attribute while building a path.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
In this commit we implement algorithm for CRDT trees from
https://martin.klepmann.com/papers/move-op.pdf
Each tree is identified by the ID of a container it belongs to
and the tree name itself. Essentially, it is a sequence of operations
which should be applied in chronological order to get a usual tree
representation.
There are 2 backends for now: bbolt database and in-memory.
In-memory backend is here for debugging and will eventually act
as a memory-cache for the on-disk database.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Replace `ErrRangeOutOfBounds` error from `pkg/core/object` package with
`ObjectOutOfRange` from `apistatus` package. That error is returned by
storage node's server as NeoFS API statuses.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Shard is intended to be used as a separate failure domain,
which usually resides on a separate disk. Thus, sequential
initialization is bound by IO and this change speeds up thing a bit.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
The main problem is to distinguish the case of initial initialization
and update from version 0. We can't do this at `Open`, because of
`resync_metabase` flag. Thus, the following approach was taken:
1. During `Open` check whether the metabase was initialized.
2. Check for the version in `Init` or write the new one if the metabase
is new.
3. Update version in `Reset`.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
`log.With` is suitable during initialization, but in other places it induces
some overhead, even when branches with logging are not taken.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
If we should process address based on some condition, there is no need
to read file content in memory.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Currently we use `(*bbolt.Bucket).Stats().KeyN` for estimating database
size. However, it iterates over all pages in bucket and thus heavily
depends on the bucket size. This commit replaces initial size estimation
with a single `os.Stat` call.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Core changes:
* avoid package-colliding variable naming
* avoid using pointers to IDs where unnecessary
* avoid using `idSDK` import alias pattern
* use `EncodeToString` for protocol string calculation and `String` for
printing
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
- Delete objects physically on tombstone's arrival;
- Store information about tombstones in the Graveyard;
- Clear Graveyard every epoch based on the information about TS in the
network.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Add offset element to the iterations over deleted objects (both the
Graveyard and the Garbage buckets).
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
It allows storing information about object in both ways at the same time:
1. Metabase should know if an object is covered by a tombstone (that is
not expired yet);
2. It should be possible to physically delete objects covered by a
tombstone immediately (mark with GC) but keep tombstone knowledge.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Morph "NewEpoch" event handling was registered in a closure over
`addNewEpochNotificationHandler` func. That may lead to the data race:
if a shard was initialized before the event registration, everything works
as planned, but if registration was made earlier, it was not able to
include GC handlers since a shard has not called `eventChanInit` yet and,
therefore, it has not registered handler yet.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Also, remove optimization comments:
1. Having to maintain an execute the same logic for headers as for
objects is quite inefficient, as it increases memory footprint.
2. Unmarshaling object is a cheap operation if data slice is in memory.
3. For unmarshaling header-only, I think we need SDK support.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
`Degraded` mode is set automatically after error counter is over the
threshold. `ReadOnly` mode can still be set by an administrator.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
`Batch` can execute the function multiple times leading to multiple
increases of a size approximation.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
`apistatus` package provides types which implement build-in `error`
interface. Add `error of type` pattern when documenting these errors in
order to clarify how these errors should be handled (e.g. `errors.Is` is
not good).
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Replace `ErrNotFound`/`ErrAlreadyRemoved` error from
`pkg/core/object` package with `ObjectNotFound`/`ObjectAlreadyRemoved`
one from `apistatus` package. These errors are returned by storage
node's server as NeoFS API statuses.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
There is a need to process expired `LOCK` objects similar to `TOMBSTONE`
ones: we collect them on `Shard`, notify all other shards about
expiration so they could unlock the objects, and only after that mark
lockers as garbage.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Make `FormatValidator.ValidateContent` to verify payload of `LOCK`
objects. Pass locked objects to `Locker` interface. Require from
`Locker.Lock` to return `apistatus.IrregularObjectLock` error on a
corresponding condition.
Also add error return to `DeleteHandler.DeleteObjects` method. Require
from method to return `apistatus.ObjectLocked` error on a corresponding
condition. Adopt implementations.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
`Inhume` operation can potentially mark lockers as garbage. There is a
need to update locker list in locked bucket.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Make `StorageEngine.Delete` to forward first encountered
`apistatus.ObjectLocked` error during shard processing.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Make `StorageEngine.Inhume` to forward first encountered
`apistatus.ObjectLocked` error during shard processing.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Implement `StorageEngine.Lock` method which works similar to `Inhume`
but calls `Lock` on the processing shards.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Make `DB.Lock` to return `apistatus.IrregularObjectLock` if at least one
of the locked objects is irregular (not of type REGULAR).
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Make `DB.Inhume` to return `apistatus.ObjectLocked` if at least one of
the inhumed objects is locked.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Make `DB.IterateCoveredByTombstones` to not pass locked objects to the
handler. The method is used by GC, therefore it will not consider locked
objects as candidates for deletion even if their tombstone is expired.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Make `DB.IterateExpired` to not pass locked objects to the handler. The
method is used by GC, therefore it will not consider them as candidates
for deletion.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
After introduction of LOCK objects (of type `TypeLock`) complicated
extended its behavior:
* create `lockers` container bucket (LCB) during PUT;
* remove object from LCB during DELETE;
* look up object in LCB during EXISTS;
* get object from LCB during GET;
* list objects from LCB during LIST with cursor;
* select objects from LCB during SELECT with '*'.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Implement `DB.Lock` method which marks list of the objects as locked by
another object. Only regular objects can be locked.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Create class of container buckets with `LOCKED` suffix. Put identifiers
of the objects of type `LOCK` to these buckets during `DB.Put`.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Remove `Object` and `RawObject` types from `pkg/core/object` package.
Use `Object` type from NeoFS SDK Go library everywhere. Avoid using the
deprecated elements.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Similarly to `Get`. Also fix a bug where `ErrNotFound` is returned
instead of `ErrRangeOutOfBounds`.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Metabase is expected to contain actual information about objects stored
in shard. If the object is present in metabase but is missing from
blobstor, peform an additional attempt to fetch it directly without
consulting metabase. Such a situation is unexpected, so error counter
is increased for the shard which has the object in the metabase. We
don't increase error counter for the shard which has the object in
blobstor, because some garbage can be expected there. In this
implementation there is no overhead for objects which are really
missing, i.e. are not present in any metabase.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
There are certain errors which are not expected during usual node
operation and which tell us that something is wrong with the shard.
To prevent possible data corruption, move shard in read-only mode after
amount of errors exceeded some threshold. By default no actions are performed.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
If pre-existing blobovnicza is initialized, it's size should be updated
even if all buckets are in place.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
It was added back in 2fb379b7 when we had many shard modes. Now we have
only two and comments for constants are rather descriptive.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
We could also ignore errors during evacuate, but this requires
unmarshaling objects first which slowers the process considerably.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
In read-only mode modifying operations are immediately returned with
error and all background operations are suspended.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Dump contains magic and a list of objects prefixed by object size in bytes.
We can't use proto-marshaled list because this requires having all dump
in memory. Using TAR induces 512 byte overhead for each object which can
be a problem in some cases.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
For some data compression makes little sense, as it is already compressed.
This commit allows to leave such data unchanged based on `Content-Type`
attribute. Currently exact, prefix and suffix matching are supported.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Provide shard mode information via `DumpInfo()`. Delete atomic field from
Shard structure since it duplicates new field.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Shard's mode was not used in the Node, so added only two modes whose roles
are clear. More modes will be added in the future.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
Make `flushBigObjects` routine to mark objects which are written to
`BlobStor`. This prevents already flushed objects from being written on
the next iterator tick.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
For fullness estimation of `Blobovnicza` we use number of object stored
in each size bucket. In previous implementation we multiplied the number
by the difference in bucket boundaries. This expression rather
estimated the minimum volume (and for the smallest bucket, the maximum)
of objects in the bucket.
Multiply number of objects by mean bucket size.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Make `syncFullnessCounter` to accept `bbolt.Tx` argument of Bolt
transaction within which counter should be synchronized. Pass
corresponding transaction during `Init`.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
According to BoltDB documentation bucket `value is only valid for the
life of the transaction`.
Make `DB.IsSmall` copy value slice in order to prevent potential memory
corruptions (e.g. `runtime.stringtobyteslice` cast).
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
ListWithCursor allows listing physically stored objects
from metabase with small chunks. Cursor tracks last
processed object, therefore new chunks are returned
on each request.
Signed-off-by: Alex Vanin <alexey@nspcc.ru>
Make `BlockExecution` / `ResumeExecution` to not release per-shard worker
pools. Make `StorageEngine.Close` to block these methods and any
data-related operations. It is still releases the pools.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Use `sync.Once` to prevent locks of stopping GC. It will also allow to
safely call `Shard.Close` multiple times.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
There is a need to disable execution of local data operation on storage
engine in runtime. If storage engine ops are blocked, node will act like
always but all local object operations will be denied.
Implement `BlockExecution` / `ResumeExecution` methods on `StorageEngine`
which blocks / resumes the execution of data ops. Wait for the completion of
all operations executed at the time of the call. Return error passed to
`BlockExecution` from all data-related methods until `ResumeExecution` call.
Make `Close` to block operations as well.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
a1696a8 introduced some logic which in some situations prevented big objects
to be persisted in FSTree. In this commit a refactoring is done with the
goal of simplifying the code and also checking #866 issue.
1. Split a monstrous function into multiple simple ones: memory objects
can only be small and for writing through the cache we can do a dispatch
in `Put` itself.
2. Determine objects to be put in database before the actual update
as setting up a transaction has non-zero overhead.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Container listing should not ignore tombstone and
storage group objects which are not stored in
primary buckets.
Signed-off-by: Alex Vanin <alexey@nspcc.ru>
Some of the pools are initialized during config initialization,
so it isn't possible currently to release them in one place.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Make `StorageEngine` to use non-blocking worker pools with the same
(configurable) size for PUT operation. This allows you to switch to using
more free shards when overloading others, thereby more evenly distributing
the write load.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
We should be able to read whatever we have written earlier.
Compression setting applies only to the new objects.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
Do not log in options constructors. Also failure to
initialize compression module (possibly due to invalid options) is
certainly an error deserving proper treatment.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
There is a need to list addresses of the small objects stored in WriteCache
database.
Implement `IterateDB` function which accepts BoltDB instance and iterate
over all saved objects and passes their addresses to the hander.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
There is a need to open Blobovnicza instances in read-only mode in some
cases.
Add `ReadOnly` option. Do not create dir path in RO. Open underlying BoltDB
instance with ReadOnly flag. Document thal all writing operations should not
be called in ro (otherwise BoltDB txs fail).
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
`Blobovnicza` can be initialized with any number of range buckets, and
reconstructed with different size limit. In previous implementation
`Iterate` could miss some stored objects if we construct `Blobovnicza` with
smaller number of ranges.
Make `Iterate` to traverse all buckets regardless of current instance
bounds.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
In previous implementation `Blobovnicza.Iterate` op decoded object data only
and passed it to the handler. There is a need to iterate over all addresses
of the stored objects.
Add `DecodeAddresses` and `WithoutData` methods of `IteratePrm` type. Add
`Address` method to `IterationElement` type. Make `Iterate` to decode object
addresses if `DecodeAddress` was called and not read the data if
`WithoutData` was called. Implement `IterateAddresses` helper function to
simplify the code.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Each object from graveyard has tombstone or GC mark. If object has
tombstone, metabase should return `ErrAlreadyRemoved` on object requests.
This is the case when user clearly removed the object from container. GC
marks are used for physical removal which can appear even if object is still
presented in container (Control service, Policer job, etc.). In this case
metabase should return 404 error on object requests.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
`List` method of `Shard` must return only physically stored objects.
Use `AddPhyFilter` to select only phy objects.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Tombstone and "alive" objects can be both stored in BlobStor. They can
appear during iterating in different order. Metabase returns
`ErrAlreadyRemoved` error if object is inhumed.
Ignore `object.ErrAlreadyRemoved` errors of `metabase.Put`in Shard's
`refillMetabase` operation.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
There is a need to refill Metabase data with the objects from BlobStor.
Implement `refillMetabase` method which iterates over all objects from
BlobStor and saves them in Metabase.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
There is a need to be able to process all objects saved in `BlobStor`.
Implement `BlobStor.Iterate` method which iterates over all objects.
Implement `IterateBinaryObjects` and `IterateObjects` helper functions to
simplify the code.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
There is a need to be able to process all stored objects saved in
`Blobovnicza`.
Implement `Blobovnicza.Iterate` method which iterates over all objects.
Implement `IterateObjects` helper function to simplify the code.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
In the previous implementation of the metabase, there was no possibility of
reinitializing the metabase: clearing information about existing objects and
bringing it back to its initial state. This operation can be useful in
cases when the stored metadata about objects has lost (or possibly lost)
relevance, and you need to generate data from scratch. Also at the
initialization stage, static resources of the base were not created -
container-independent buckets.
Make `Metabase.Init` method to allocate graveyard, container-size and
to-move-it buckets in underlying BoltDB instance. Implement `Metabase.Reset`
method: it works like `Init` but clean up all static buckets and removes
other ones. Due to the logical similarity, the methods share a single piece
of code.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
There is a need to limit disk space used by write-cache. It is almost
impossible to calculate the value exactly. It is proposed to estimate the
size of the cache by the number of objects stored in it.
Track amounts of objects saved in DB and FSTree separately. To do this,
`ObjectCounters` interface is defined. It is generalized to a store of
numbers that can be made persistent (new option `WithObjectCounters`). By
default DB number is calculated as key number in default bucket, and FS
number is set same to DB since it is currently hard to read the actual value
from `FSTree` instance. Each PUT/DELETE operation to DB or FS
increases/decreases corresponding counter. Before each PUT op an overflow
check is performed with the following formula for evaluating the occupied
space: `NumDB * MaxDBSize + NumFS * MaxFSSize`. If next PUT can cause
write-cache overflow, object is written to the main storage.
By default maximum write-cache size is set to 1GB.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
There is a need to keep track of each local storage change. Log messages are
the most convenient way to do it.
Implement function which writes log message about the completed writing
operation in storage engine.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Shard should try to read object headers from write-cache if it is enabled.
Extend `writecache.Cache` interface with `Head` method. Call the method in
`Shard.Head` if `Shard.hasWriteCache` returns true.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Write cache should be able to execute HEAD operations according to spec.
Add simple implementation of `Head` method through the `Get` one. Leave
notes for future optimization.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Changes:
* replace `iotuil` elements with the ones from `os` package;
* replace `os.Filemode` with `fs.FileMode`;
* use `signal.NotifyContext` instead of `NewGracefulContext` (removed).
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
If object to be inhumed is root we need to continue first traverse over the
shards. In case when several children are stored in different shards,
inhuming object in a single shard leads to appearance of inhumed object in
subsequent selections. Also, any object can be already inhumed, and this
case is equivalent to successful inhume.
Do not fail on `object.ErrAlreadyRemoved` error. Continue first iterating
over shards if we detected root object (`SplitInfoError`).
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Write unit tests of `StorageEngine.Inhume` which assert that inhumed objects
don't appear in `Select` result.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Evicting from cache requires closing blobovnicza which
in turn needs to lock `activeMtx`. This lock is not needed on
every addition, but our LRU library doesn't return evicted keys.
In future we may consider switching to other implementation.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
This function already reused in different storage engine parts
so it makes sense to keep it in separate package.
Signed-off-by: Alex Vanin <alexey@nspcc.ru>
Different SplitInfo parts may be stored in different shards. Storage
engine must not stop at first SplitInfoError and should make
best effort to complete SplitInfo structure if needed.
Signed-off-by: Alex Vanin <alexey@nspcc.ru>
There were no unit tests of storage engine. This commit
adds first test to reproduce missing link ID in split info
at `engine.Head(raw)` request.
Engine tests uses some constructors from metabase tests,
so it is better to locate such functions in common
package at local_object_storage.
Signed-off-by: Alex Vanin <alexey@nspcc.ru>
`Inhume` operation can be performed on already deleted objects, and in this
case the entry will be added to the graveyard. `Delete` operation finishes
with error if object is not presented in metabase. However, the entry in the
cemetery must be deleted regardless of the presence of the object.
Additionally, now `Delete` does not return an error in the absence of an
object.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Metabase should not store payloads of objects. Make Put operation to cut
object payload before saving binary object in metabase.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>