All error counting and hangling logic is present on the engine level.
Currently, we pass engine metrics with shard ID metric to shard, then
export 3 methods to manipulate these metrics.
In this commits all methods are removed and error counter is tracked on
the engine level exlusively.
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
- `reportShardErrorBackground()` no longer differs from
`reportShardError()`, reflect this in its name;
- reuse common pieces of code to make it simpler.
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Concurrent initialization in case of the metabase resync leads to
high memory consumption and potential OOM.
Signed-off-by: Dmitrii Stepanov <d.stepanov@yadro.com>
Now it's possible to run evacuate shard in async.
Also only one evacuate process can be in progress.
Signed-off-by: Dmitrii Stepanov <d.stepanov@yadro.com>
RemoveDuplicates() removes all duplicate object copies stored on
multiple shards. All shards are processed and the command tries to leave
a copy on the best shard according to HRW.
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
All logic errors are wrapped in `logicerr.Logical` type and do not
affect shard error counter.
Signed-off-by: Evgenii Stratonikov <evgeniy@morphbits.ru>
Make it store its internal `zap.Logger`'s level. Also, make all the
components to accept internal `logger.Logger` instead of `zap.Logger`; it
will simplify future refactor.
Signed-off-by: Pavel Karpy <carpawell@nspcc.ru>
`Degraded` mode can be set by the administrator if needed.
Modifying operations in this mode can lead node into an inconsistent state
because metabase checks such as lock checking are not performed.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
`Degraded` mode is set automatically after error counter is over the
threshold. `ReadOnly` mode can still be set by an administrator.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
There are certain errors which are not expected during usual node
operation and which tell us that something is wrong with the shard.
To prevent possible data corruption, move shard in read-only mode after
amount of errors exceeded some threshold. By default no actions are performed.
Signed-off-by: Evgenii Stratonikov <evgeniy@nspcc.ru>
There is a need to disable execution of local data operation on storage
engine in runtime. If storage engine ops are blocked, node will act like
always but all local object operations will be denied.
Implement `BlockExecution` / `ResumeExecution` methods on `StorageEngine`
which blocks / resumes the execution of data ops. Wait for the completion of
all operations executed at the time of the call. Return error passed to
`BlockExecution` from all data-related methods until `ResumeExecution` call.
Make `Close` to block operations as well.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>
Make `StorageEngine` to use non-blocking worker pools with the same
(configurable) size for PUT operation. This allows you to switch to using
more free shards when overloading others, thereby more evenly distributing
the write load.
Signed-off-by: Leonard Lyubich <leonard@nspcc.ru>