Move changes from the support branch. #58

Merged
fyrchik merged 25 commits from move-changes into master 2023-02-20 10:53:28 +00:00

25 commits

Author SHA1 Message Date
50e35de457 [#1868] Reload config for pprof and metrics on SIGHUP
Signed-off-by: Anton Nikiforov <an.nikiforov@yadro.com>
2023-02-17 12:29:36 +03:00
b679a724ef [#2260] node: Use a separate client cache for PUT service
Currently, under a mixed load one failed PUT can lead to closing
connection for all concurrent GETs. For PUT it does no harm: we have
many other nodes to choose from. For GET we are limited by `REP N`
factor, so in case of failover we can close the connection with the only
node posessing an object, which leads to failing the whole operation.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
0872d2aa49 [#2260] network/cache: Ignore clients only on Dial errors
The problem is that accidental timeout errors can make us to ignore
other nodes for some time. The primary purpose of the whole ignore
mechanism is not to degrade in case of failover. For this case,
closing connection and limiting the amount of dials is enough.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
c6a5e3f4ee [#2260] network/cache: Ignore context cancelled errors
Timeouts on client side should node affect inter-node communication.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
9c3a029941 [#2260] services/object: Do not assemble object with TTL=1
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
f1469626f5 [#2234] writecache: Fix possible panic in initFlushMarks
In case we have many small objects in the write-cache, `indices` should
not be reused between iterations.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
a09543c0e1 [#2252] fstree: Allow concurrent writes
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
Pavel Karpy
6c6319fc89 [#2164] node: Fix multi-client error reporting
Missing `ReportError` method did not allow casing multi-client interface to
`errorReporter` interface and dropping broken connections.
`replicationClient` embeds that interface, and it is widely used across
node's code. Embedded interface does not allow casting its parent structure
to `errorReporter` and breaks multi client error reporting logic.
Multi-client scheme is extremely hard to maintain, it makes unpredictable
casts and does not allow tracking code flow, so it will be refactored in the
future anyway.

Signed-off-by: Pavel Karpy <p.karpy@yadro.com>
2023-02-17 12:24:23 +03:00
Pavel Karpy
31b011f4ae [#2244] node: Fix subscriptions lock
Subscribing without async listening could lead to a dead-lock in the
`neo-go` client.

Signed-off-by: Pavel Karpy <p.karpy@yadro.com>
2023-02-17 12:24:23 +03:00
Pavel Karpy
6afe96a171 [#2244] node: Add object address to WC's operations
Signed-off-by: Pavel Karpy <p.karpy@yadro.com>
2023-02-17 12:24:23 +03:00
Pavel Karpy
90476d3bad [#2244] node: Update expired storage ID by WC
Previously, node could get an "infinite" small object: it could be expired
and thus could not be flushed (update its storage ID) to metabase => could
not be marked as flushed => node never removes such object and repeat all
the cycle one more time. If object exists and is not marked with GC (meta
returns `ErrObjectIsExpired`, not `ObjectNotFound` and not
`ObjectAlreadyRemoved`), its ID is safe to update _in the same_ bbolt
transaction.

Signed-off-by: Pavel Karpy <p.karpy@yadro.com>
2023-02-17 12:24:23 +03:00
889702b4d5 [#2246] node: Allow to configure tombsone lifetime
Currently, DELETE service sets tombstone expiration epoch to
`current epoch + 5`. This works less than ideal in private networks
where an epoch can be e.g. 10 minutes. In this case, after a node is
unavailable for more than 1 hour, already deleted objects have a chance
to reappear.

After this commit tombstone lifetime can be configured.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
64350f7b0f [#2241] metrics: Fix request count metrics names
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
d213461bf8 [#2238] engine: Add test for component initialization failures
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
b2159b9608 [#2238] engine: Add test for component initialization failures
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
1fe9a650c7 [#2238] neofs-node: Gracefully handle shard initialization errors
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
53067a7db0 [#2238] shard: Try closing all components
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
9a61e53162 [#2238] engine: Make Open and Init similar
1. Both could initialize shards in parallel.
2. Both should close shards after an error.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
75de3f1c1f [#2239] writecache: Fix possible deadlock
LRU `Peek`/`Contains` take LRU mutex _inside_ of a `View` transaction.
`View` transaction itself takes `mmapLock` [1], which is lifted after tx
finishes (in `tx.Commit()` -> `tx.close()` -> `tx.db.removeTx`)

When we evict items from LRU cache mutex order is different:
first we take LRU mutex and then execute `Batch` which _does_ take
`mmapLock` in case we need to remap. Thus the deadlock.

[1] 8f4a7e1f92/db.go (L708)

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
de344e9223 [#2232] pilorama: Merge in-queue batches
To achieve high performance we must choose proper values for both
batch size and delay. For user operations we want to set low delay.
However it would prevent tree synchronization operations to form big
enough batches. For these operations, batching gives the most benefit
not only in terms of on-CPU execution cost, but also by speeding up
transaction persist (`fsync`).
In this commit we try merging batches that are already
_triggered_, but not yet _started to execute_. This way we can still
query batches for execution after the provided delay while also allowing
multiple formed batches to execute faster.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
f1e3309ca3 [#2224] adm: Use native neo-go sessions in dump-hashes
If we had lots of domains in one zone, `dump-hashes` for all others
can miss some domains, because we need to restrict ourselves with _some_
number.
In this commit we use neo-go sessions by default, with a proper
failback to in-script iterator unwrapping.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:23 +03:00
Pavel Karpy
e17bf71d19 [#2213] node: Do not return object expired object
"Object is expired" means that object is presented in `meta` but it is not
`ObjectNotFound` error. Previous implementation made `shard` search for an
object without `meta` which was an error.

Signed-off-by: Pavel Karpy <p.karpy@yadro.com>
2023-02-17 12:24:23 +03:00
Roman Khimov
7b0708f50b CHANGELOG: add more fancy glyphs
How could you forget adding it?

Signed-off-by: Roman Khimov <roman@nspcc.ru>
2023-02-17 12:24:23 +03:00
Roman Khimov
3b38aedb38 CHANGELOG: fix whitespacing errors
Signed-off-by: Roman Khimov <roman@nspcc.ru>
2023-02-17 12:24:23 +03:00
1f01a0a71a [#2212] morph: Fix subscription restoration
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
2023-02-17 12:24:22 +03:00