Metrics of removed containers appear in the storage node #864
Expected Behavior
The storage node does not produce a zero-value container size metric after container removal.
Current Behavior
The storage node keeps a zero-value container size metric for every removed container whose objects it stored at some point.
Possible Solution
- If the metabase can detect the container removal operation, then adding DeleteContainerSizeMetric to the EngineMetrics interface should be sufficient (similar to DeleteShardMetrics); see the sketch after this list.
- Implement a custom registry for engine metrics which will not Gather() zero-value metrics.
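A minimal sketch of the first option, assuming a hypothetical EngineMetrics shape and a prometheus.GaugeVec keyed by container ID; the actual interface and label names in frostfs-node may differ:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// EngineMetrics is a hypothetical interface for illustration; the real one
// in frostfs-node may have different methods and signatures.
type EngineMetrics interface {
	AddToContainerSize(cnrID string, size int64)
	// DeleteContainerSizeMetric removes the per-container series once the
	// metabase observes container removal (similar to DeleteShardMetrics).
	DeleteContainerSizeMetric(cnrID string)
}

type engineMetrics struct {
	// containerSize is assumed to be a gauge vector labeled by container ID.
	containerSize *prometheus.GaugeVec
}

func (m *engineMetrics) AddToContainerSize(cnrID string, size int64) {
	m.containerSize.WithLabelValues(cnrID).Add(float64(size))
}

func (m *engineMetrics) DeleteContainerSizeMetric(cnrID string) {
	// DeleteLabelValues drops the time series entirely, so the removed
	// container stops being exported instead of lingering with a zero value.
	m.containerSize.DeleteLabelValues(cnrID)
}
```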
Steps to Reproduce (for bugs)
1. Check the frostfs_node_engine_container_size_bytes metric for a created container; it should have a non-zero value.
2. After the container is removed, frostfs_node_engine_container_size_bytes still returns a zero value for it.
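For reference, a small Go snippet that scrapes the node's metrics endpoint and prints the container size series; the endpoint address is an assumption, adjust it to the node's configured Prometheus listener:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// The address below is an assumption; use the metrics endpoint
	// configured for your storage node.
	resp, err := http.Get("http://localhost:9090/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "frostfs_node_engine_container_size_bytes") {
			// After container removal, the removed container's series
			// still shows up here with value 0.
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```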
Context
Billing software iterates over all containers from the frostfs_node_engine_container_size_bytes metric, and removed containers keep appearing in the list.
Regression
No.
Your Environment
FrostFS Storage v0.37.0
Not sure this is a bug: the container exists in the blockchain, while the metric reflects the node's local state (not the container itself, but the objects of this container stored on this node).
Why is it a problem? Can we alter the expectations of the client software?
@fyrchik I think we can pivot the expectation here; this is definitely not a blocker issue.
While the blockchain state and the local state differ, it may look strange for any logical metrics to still appear after container removal, because the container is not available anymore. Also, there might be quite a few removed containers, so the number of zero-value metrics may become an issue.
Anyway, I get your point. Agreed, it is not quite obvious. Maybe @realloc has some insights on this behavior as well.
A zero metric is consistent with reality; container metrics should not serve as a sign of the existence of a container. As an example, a node won't export any metrics for a newly created container unless we put objects into it.
It seems easy to delete zero metric values, but again, this could also happen for containers that do exist.
That's a really good point. 👍
I agree that the metrics reflect the internal state of the node, but they need to be adjusted, at least periodically, against a known source of truth. The chain is just such a source for us. The node reacts to container deletion events anyway. Perhaps we can somehow link these processes and clean up the metrics at the same time?
Using a "custom registry for engine metrics which will not Gather() zero-value metrics" is no different from filtering zero values on the client side. There are no suitable methods in the Prometheus library.
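To illustrate the point: the closest the Prometheus client library allows is wrapping the registry in a custom prometheus.Gatherer that drops zero-value samples after they have already been collected, so the series still exist in the registry and are merely hidden at scrape time. A sketch of such a wrapper (not frostfs-node code):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// zeroFilteringGatherer wraps a Gatherer and drops gauge samples whose
// value is zero. The underlying time series are not deleted; they are
// only hidden from the scrape output, which is equivalent to filtering
// zero values on the client side.
type zeroFilteringGatherer struct {
	inner prometheus.Gatherer
}

func (g zeroFilteringGatherer) Gather() ([]*dto.MetricFamily, error) {
	families, err := g.inner.Gather()
	if err != nil {
		return nil, err
	}
	for _, mf := range families {
		if mf.GetType() != dto.MetricType_GAUGE {
			continue
		}
		filtered := mf.Metric[:0]
		for _, m := range mf.Metric {
			if m.GetGauge().GetValue() != 0 {
				filtered = append(filtered, m)
			}
		}
		mf.Metric = filtered
	}
	return families, nil
}
```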
The metric frostfs_node_engine_container_size_bytes displays the size of containers for all shards. To correctly delete this metric, you need to make sure of a number of conditions across the shards; it turns out that we need a garbage collector for metrics.
Also, this metric and the associated metric of the number of objects in the container (frostfs_node_engine_container_objects_total) have high cardinality. For example, for 10 000 containers on 4 storage nodes we will have 160 000 metric values: 10 000 (container count) x 4 (size, phy count, logic count, user count) x 4 (node count) = 160 000. It may be a problem for Prometheus.
Related: frostfs_node_engine_container_size_bytes and ..._count_total metric for removed containers #889
Metrics frostfs_node_engine_container_size_bytes and frostfs_node_engine_container_objects_total are now deleted after the container is deleted and its objects are removed. The cleanup occurs asynchronously once per epoch.
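A rough sketch of how such a per-epoch cleanup might be wired up, using assumed names throughout (the container source, the metrics interface, and the tracking helper are all hypothetical; the actual frostfs-node implementation is structured differently):

```go
package engine

// ContainerSource reports whether a container still exists in the sidechain.
// This interface and the names below are assumptions for illustration only.
type ContainerSource interface {
	Exists(cnrID string) (bool, error)
}

// EngineMetrics is the hypothetical metrics interface from the sketch above.
type EngineMetrics interface {
	DeleteContainerSizeMetric(cnrID string)
	DeleteContainerObjectsMetric(cnrID string)
}

type StorageEngine struct {
	containers ContainerSource
	metrics    EngineMetrics
	// trackedContainers returns IDs that currently have exported metrics.
	trackedContainers func() []string
}

// HandleNewEpoch runs the metrics cleanup asynchronously, once per epoch.
func (e *StorageEngine) HandleNewEpoch() {
	go func() {
		for _, cnrID := range e.trackedContainers() {
			exists, err := e.containers.Exists(cnrID)
			if err != nil {
				continue // transient error: retry on the next epoch
			}
			if !exists {
				// The container was removed: drop its series so zero
				// values no longer appear in the exported metrics.
				e.metrics.DeleteContainerSizeMetric(cnrID)
				e.metrics.DeleteContainerObjectsMetric(cnrID)
			}
		}
	}()
}
```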