Metrics of removed containers appear in the storage node #864

Closed
opened 2023-12-13 08:47:48 +00:00 by alexvanin · 7 comments
Owner

Expected Behavior

The storage node does not produce a zero-value container size metric after the container is removed.

Current Behavior

The storage node keeps a zero-value container size metric for every removed container it stored at some point.

Possible Solution

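  1. If the metabase can detect a container removal operation, then adding DeleteContainerSizeMetric to the EngineMetrics interface should be sufficient (similar to DeleteShardMetrics).

  2. Implement a custom registry for engine metrics whose Gather() (https://github.com/prometheus/client_golang/blob/fa1408ee351f6aba15c6d0207f7a0021eb3af406/prometheus/registry.go#L160) does not return zero-value metrics.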
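
The first option could look roughly like the sketch below. It is a stdlib-only illustration, not the frostfs-node code: the EngineMetrics interface is cut down to two methods with assumed signatures, and the in-memory memMetrics type (hypothetical) stands in for a real labelled gauge. Only the name DeleteContainerSizeMetric comes from the proposal above.

```go
package main

import "sync"

// EngineMetrics is a cut-down, hypothetical version of the interface
// discussed above; the method signatures are assumptions.
type EngineMetrics interface {
	AddToContainerSize(cnrID string, delta int64)
	DeleteContainerSizeMetric(cnrID string)
}

// memMetrics keeps one value per container ID and stands in for a
// labelled gauge exported by the node.
type memMetrics struct {
	mu    sync.Mutex
	sizes map[string]int64
}

// Compile-time check that memMetrics satisfies the interface.
var _ EngineMetrics = (*memMetrics)(nil)

func newMemMetrics() *memMetrics {
	return &memMetrics{sizes: make(map[string]int64)}
}

// AddToContainerSize adjusts the size; a series that drops to zero
// is still present in the map and would still be exported.
func (m *memMetrics) AddToContainerSize(cnrID string, delta int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.sizes[cnrID] += delta
}

// DeleteContainerSizeMetric removes the series entirely instead of
// leaving a zero value behind.
func (m *memMetrics) DeleteContainerSizeMetric(cnrID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.sizes, cnrID)
}

// Exported returns a snapshot of the series that would be exported.
func (m *memMetrics) Exported() map[string]int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make(map[string]int64, len(m.sizes))
	for id, v := range m.sizes {
		out[id] = v
	}
	return out
}
```

In this sketch, removing all objects leaves a zero-valued entry, and only an explicit DeleteContainerSizeMetric call makes the entry disappear from Exported().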

Steps to Reproduce (for bugs)

  1. Create a container
  2. Upload an object
  3. Check the frostfs_node_engine_container_size_bytes metric for the created container; it should have a non-zero value
  4. Remove the object from the container and wait until frostfs_node_engine_container_size_bytes returns a zero value
  5. Remove the container

Context

Billing software iterates over all containers reported by the frostfs_node_engine_container_size_bytes metric, and removed containers keep appearing in the list.

Regression

No.

Your Environment

FrostFS Storage v0.37.0

alexvanin added the
bug
frostfs-node
labels 2023-12-13 08:47:48 +00:00
fyrchik was assigned by alexvanin 2023-12-13 08:47:48 +00:00
Owner

Not sure this is a bug -- container exists in the blockchain, the metric reflects node local state (not about container, but about objects of this container stored on this node).
Why is it a problem? Can we alter expectations of the client software?

Author
Owner

@fyrchik I think we can pivot expectation here, definitely not a blocker issue.

While blockchain state and local state differ, it may look strange for any logical metrics to appear after container removal, because the container is not available anymore. Also, there might be quite a lot of removed containers, so the number of zero-value metrics may become an issue.

Anyway, I get your point. Agree, that it is not quite obvious. Maybe @realloc has some insights on this behavior as well.

Owner

A zero metric is consistent with reality; container metrics should not serve as a sign of the existence of a container. As an example, a node won't export any metrics for a newly created container unless we put objects in it.
It seems easy to delete zero metric values, but again, this could also happen for containers that do exist.

Author
Owner

> 0 metric is consistent with reality, container metrics should not serve as a sign of the existence of a container. As an example, node won't export any metrics for a newly created container unless we put objects in it.

That's a really good point. 👍

Owner

I agree that the metrics reflect the internal state of the node, but it needs to be adjusted, at least periodically, by looking at a known source of truth. The chain is just such a source for us. The node reacts to container deletion events anyway. Perhaps we can somehow link these processes and clean up the metrics at the same time?

fyrchik added this to the v0.38.0 milestone 2023-12-22 07:22:50 +00:00
fyrchik was unassigned by dstepanov-yadro 2023-12-22 12:31:41 +00:00
dstepanov-yadro self-assigned this 2023-12-22 12:31:41 +00:00
fyrchik self-assigned this 2023-12-22 12:31:43 +00:00
dstepanov-yadro was unassigned by fyrchik 2023-12-22 12:31:43 +00:00
fyrchik removed their assignment 2023-12-22 12:31:50 +00:00
dstepanov-yadro was assigned by fyrchik 2023-12-22 12:31:51 +00:00

Using a custom registry for engine metrics which will not Gather() zero-value metrics is no different from filtering out zero values on the client side.

There are no suitable methods in the Prometheus library.
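
For comparison, client-side filtering of zero values could be sketched as below. This is an illustration under assumptions, not part of any billing tool: the helper nonZeroSamples is hypothetical, and it only handles simple lines of the Prometheus text exposition format (metric name, optional label set, value, optional timestamp).

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// nonZeroSamples returns the sample lines of the given metric whose
// value parses as a non-zero number. Comment lines, other metrics and
// unparsable values are skipped.
func nonZeroSamples(exposition, metric string) []string {
	var kept []string
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// Match the exact metric name, with or without labels.
		if !strings.HasPrefix(line, metric+"{") && !strings.HasPrefix(line, metric+" ") {
			continue
		}
		rest := line[len(metric):]
		// Skip the label set, if any; the value follows the last '}'.
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil || v == 0 {
			continue
		}
		kept = append(kept, line)
	}
	return kept
}
```

As noted earlier in the thread, such a filter also hides containers that still exist but currently hold no objects on the node.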

The metric frostfs_node_engine_container_size_bytes displays the size of containers for all shards. To correctly delete this metric, you need to make sure of the following:

  1. there are no objects belonging to this container on any shard (no shard will start changing this metric after deletion)
  2. the container has been deleted in the blockchain

It turns out that we need a garbage collector for metrics.
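
The two conditions above can be expressed as a small selection step for such a garbage collector. The sketch below is an assumption-laden illustration: the shardObjects type, the existsOnChain callback and the function names are hypothetical and do not come from frostfs-node.

```go
package main

// shardObjects maps a container ID to the number of objects that one
// shard stores for it. The type is hypothetical.
type shardObjects map[string]uint64

// storedAnywhere reports whether any shard still holds objects of the
// given container.
func storedAnywhere(id string, shards []shardObjects) bool {
	for _, s := range shards {
		if s[id] > 0 {
			return true
		}
	}
	return false
}

// metricsToDelete returns the container IDs whose metrics are safe to
// drop: the container is gone from the chain AND no shard stores its
// objects. exported lists the IDs that currently have metrics.
func metricsToDelete(exported []string, shards []shardObjects, existsOnChain func(id string) bool) []string {
	var stale []string
	for _, id := range exported {
		if existsOnChain(id) || storedAnywhere(id, shards) {
			continue
		}
		stale = append(stale, id)
	}
	return stale
}
```

Checking both conditions keeps the metric for a removed container whose objects are still being deleted, so a shard cannot recreate the series right after the cleanup.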

Also, this metric and the associated metric of the number of objects in the container (frostfs_node_engine_container_objects_total) have high cardinality. For example, 10 000 containers on 4 storage nodes yield 160 000 metric values: 10 000 (container count) x 4 (size, phy count, logic count, user count) x 4 (node count) = 160 000. This may be a problem for Prometheus.
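
The estimate above is a plain product, which the following sketch reproduces. The function name is hypothetical, and the calculation assumes, as the example does, that every node exports all four series for every container.

```go
package main

// seriesCount estimates the number of per-container metric values in
// the cluster, assuming every node exports seriesPerContainer values
// for every container.
func seriesCount(containers, seriesPerContainer, nodes int) int {
	return containers * seriesPerContainer * nodes
}
```

With the numbers from the example, seriesCount(10000, 4, 4) gives 160 000.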


Metrics frostfs_node_engine_container_size_bytes and frostfs_node_engine_container_objects_total are now deleted once the container has been deleted and its objects have been removed. The cleanup occurs asynchronously once per epoch.
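
An asynchronous, once-per-epoch trigger of this kind could be sketched as follows. This is not the actual implementation: the epochs channel, the cleanup callback and the function name startCleanup are hypothetical.

```go
package main

// startCleanup runs cleanup in a background goroutine, once for every
// epoch number received on epochs. The returned channel is closed when
// epochs is closed and the last cleanup has finished.
func startCleanup(epochs <-chan uint64, cleanup func(epoch uint64)) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		for e := range epochs {
			cleanup(e)
		}
	}()
	return done
}
```

The caller that publishes new epochs only sends a number on the channel; the metric cleanup itself runs in the background goroutine.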

Reference: TrueCloudLab/frostfs-node#864