Metrics of removed containers appear in the storage node #864
Expected Behavior
The storage node does not produce a zero-value container size metric after container removal.
Current Behavior
The storage node keeps a zero-value container size metric for every removed container whose objects it stored at some point.
Possible Solution
- If the metabase can detect the container removal operation, then adding DeleteContainerSizeMetric to the EngineMetrics interface should be sufficient (similar to DeleteShardMetrics); see the sketch after this list.
- Implement a custom registry for engine metrics which will not Gather() zero-value metrics.
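A minimal sketch of the first option, assuming a hypothetical EngineMetrics shape and a prometheus.GaugeVec keyed by container ID; the actual interface and label names in frostfs-node may differ:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// EngineMetrics is a hypothetical interface for illustration; the real one
// in frostfs-node may have different methods and signatures.
type EngineMetrics interface {
	AddToContainerSize(cnrID string, size int64)
	// DeleteContainerSizeMetric removes the per-container series once the
	// metabase observes container removal (similar to DeleteShardMetrics).
	DeleteContainerSizeMetric(cnrID string)
}

type engineMetrics struct {
	// containerSize is assumed to be a gauge vector labeled by container ID.
	containerSize *prometheus.GaugeVec
}

func (m *engineMetrics) AddToContainerSize(cnrID string, size int64) {
	m.containerSize.WithLabelValues(cnrID).Add(float64(size))
}

func (m *engineMetrics) DeleteContainerSizeMetric(cnrID string) {
	// DeleteLabelValues drops the time series entirely, so the removed
	// container stops being exported instead of lingering with a zero value.
	m.containerSize.DeleteLabelValues(cnrID)
}
```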
Steps to Reproduce (for bugs)
1. Check the frostfs_node_engine_container_size_bytes metric for a created container; it should have a non-zero value.
2. After the container is removed, frostfs_node_engine_container_size_bytes still returns a zero value for it.
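For reference, a small Go snippet that scrapes the node's metrics endpoint and prints the container size series; the endpoint address is an assumption, adjust it to the node's configured Prometheus listener:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// The address below is an assumption; use the metrics endpoint
	// configured for your storage node.
	resp, err := http.Get("http://localhost:9090/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "frostfs_node_engine_container_size_bytes") {
			// After container removal, the removed container's series
			// still shows up here with value 0.
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```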
Context
Billing software iterates over all containers from the frostfs_node_engine_container_size_bytes metric, and removed containers keep appearing in the list.
Regression
No.
Your Environment
FrostFS Storage v0.37.0
Not sure this is a bug: the container exists in the blockchain, while the metric reflects the node's local state (not the container itself, but the objects of this container stored on this node).
Why is it a problem? Can we alter the expectations of the client software?
@fyrchik I think we can pivot the expectation here; this is definitely not a blocker issue.
While the blockchain state and the local state differ, it may look strange for any logical metrics to still appear after container removal, because the container is not available anymore. Also, there might be quite a few removed containers, so the number of zero-value metrics may become an issue.
Anyway, I get your point. Agreed, it is not quite obvious. Maybe @realloc has some insights on this behavior as well.
A zero metric is consistent with reality; container metrics should not serve as a sign of the existence of a container. As an example, a node won't export any metrics for a newly created container unless we put objects into it.
It seems easy to delete zero metric values, but again, this could also happen for containers that do exist.
That's a really good point. 👍
I agree that the metrics reflect the internal state of the node, but they need to be adjusted, at least periodically, against a known source of truth. The chain is just such a source for us. The node reacts to container deletion events anyway. Perhaps we can somehow link these processes and clean up the metrics at the same time?
Using a "custom registry for engine metrics which will not Gather() zero-value metrics" is no different from filtering zero values on the client side. There are no suitable methods in the Prometheus library.
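To illustrate the point: the closest the Prometheus client library allows is wrapping the registry in a custom prometheus.Gatherer that drops zero-value samples after they have already been collected, so the series still exist in the registry and are merely hidden at scrape time. A sketch of such a wrapper (not frostfs-node code):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// zeroFilteringGatherer wraps a Gatherer and drops gauge samples whose
// value is zero. The underlying time series are not deleted; they are
// only hidden from the scrape output, which is equivalent to filtering
// zero values on the client side.
type zeroFilteringGatherer struct {
	inner prometheus.Gatherer
}

func (g zeroFilteringGatherer) Gather() ([]*dto.MetricFamily, error) {
	families, err := g.inner.Gather()
	if err != nil {
		return nil, err
	}
	for _, mf := range families {
		if mf.GetType() != dto.MetricType_GAUGE {
			continue
		}
		filtered := mf.Metric[:0]
		for _, m := range mf.Metric {
			if m.GetGauge().GetValue() != 0 {
				filtered = append(filtered, m)
			}
		}
		mf.Metric = filtered
	}
	return families, nil
}
```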
The metric frostfs_node_engine_container_size_bytes displays the size of containers for all shards. To correctly delete this metric, you need to make sure of a number of conditions across the shards; it turns out that we need a garbage collector for metrics.
Also, this metric and the associated metric of the number of objects in the container (frostfs_node_engine_container_objects_total) have high cardinality. For example, for 10 000 containers on 4 storage nodes we will have 160 000 metric values: 10 000 (container count) x 4 (size, phy count, logic count, user count) x 4 (node count) = 160 000. It may be a problem for Prometheus.
Related: frostfs_node_engine_container_size_bytes and ..._count_total metric for removed containers #889
Metrics frostfs_node_engine_container_size_bytes and frostfs_node_engine_container_objects_total are now deleted after the container is deleted and its objects are removed. The cleanup occurs asynchronously once per epoch.
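A rough sketch of how such a per-epoch cleanup might be wired up, using assumed names throughout (the container source, the metrics interface, and the tracking helper are all hypothetical; the actual frostfs-node implementation is structured differently):

```go
package engine

// ContainerSource reports whether a container still exists in the sidechain.
// This interface and the names below are assumptions for illustration only.
type ContainerSource interface {
	Exists(cnrID string) (bool, error)
}

// EngineMetrics is the hypothetical metrics interface from the sketch above.
type EngineMetrics interface {
	DeleteContainerSizeMetric(cnrID string)
	DeleteContainerObjectsMetric(cnrID string)
}

type StorageEngine struct {
	containers ContainerSource
	metrics    EngineMetrics
	// trackedContainers returns IDs that currently have exported metrics.
	trackedContainers func() []string
}

// HandleNewEpoch runs the metrics cleanup asynchronously, once per epoch.
func (e *StorageEngine) HandleNewEpoch() {
	go func() {
		for _, cnrID := range e.trackedContainers() {
			exists, err := e.containers.Exists(cnrID)
			if err != nil {
				continue // transient error: retry on the next epoch
			}
			if !exists {
				// The container was removed: drop its series so zero
				// values no longer appear in the exported metrics.
				e.metrics.DeleteContainerSizeMetric(cnrID)
				e.metrics.DeleteContainerObjectsMetric(cnrID)
			}
		}
	}()
}
```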