[#370] Add tree service metrics #388

Merged
fyrchik merged 1 commit from ale64bit/frostfs-node:feature/370-metrics-tree-service into master 2023-05-26 13:39:14 +00:00
Member

Signed-off-by: Alejandro Lopez <a.lopez@yadro.com>
Close #370
ale64bit requested review from storage-core-committers 2023-05-24 07:06:46 +00:00
ale64bit requested review from storage-core-developers 2023-05-24 07:06:47 +00:00
acid-ant approved these changes 2023-05-24 08:07:26 +00:00
dstepanov-yadro reviewed 2023-05-24 08:51:54 +00:00
@ -143,3 +145,4 @@
zap.String("err", err.Error()),
zap.Stringer("cid", op.cid),
zap.String("treeID", op.treeID))
s.metrics.IncReplicateErrors()

Why isn't the duration of failed replicate operations measured?
Author
Member

I was concerned that if some of them failed too quickly, they would skew the distribution of those that succeeded.

We can use labels.
There are best practices: https://prometheus.io/docs/practices/naming/#labels

Use labels to differentiate the characteristics of the thing that is being measured:

* api_http_requests_total - differentiate request types: operation="create|update|delete"

* api_request_duration_seconds - differentiate request stages: stage="extract|transform|load"
Author
Member

done. Thanks for the suggestion.

done. Thanks for the suggestion.
dstepanov-yadro marked this conversation as resolved
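
For reference, a minimal sketch of the label pattern quoted above, using the standard prometheus/client_golang API. The metric and label names come from the Prometheus docs, not from this PR:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// One CounterVec with an "operation" label instead of a separate counter per
// operation type; each label value becomes its own time series.
var apiRequestsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "api_http_requests_total",
	Help: "Total number of API requests, partitioned by operation",
}, []string{"operation"})

func init() {
	prometheus.MustRegister(apiRequestsTotal)
}

func recordCreate() {
	// Dashboards can then filter or aggregate by the label, e.g.
	// rate(api_http_requests_total{operation="create"}[1m]).
	apiRequestsTotal.WithLabelValues("create").Inc()
}
```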
fyrchik reviewed 2023-05-24 10:28:47 +00:00
@ -28,11 +28,14 @@ type cfg struct {
cnrSource ContainerSource
eaclSource container.EACLSource
forest pilorama.Forest
Owner

unrelated to the commit.
Author
Member

done
fyrchik marked this conversation as resolved
@ -119,0 +122,4 @@
func WithMetrics(v MetricsRegister) Option {
return func(c *cfg) {
c.metrics = v
Owner

What about a sane dummy non-null default?
Author
Member

done

done
fyrchik marked this conversation as resolved
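
To illustrate the "dummy default" idea, a minimal sketch assuming the shape of `MetricsRegister` used by `WithMetrics`; only one method is shown, and the field layout of `cfg` is simplified rather than copied from the PR:

```go
package tree

import "time"

// MetricsRegister is shown with a single method; the real interface has more.
type MetricsRegister interface {
	AddReplicateWaitDuration(d time.Duration, success bool)
}

// noopMetrics satisfies MetricsRegister but records nothing.
type noopMetrics struct{}

func (noopMetrics) AddReplicateWaitDuration(time.Duration, bool) {}

type cfg struct {
	metrics MetricsRegister
	// other options elided
}

// defaultCfg returns a configuration with a usable metrics field, so callers
// never have to nil-check it; WithMetrics overrides the no-op default.
func defaultCfg() *cfg {
	return &cfg{metrics: noopMetrics{}}
}
```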
@ -70,6 +70,7 @@ func (s *Service) replicationWorker(ctx context.Context) {
case <-s.closeCh:
return
case task := <-s.replicationTasks:
s.metrics.IncReplicateTasks()
Owner

And here we measure each operation multiple times. Was that intended? Compare with `AddReplicateDuration`.
What about replacing that duration with `ReplicateWaitTime`, and using Count + Duration here?
Author
Member

done
fyrchik marked this conversation as resolved
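
A rough sketch of the Count + Duration idea for the worker, with illustrative names (the enqueue timestamp and the metric method are not taken from the PR):

```go
package tree

import (
	"context"
	"time"
)

// replicationTask carries the enqueue timestamp so the worker can observe
// how long the task waited in the queue.
type replicationTask struct {
	enqueuedAt time.Time
	// payload elided
}

type workerMetrics interface {
	ObserveReplicateWait(d time.Duration)
}

type worker struct {
	closeCh          chan struct{}
	replicationTasks chan replicationTask
	metrics          workerMetrics
}

func (w *worker) run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-w.closeCh:
			return
		case task := <-w.replicationTasks:
			// Each task is observed exactly once here: this both counts the
			// task and records the time it spent waiting in the queue.
			w.metrics.ObserveReplicateWait(time.Since(task.enqueuedAt))
			// ... perform the actual replication work ...
		}
	}
}
```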
@ -145,1 +147,4 @@
zap.String("treeID", op.treeID))
s.metrics.IncReplicateErrors()
} else {
s.metrics.AddReplicateDuration(time.Since(start))
Owner

Just to be aware: `replicate` does 3 things:

  1. Signs the operation.
  2. Calculates the list of nodes to replicate to.
  3. Sends the request.

While this metric *is* useful (how much time do we wait in the queue), can we also measure the time spent in `replicationWorker`?
Author
Member

done
fyrchik marked this conversation as resolved
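
And a sketch of additionally timing the work done in `replicationWorker` itself — the sign / compute-node-list / send steps — separately from the queue wait time. `AddReplicateTaskDuration` is an illustrative name, not necessarily the one used in the PR:

```go
package tree

import (
	"context"
	"time"
)

type taskMetrics interface {
	AddReplicateTaskDuration(d time.Duration, success bool)
}

// handleTask times the processing step itself (signing, computing the target
// node list, sending the request), as opposed to the time spent in the queue.
func handleTask(ctx context.Context, m taskMetrics, replicate func(context.Context) error) {
	start := time.Now()
	err := replicate(ctx)
	// One observation per task, labelled by outcome.
	m.AddReplicateTaskDuration(time.Since(start), err == nil)
}
```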
@ -365,6 +366,32 @@ func (s *Service) SynchronizeAll() error {
}
}
func (s *Service) sync(ctx context.Context) {
Owner

Again, what about a separate commit?
Author
Member

done
fyrchik marked this conversation as resolved
ale64bit force-pushed feature/370-metrics-tree-service from ee1aced7bd to 6d4047d00d 2023-05-24 11:55:59 +00:00 Compare
ale64bit force-pushed feature/370-metrics-tree-service from 6d4047d00d to 92f89d33b6 2023-05-25 07:16:08 +00:00 Compare
dstepanov-yadro reviewed 2023-05-25 13:10:23 +00:00
@ -0,0 +46,4 @@
Name: "replicate_wait_duration_seconds",
Help: "Duration of overall waiting time for replication loops",
}, []string{treeServiceLabelSuccess}),
syncOps: newCounter(prometheus.CounterOpts{

@ale64bit @fyrchik Maybe it's not worth separating operations and errors, and we should use labels everywhere instead? After all, each operation either ends with an error or without one.
Author
Member

I didn't fully understand what you meant by *"each operation either ends with an error or without one"*?

I'm sorry, I didn't make myself clear enough.
This code:

```go
case <-s.syncChan:
	s.metrics.IncSyncOps()

	cnrs, err := s.cfg.cnrSource.List()
	if err != nil {
		s.metrics.IncSyncErrors()
		return
	}
```

Operations and errors are counted separately here. I think it should be a single counter with label separation.
If we need to display the number of successful operations, we would have to use an expression like `irate(sync_operation[1m]) - irate(sync_operation_errors[1m])` instead of `irate(sync_operation{success="true"}[1m])`.
Author
Member

You mean that we should just use a single `HistogramVec` for operations in general (with a `success` label), since it includes a count of the observations as well?

yes
Author
Member

I agree it's better. Done.
dstepanov-yadro marked this conversation as resolved
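
A minimal sketch of the single-`HistogramVec` approach agreed on above, again using the standard prometheus/client_golang API; the namespace, subsystem, and metric names here are illustrative, not necessarily those in the final PR:

```go
package metrics

import (
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var syncDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "frostfs",
	Subsystem: "treeservice",
	Name:      "sync_duration_seconds",
	Help:      "Duration of container synchronization, labelled by outcome",
}, []string{"success"})

func init() {
	prometheus.MustRegister(syncDuration)
}

// AddSyncDuration records one sync operation. The histogram's _count series
// replaces separate "ops" and "errors" counters.
func AddSyncDuration(d time.Duration, success bool) {
	syncDuration.With(prometheus.Labels{
		"success": strconv.FormatBool(success),
	}).Observe(d.Seconds())
}
```

With this shape, the number of successful operations is available directly as `sync_duration_seconds_count{success="true"}`, so subtracting two counters as in the earlier `irate` example is no longer needed.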
dstepanov-yadro reviewed 2023-05-25 13:41:04 +00:00
@ -382,2 +387,3 @@
s.metrics.IncSyncErrors()
span.End()
continue
return

why return?
Author
Member

fixed
dstepanov-yadro marked this conversation as resolved
ale64bit force-pushed feature/370-metrics-tree-service from 92f89d33b6 to 87cc5877ef 2023-05-25 14:03:47 +00:00 Compare
dstepanov-yadro approved these changes 2023-05-25 14:28:54 +00:00
fyrchik approved these changes 2023-05-26 09:10:05 +00:00
@ -9,9 +12,10 @@ type NodeMetrics struct {
engineMetrics
stateMetrics
replicatorMetrics
epoch metric[prometheus.Gauge]
Owner

Why did we move this line?
Author
Member

I moved it to clearly split the embedded fields from the others, including this one.
@ -143,3 +148,4 @@
zap.String("err", err.Error()),
zap.Stringer("cid", op.cid),
zap.String("treeID", op.treeID))
s.metrics.AddReplicateWaitDuration(time.Since(start), false)
Owner

Can we unify the 3 lines with `s.metrics.AddReplicateWaitDuration(time.Since(start), err == nil)`?
Author
Member

done
fyrchik marked this conversation as resolved
@ -386,10 +391,11 @@ func (s *Service) syncLoop(ctx context.Context) {
newMap, cnrsToSync := s.containersToSync(cnrs)
s.syncContainers(ctx, cnrsToSync)
Owner

Why this change?
Author
Member

the empty line seemed superfluous to me.
@ -374,11 +375,15 @@ func (s *Service) syncLoop(ctx context.Context) {
return
case <-s.syncChan:
ctx, span := tracing.StartSpanFromContext(ctx, "TreeService.sync")
Owner

ahh
Author
Member

If by "ahh" you mean "I don't like this empty line", I removed it. I wish gofmt would eventually have a say on empty lines, to avoid wasting time discussing them.

If by "ahh" you mean "I don't like this empty line", I removed it. I wish `gofmt` would eventually have a say on empty lines, to avoid wasting time discussing them.
Owner

It's not about an empty line per se, it's about unrelated changes in the commit.
ale64bit force-pushed feature/370-metrics-tree-service from 87cc5877ef to bfded3a547 2023-05-26 10:34:16 +00:00 Compare
fyrchik approved these changes 2023-05-26 13:39:04 +00:00
fyrchik merged commit bc34fee6a7 into master 2023-05-26 13:39:14 +00:00