[#370] Add tree service metrics #388

Merged
fyrchik merged 1 commits from ale64bit/frostfs-node:feature/370-metrics-tree-service into master 2023-05-26 13:39:14 +00:00
Collaborator

Signed-off-by: Alejandro Lopez a.lopez@yadro.com
Close #370

Signed-off-by: Alejandro Lopez <a.lopez@yadro.com> Close #370
ale64bit requested review from storage-core-committers 2023-05-24 07:06:46 +00:00
ale64bit requested review from storage-core-developers 2023-05-24 07:06:47 +00:00
acid-ant approved these changes 2023-05-24 08:07:26 +00:00
dstepanov-yadro reviewed 2023-05-24 08:51:54 +00:00
@ -143,3 +145,4 @@
zap.String("err", err.Error()),
zap.Stringer("cid", op.cid),
zap.String("treeID", op.treeID))
s.metrics.IncReplicateErrors()

Why failed replicate operation duration is not measured?

Why failed replicate operation duration is not measured?
Poster
Collaborator

I was concerned that if some of them failed too quick, they would skew the distribution of those that succeeded.

I was concerned that if some of them failed too quick, they would skew the distribution of those that succeeded.

We can use labels.
There are best practices: https://prometheus.io/docs/practices/naming/#labels

Use labels to differentiate the characteristics of the thing that is being measured:

* api_http_requests_total - differentiate request types: operation="create|update|delete"

* api_request_duration_seconds - differentiate request stages: stage="extract|transform|load"
We can use labels. There are best practices: https://prometheus.io/docs/practices/naming/#labels ``` Use labels to differentiate the characteristics of the thing that is being measured: * api_http_requests_total - differentiate request types: operation="create|update|delete" * api_request_duration_seconds - differentiate request stages: stage="extract|transform|load" ```
Poster
Collaborator

done. Thanks for the suggestion.

done. Thanks for the suggestion.
dstepanov-yadro marked this conversation as resolved
fyrchik reviewed 2023-05-24 10:28:47 +00:00
@ -28,11 +28,14 @@ type cfg struct {
cnrSource ContainerSource
eaclSource container.EACLSource
forest pilorama.Forest

unrelated to the commit.

unrelated to the commit.
Poster
Collaborator

done

done
fyrchik marked this conversation as resolved
@ -119,0 +122,4 @@
func WithMetrics(v MetricsRegister) Option {
return func(c *cfg) {
c.metrics = v

What about sane dummy non-null default?

What about sane dummy non-null default?
Poster
Collaborator

done

done
fyrchik marked this conversation as resolved
@ -70,6 +70,7 @@ func (s *Service) replicationWorker(ctx context.Context) {
case <-s.closeCh:
return
case task := <-s.replicationTasks:
s.metrics.IncReplicateTasks()

And here we measure each operation multiple times. Was in intended? Compare with AddReplicateDuration.
What about replacing that duration with ReplicateWaitTime, and using Count + Duration here?

And here we measure each operation multiple times. Was in intended? Compare with `AddReplicateDuration`. What about replacing that duration with `ReplicateWaitTime`, and using Count + Duration here?
Poster
Collaborator

done

done
fyrchik marked this conversation as resolved
@ -145,1 +147,4 @@
zap.String("treeID", op.treeID))
s.metrics.IncReplicateErrors()
} else {
s.metrics.AddReplicateDuration(time.Since(start))

Just to be aware: replicate does 3 things:

  1. Signs the operation.
  2. Calculates the list op nodes to replicate to.
  3. Sends the request.

While this metric is useful (how much time do we wait in queue), can we also measure the time spent in replicationWorker?

Just to be aware: `replicate` does 3 things: 1. Signs the operation. 2. Calculates the list op nodes to replicate to. 3. Sends the request. While this metric _is_ useful (how much time do we wait in queue), can we also measure the time spent in `replicationWorker`?
Poster
Collaborator

done

done
fyrchik marked this conversation as resolved
@ -365,6 +366,32 @@ func (s *Service) SynchronizeAll() error {
}
}
func (s *Service) sync(ctx context.Context) {

Again, what about separate commit?

Again, what about separate commit?
Poster
Collaborator

done

done
fyrchik marked this conversation as resolved
ale64bit force-pushed feature/370-metrics-tree-service from ee1aced7bd to 6d4047d00d 2023-05-24 11:55:59 +00:00 Compare
ale64bit force-pushed feature/370-metrics-tree-service from 6d4047d00d to 92f89d33b6 2023-05-25 07:16:08 +00:00 Compare
dstepanov-yadro reviewed 2023-05-25 13:10:23 +00:00
@ -0,0 +46,4 @@
Name: "replicate_wait_duration_seconds",
Help: "Duration of overall waiting time for replication loops",
}, []string{treeServiceLabelSuccess}),
syncOps: newCounter(prometheus.CounterOpts{

@ale64bit @fyrchik Maybe it's not worth separating operations and errors, but using labels everywhere? In fact, there is an operation that ended with an error or without an error.

@ale64bit @fyrchik Maybe it's not worth separating operations and errors, but using labels everywhere? In fact, there is an operation that ended with an error or without an error.
Poster
Collaborator

I didn't fully understand what you meant by "In fact, there is an operation that ended with an error or without an error."?

I didn't fully understand what you meant by *"In fact, there is an operation that ended with an error or without an error."*?

I'm sorry, I didn't make myself clear enough.
This code:

			case <-s.syncChan:

			s.metrics.IncSyncOps()

			cnrs, err := s.cfg.cnrSource.List()
			if err != nil {
				s.metrics.IncSyncErrors()
				return
			}

Operations and errors are counted separately. I think it should be a single counter with label separation.
If we need to display the number of success operations, then we should use such expression: irate(sync_operation[1m]) - irate(sync_operation_errors[1m]) instead of irate(sync_operation{success="true"}[1m])

I'm sorry, I didn't make myself clear enough. This code: ``` case <-s.syncChan: s.metrics.IncSyncOps() cnrs, err := s.cfg.cnrSource.List() if err != nil { s.metrics.IncSyncErrors() return } ``` Operations and errors are counted separately. I think it should be a single counter with label separation. If we need to display the number of success operations, then we should use such expression: `irate(sync_operation[1m]) - irate(sync_operation_errors[1m])` instead of `irate(sync_operation{success="true"}[1m])`
Poster
Collaborator

You mean that we should just use a single HistogramVec for operations in general? (with a success label). Since it includes a count of the observations as well.

You mean that we should just use a single `HistogramVec` for operations in general? (with a `success` label). Since it includes a count of the observations as well.

yes

yes
Poster
Collaborator

I agree it's better. Done.

I agree it's better. Done.
dstepanov-yadro marked this conversation as resolved
dstepanov-yadro reviewed 2023-05-25 13:41:04 +00:00
@ -382,2 +387,3 @@
s.metrics.IncSyncErrors()
span.End()
continue
return

why return?

why return?
Poster
Collaborator

fixed

fixed
dstepanov-yadro marked this conversation as resolved
ale64bit force-pushed feature/370-metrics-tree-service from 92f89d33b6 to 87cc5877ef 2023-05-25 14:03:47 +00:00 Compare
dstepanov-yadro approved these changes 2023-05-25 14:28:54 +00:00
fyrchik approved these changes 2023-05-26 09:10:05 +00:00
@ -9,9 +12,10 @@ type NodeMetrics struct {
engineMetrics
stateMetrics
replicatorMetrics
epoch metric[prometheus.Gauge]

Why did we move this line?

Why did we move this line?
Poster
Collaborator

I moved it to clearly split the embedded fields from the others, including this one.

I moved it to clearly split the embedded fields from the others, including this one.
@ -143,3 +148,4 @@
zap.String("err", err.Error()),
zap.Stringer("cid", op.cid),
zap.String("treeID", op.treeID))
s.metrics.AddReplicateWaitDuration(time.Since(start), false)

Can we unify 3 lines with s.metrics.AddReplicateWaitDuration(time.Since(start), err == nil)?

Can we unify 3 lines with `s.metrics.AddReplicateWaitDuration(time.Since(start), err == nil)`?
Poster
Collaborator

done

done
fyrchik marked this conversation as resolved
@ -386,10 +391,11 @@ func (s *Service) syncLoop(ctx context.Context) {
newMap, cnrsToSync := s.containersToSync(cnrs)
s.syncContainers(ctx, cnrsToSync)

Why this change?

Why this change?
Poster
Collaborator

the empty line seemed superfluous to me.

the empty line seemed superfluous to me.
@ -374,11 +375,15 @@ func (s *Service) syncLoop(ctx context.Context) {
return
case <-s.syncChan:
ctx, span := tracing.StartSpanFromContext(ctx, "TreeService.sync")

ahh

ahh
Poster
Collaborator

If by "ahh" you mean "I don't like this empty line", I removed it. I wish gofmt would eventually have a say on empty lines, to avoid wasting time discussing them.

If by "ahh" you mean "I don't like this empty line", I removed it. I wish `gofmt` would eventually have a say on empty lines, to avoid wasting time discussing them.

It's not about an empty line per se, it's about unrelated changes in the commit.

It's not about an empty line per se, it's about unrelated changes in the commit.
ale64bit force-pushed feature/370-metrics-tree-service from 87cc5877ef to bfded3a547 2023-05-26 10:34:16 +00:00 Compare
fyrchik approved these changes 2023-05-26 13:39:04 +00:00
fyrchik merged commit bc34fee6a7 into master 2023-05-26 13:39:14 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
4 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: TrueCloudLab/frostfs-node#388
There is no content yet.