[#370] Add tree service metrics #388

ale64bit · 2023-05-24T07:06:28Z

ale64bit commented

2023-05-24 07:06:28 +00:00

Signed-off-by: Alejandro Lopez <a.lopez@yadro.com>
Close #370

requested reviews from storage-core-committers, storage-core-developers

2023-05-24 07:06:47 +00:00

acid-ant approved these changes 2023-05-24 08:07:26 +00:00

dstepanov-yadro reviewed 2023-05-24 08:51:54 +00:00

pkg/services/tree/replicator.go Outdated

					
				@ -143,3 +145,4 @@

									zap.String("err", err.Error()),

									zap.Stringer("cid", op.cid),

									zap.String("treeID", op.treeID))

								s.metrics.IncReplicateErrors()

dstepanov-yadro commented

2023-05-24 08:51:51 +00:00

Why is the duration of failed replicate operations not measured?

ale64bit commented

2023-05-24 08:58:56 +00:00

I was concerned that if some of them failed too quickly, they would skew the distribution of those that succeeded.

dstepanov-yadro commented

2023-05-24 13:56:42 +00:00

We can use labels.
There are best practices: https://prometheus.io/docs/practices/naming/#labels

Use labels to differentiate the characteristics of the thing that is being measured:

* api_http_requests_total - differentiate request types: operation="create|update|delete"

* api_request_duration_seconds - differentiate request stages: stage="extract|transform|load"

ale64bit commented

2023-05-25 07:16:34 +00:00

done. Thanks for the suggestion.

dstepanov-yadro marked this conversation as resolved

fyrchik reviewed 2023-05-24 10:28:47 +00:00

pkg/services/tree/options.go Outdated

					
				@ -28,11 +28,14 @@ type cfg struct {

					cnrSource  ContainerSource

					eaclSource container.EACLSource

					forest     pilorama.Forest

fyrchik commented

2023-05-24 10:22:30 +00:00

unrelated to the commit.

ale64bit commented

2023-05-24 11:56:15 +00:00

done

fyrchik marked this conversation as resolved

pkg/services/tree/options.go Outdated

					
				@ -119,0 +122,4 @@

				func WithMetrics(v MetricsRegister) Option {

					return func(c *cfg) {

						c.metrics = v

fyrchik commented

2023-05-24 10:23:13 +00:00

What about a sane, non-nil dummy default?

ale64bit commented

2023-05-24 11:56:20 +00:00

done

fyrchik marked this conversation as resolved

pkg/services/tree/replicator.go Outdated

					
				@ -70,6 +70,7 @@ func (s *Service) replicationWorker(ctx context.Context) {

						case <-s.closeCh:

							return

						case task := <-s.replicationTasks:

							s.metrics.IncReplicateTasks()

fyrchik commented

2023-05-24 10:28:45 +00:00

And here we measure each operation multiple times. Was it intended? Compare with `AddReplicateDuration`.
What about replacing that duration with `ReplicateWaitTime`, and using Count + Duration here?

ale64bit commented

2023-05-24 11:56:32 +00:00

done

fyrchik marked this conversation as resolved

pkg/services/tree/replicator.go Outdated

					
				@ -145,1 +147,4 @@

									zap.String("treeID", op.treeID))

								s.metrics.IncReplicateErrors()

							} else {

								s.metrics.AddReplicateDuration(time.Since(start))

fyrchik commented

2023-05-24 10:26:57 +00:00

Just to be aware: `replicate` does 3 things:

1. Signs the operation.
2. Calculates the list of nodes to replicate to.
3. Sends the request.

While this metric is useful (how much time do we wait in queue), can we also measure the time spent in `replicationWorker`?

ale64bit commented

2023-05-24 11:56:46 +00:00

done

fyrchik marked this conversation as resolved

pkg/services/tree/sync.go Outdated

					
				@ -365,6 +366,32 @@ func (s *Service) SynchronizeAll() error {

					}

				}

				func (s *Service) sync(ctx context.Context) {

fyrchik commented

2023-05-24 10:24:04 +00:00

Again, what about a separate commit?

ale64bit commented

2023-05-24 11:56:52 +00:00

done

fyrchik marked this conversation as resolved

ale64bit force-pushed feature/370-metrics-tree-service from ee1aced7bd to 6d4047d00d

2023-05-24 11:55:59 +00:00

ale64bit force-pushed feature/370-metrics-tree-service from 6d4047d00d to 92f89d33b6

2023-05-25 07:16:08 +00:00

dstepanov-yadro reviewed 2023-05-25 13:10:23 +00:00

pkg/metrics/treeservice.go Outdated

					
				@ -0,0 +46,4 @@

							Name:      "replicate_wait_duration_seconds",

							Help:      "Duration of overall waiting time for replication loops",

						}, []string{treeServiceLabelSuccess}),

						syncOps: newCounter(prometheus.CounterOpts{

dstepanov-yadro commented

2023-05-25 13:10:19 +00:00

@ale64bit @fyrchik Maybe it's not worth separating operations and errors, and we should use labels everywhere instead? In fact, there is an operation that ended with an error or without an error.

ale64bit commented

2023-05-25 13:30:10 +00:00

I didn't fully understand what you meant by "In fact, there is an operation that ended with an error or without an error."?

dstepanov-yadro commented

2023-05-25 13:40:25 +00:00

I'm sorry, I didn't make myself clear enough.
This code:

			case <-s.syncChan:
				s.metrics.IncSyncOps()

				cnrs, err := s.cfg.cnrSource.List()
				if err != nil {
					s.metrics.IncSyncErrors()
					return
				}

Operations and errors are counted separately. I think it should be a single counter with label separation.
With separate counters, displaying the number of successful operations requires an expression such as `irate(sync_operation[1m]) - irate(sync_operation_errors[1m])` instead of the simpler `irate(sync_operation{success="true"}[1m])`.

ale64bit commented

2023-05-25 13:52:27 +00:00

You mean that we should just use a single `HistogramVec` for operations in general (with a `success` label), since it includes a count of the observations as well?
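A rough model of why a labelled histogram makes separate operation and error counters redundant. The `histogram` and `syncMetrics` types below are simplified, hypothetical stand-ins and not the Prometheus client types; the bucket bounds are arbitrary.

```go
package main

import "fmt"

// histogram is a simplified, hypothetical model of one histogram series:
// cumulative buckets plus the total count and sum of observations.
type histogram struct {
	bounds  []float64 // upper bounds in seconds, ascending
	buckets []uint64  // buckets[i] counts observations <= bounds[i]
	count   uint64
	sum     float64
}

func newHistogram(bounds ...float64) *histogram {
	return &histogram{bounds: bounds, buckets: make([]uint64, len(bounds))}
}

// Observe updates every bucket the value falls into, plus count and sum.
func (h *histogram) Observe(seconds float64) {
	for i, b := range h.bounds {
		if seconds <= b {
			h.buckets[i]++
		}
	}
	h.count++
	h.sum += seconds
}

// syncMetrics keeps one histogram per value of the "success" label.
type syncMetrics map[bool]*histogram

func (m syncMetrics) Observe(seconds float64, success bool) {
	if m[success] == nil {
		m[success] = newHistogram(0.1, 1, 10)
	}
	m[success].Observe(seconds)
}

func main() {
	m := syncMetrics{}
	m.Observe(0.5, true)
	m.Observe(2, true)
	m.Observe(0.05, false)
	// Operation counts per outcome come for free with the histogram.
	fmt.Println(m[true].count, m[false].count)
}
```

Each labelled series carries its own observation count, so the number of successful or failed operations can be read directly from it.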

dstepanov-yadro commented

2023-05-25 13:57:41 +00:00

yes

ale64bit commented

2023-05-25 14:04:35 +00:00

I agree it's better. Done.

dstepanov-yadro marked this conversation as resolved

dstepanov-yadro reviewed 2023-05-25 13:41:04 +00:00

pkg/services/tree/sync.go Outdated

					
				@ -382,2 +387,3 @@

								s.metrics.IncSyncErrors()

								span.End()

								continue

								return

dstepanov-yadro commented

2023-05-25 13:41:00 +00:00

why return?

ale64bit commented

2023-05-25 14:04:40 +00:00

fixed

dstepanov-yadro marked this conversation as resolved

ale64bit force-pushed feature/370-metrics-tree-service from 92f89d33b6 to 87cc5877ef

2023-05-25 14:03:47 +00:00

dstepanov-yadro approved these changes 2023-05-25 14:28:54 +00:00

fyrchik approved these changes 2023-05-26 09:10:05 +00:00

pkg/metrics/node.go Outdated

					
				@ -9,9 +12,10 @@ type NodeMetrics struct {

					engineMetrics

					stateMetrics

					replicatorMetrics

					epoch metric[prometheus.Gauge]

fyrchik commented

2023-05-26 09:07:59 +00:00

Why did we move this line?

ale64bit commented

2023-05-26 10:34:23 +00:00

I moved it to clearly split the embedded fields from the others, including this one.

pkg/services/tree/replicator.go Outdated

					
				@ -143,3 +148,4 @@

									zap.String("err", err.Error()),

									zap.Stringer("cid", op.cid),

									zap.String("treeID", op.treeID))

								s.metrics.AddReplicateWaitDuration(time.Since(start), false)

fyrchik commented

2023-05-26 09:09:34 +00:00

Can we unify these 3 lines into `s.metrics.AddReplicateWaitDuration(time.Since(start), err == nil)`?

ale64bit commented

2023-05-26 10:34:35 +00:00

done

fyrchik marked this conversation as resolved

pkg/services/tree/sync.go Outdated

					
				@ -386,10 +391,11 @@ func (s *Service) syncLoop(ctx context.Context) {

							newMap, cnrsToSync := s.containersToSync(cnrs)

							s.syncContainers(ctx, cnrsToSync)

fyrchik commented

2023-05-26 09:09:48 +00:00

Why this change?

ale64bit commented

2023-05-26 10:34:55 +00:00

the empty line seemed superfluous to me.

pkg/services/tree/sync.go Outdated

					
				@ -374,11 +375,15 @@ func (s *Service) syncLoop(ctx context.Context) {

							return

						case <-s.syncChan:

							ctx, span := tracing.StartSpanFromContext(ctx, "TreeService.sync")

fyrchik commented

2023-05-26 09:09:53 +00:00

ahh

ale64bit commented

2023-05-26 10:35:42 +00:00

If by "ahh" you mean "I don't like this empty line", I removed it. I wish `gofmt` would eventually have a say on empty lines, to avoid wasting time discussing them.

fyrchik commented

2023-05-26 10:48:28 +00:00

It's not about an empty line per se, it's about unrelated changes in the commit.