[#370] Add tree service metrics #388
Reference: TrueCloudLab/frostfs-node#388
No description provided.
Signed-off-by: Alejandro Lopez a.lopez@yadro.com
Close #370
@ -143,3 +145,4 @@
zap.String("err", err.Error()),
zap.Stringer("cid", op.cid),
zap.String("treeID", op.treeID))
s.metrics.IncReplicateErrors()
Why is the duration of failed replicate operations not measured?
I was concerned that if some of them failed too quickly, they would skew the distribution of those that succeeded.
We can use labels.
There are best practices: https://prometheus.io/docs/practices/naming/#labels
done. Thanks for the suggestion.
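For illustration only, a minimal sketch of the label-based approach suggested above, assuming a Prometheus HistogramVec with a success label (the package, namespace, and metric names here are illustrative, not the ones used in this PR):

package metrics

import (
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// replicateDuration uses a "success" label so failed and successful
// replications land in separate series instead of skewing a shared
// distribution.
var replicateDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "frostfs",     // illustrative namespace
	Subsystem: "treeservice", // illustrative subsystem
	Name:      "replicate_duration_seconds",
	Help:      "Duration of a single replicate operation",
}, []string{"success"})

func init() {
	prometheus.MustRegister(replicateDuration)
}

// observeReplicate records one replication attempt, labelled by outcome.
func observeReplicate(d time.Duration, ok bool) {
	replicateDuration.WithLabelValues(strconv.FormatBool(ok)).Observe(d.Seconds())
}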
@ -28,11 +28,14 @@ type cfg struct {
cnrSource ContainerSource
eaclSource container.EACLSource
forest pilorama.Forest
unrelated to the commit.
done
@ -119,0 +122,4 @@
func WithMetrics(v MetricsRegister) Option {
return func(c *cfg) {
c.metrics = v
What about a sane dummy non-nil default?
done
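A possible shape for such a default, sketched only from the method names that appear in this review (the real MetricsRegister, cfg, and defaults in the PR contain more than shown here):

package tree

import "time"

// MetricsRegister lists only the methods mentioned in this review; the
// actual interface is larger.
type MetricsRegister interface {
	IncReplicateTasks()
	IncReplicateErrors()
	AddReplicateWaitDuration(d time.Duration, success bool)
	IncSyncErrors()
}

// noopMetrics is a do-nothing default, so the service can call c.metrics
// unconditionally even when the WithMetrics option is never supplied.
type noopMetrics struct{}

func (noopMetrics) IncReplicateTasks()                           {}
func (noopMetrics) IncReplicateErrors()                          {}
func (noopMetrics) AddReplicateWaitDuration(time.Duration, bool) {}
func (noopMetrics) IncSyncErrors()                               {}

// cfg is reduced to the relevant field here for brevity.
type cfg struct {
	metrics MetricsRegister
}

func defaultCfg() *cfg {
	return &cfg{metrics: noopMetrics{}}
}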
@ -70,6 +70,7 @@ func (s *Service) replicationWorker(ctx context.Context) {
case <-s.closeCh:
return
case task := <-s.replicationTasks:
s.metrics.IncReplicateTasks()
And here we measure each operation multiple times. Was it intended? Compare with AddReplicateDuration. What about replacing that duration with ReplicateWaitTime, and using Count + Duration here?
done
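One way to read that suggestion, sketched with illustrative types (the enqueuedAt field, the two-argument AddReplicateDuration, and the callback shape are assumptions, not code from this PR): count each task in the worker, record how long it waited in the queue, and record the processing time as a separate observation.

package tree

import (
	"context"
	"time"
)

// replicationTask is illustrative only; the real task type differs.
type replicationTask struct {
	enqueuedAt time.Time // hypothetical: when the task entered the queue
}

type workerMetrics interface {
	IncReplicateTasks()
	AddReplicateWaitDuration(d time.Duration, success bool)
	AddReplicateDuration(d time.Duration, success bool)
}

// replicationWorkerSketch counts each task, records its time in the queue,
// and records the duration of the replicate call separately.
func replicationWorkerSketch(ctx context.Context, tasks <-chan replicationTask, m workerMetrics, replicate func(replicationTask) error) {
	for {
		select {
		case <-ctx.Done():
			return
		case task := <-tasks:
			m.IncReplicateTasks()
			m.AddReplicateWaitDuration(time.Since(task.enqueuedAt), true)

			start := time.Now()
			err := replicate(task)
			m.AddReplicateDuration(time.Since(start), err == nil)
		}
	}
}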
@ -145,1 +147,4 @@
zap.String("treeID", op.treeID))
s.metrics.IncReplicateErrors()
} else {
s.metrics.AddReplicateDuration(time.Since(start))
Just to be aware: replicate does 3 things. While this metric is useful (how much time do we wait in the queue), can we also measure the time spent in replicationWorker?
done
@ -365,6 +366,32 @@ func (s *Service) SynchronizeAll() error {
}
}
func (s *Service) sync(ctx context.Context) {
Again, what about a separate commit?
done
ee1aced7bd to 6d4047d00d
6d4047d00d to 92f89d33b6
@ -0,0 +46,4 @@
Name: "replicate_wait_duration_seconds",
Help: "Duration of overall waiting time for replication loops",
}, []string{treeServiceLabelSuccess}),
syncOps: newCounter(prometheus.CounterOpts{
@ale64bit @fyrchik Maybe it's not worth separating operations and errors, but rather using labels everywhere? After all, there is just an operation that ended either with or without an error.
I didn't fully understand what you meant by "there is just an operation that ended either with or without an error"?
I'm sorry, I didn't make myself clear enough.
In this code, operations and errors are counted separately. I think it should be a single counter with label separation.
If we need to display the number of successful operations, we would have to use an expression like
irate(sync_operation[1m]) - irate(sync_operation_errors[1m])
instead of
irate(sync_operation{success="true"}[1m])
You mean that we should just use a single HistogramVec for operations in general (with a success label)? Since it includes a count of the observations as well.
yes
I agree it's better. Done.
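For reference (the metric name below depends on the final naming in the PR): a Prometheus histogram exports <name>_count, <name>_sum, and <name>_bucket series per label set, so with a single success-labelled HistogramVec the error count is already available as, e.g., irate(tree_sync_operations_seconds_count{success="false"}[1m]), without a separate error counter.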
@ -382,2 +387,3 @@
s.metrics.IncSyncErrors()
span.End()
continue
return
why return?
fixed
92f89d33b6 to 87cc5877ef
@ -9,9 +12,10 @@ type NodeMetrics struct {
engineMetrics
stateMetrics
replicatorMetrics
epoch metric[prometheus.Gauge]
Why did we move this line?
I moved it to clearly split the embedded fields from the others, including this one.
@ -143,3 +148,4 @@
zap.String("err", err.Error()),
zap.Stringer("cid", op.cid),
zap.String("treeID", op.treeID))
s.metrics.AddReplicateWaitDuration(time.Since(start), false)
Can we unify 3 lines with s.metrics.AddReplicateWaitDuration(time.Since(start), err == nil)?
done
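A self-contained sketch of that unification, with the logger and the apply callback standing in for the real code around this hunk (names are illustrative): the wait duration is recorded exactly once, with err == nil as the success flag.

package tree

import (
	"time"

	"go.uber.org/zap"
)

type waitMetrics interface {
	AddReplicateWaitDuration(d time.Duration, success bool)
}

// applyAndRecord logs only on failure but records the wait duration in a
// single call, passing err == nil instead of branching on success/failure.
func applyAndRecord(log *zap.Logger, m waitMetrics, start time.Time, apply func() error) {
	err := apply()
	if err != nil {
		log.Error("failed to apply replicated operation", zap.Error(err))
	}
	m.AddReplicateWaitDuration(time.Since(start), err == nil)
}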
@ -386,10 +391,11 @@ func (s *Service) syncLoop(ctx context.Context) {
newMap, cnrsToSync := s.containersToSync(cnrs)
s.syncContainers(ctx, cnrsToSync)
Why this change?
the empty line seemed superfluous to me.
@ -374,11 +375,15 @@ func (s *Service) syncLoop(ctx context.Context) {
return
case <-s.syncChan:
ctx, span := tracing.StartSpanFromContext(ctx, "TreeService.sync")
ahh
If by "ahh" you mean "I don't like this empty line", I removed it. I wish
gofmt
would eventually have a say on empty lines, to avoid wasting time discussing them.It's not about an empty line per se, it's about unrelated changes in the commit.
87cc5877ef to bfded3a547