Improve debug logging for APE check failures #1655

Open
aarifullin wants to merge 6 commits from aarifullin/frostfs-node:feat/improve_ape_log into master
Member

This pull request is a compromise between two suggestions, [#1](https://git.frostfs.info/TrueCloudLab/frostfs-node/issues/1574#issue-4876) and [#2](https://git.frostfs.info/TrueCloudLab/frostfs-node/issues/1574#issuecomment-63238):

  1. `Request` and `Resource` types expose their properties via `Properties()`.
  2. `ChainRouterError` is extended: the error message is unchanged, but the type now provides more fields (a rough sketch follows below).
  3. The object service logs all APE-check-related information when a `ChainRouterError` occurs.
  4. The tree service logs all APE-check-related information when a `ChainRouterError` occurs.

Close #1574

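For context, a rough sketch of what the extended error type could look like. The field names and the `Status`/`Request` stand-ins below are illustrative assumptions; the actual definitions live in `pkg/services/common/ape` and may differ.

```go
// Illustrative sketch only: field names and the Status/Request stand-ins are
// assumptions, not the PR's actual definitions.
package ape

import "fmt"

// Status stands in for the APE chain status type.
type Status string

// Request stands in for the APE request type; Properties() is the helper
// mentioned in point 1.
type Request struct {
	op    string
	props map[string]string
}

func (r Request) Operation() string             { return r.op }
func (r Request) Properties() map[string]string { return r.props }

// ChainRouterError keeps its original message but now carries the details
// needed for debug logging.
type ChainRouterError struct {
	operation string
	status    Status
	request   Request
	target    string
}

// Error is unchanged: the client-facing message stays concise.
func (e *ChainRouterError) Error() string {
	return fmt.Sprintf("access to operation %s is denied by access policy engine: %s", e.operation, e.status)
}

// The new getters expose the fields for the services' debug logs.
func (e *ChainRouterError) Status() Status   { return e.status }
func (e *ChainRouterError) Request() Request { return e.request }
func (e *ChainRouterError) Target() string   { return e.target }
```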
aarifullin added 4 commits 2025-02-24 16:53:54 +00:00
* Both methods are helpers only; they are not required for interface
  implementation.

Signed-off-by: Airat Arifullin <a.arifullin@yadro.com>
* Introduce new fields and getters for them;
* Fix `CheckAPE` in `checkerCoreImpl` at `newChainRouterError`.

Signed-off-by: Airat Arifullin <a.arifullin@yadro.com>
Signed-off-by: Airat Arifullin <a.arifullin@yadro.com>
[#1574] tree: Introduce debug logging for APE check failures
All checks were successful
060e4b2f76
Signed-off-by: Airat Arifullin <a.arifullin@yadro.com>
requested reviews from storage-core-developers, storage-core-committers, dkirillov 2025-02-24 16:54:11 +00:00
dkirillov approved these changes 2025-02-25 06:48:25 +00:00
Dismissed
dkirillov left a comment
Member

LGTM, but what about container operations?

Author
Member

@dkirillov wrote in [#1655 (comment)](https://git.frostfs.info/TrueCloudLab/frostfs-node/pulls/1655#issuecomment-68378):

LGTM, but what about container operations?

Oh, true. Forgot about this service. Let me fix this

fyrchik reviewed 2025-02-25 09:20:43 +00:00
@ -31,0 +38,4 @@
logger.Debug(ctx, logs.APECheckDeniedRequest,
zap.Stringer("status", chainRouterErr.Status()),
zap.String("operation", chainRouterErr.Request().Operation()),
zap.Any("target", chainRouterErr.Target()),
Owner

It might be easier to implement a custom encoder for the `*apecommon.ChainRouterError` type (or another custom type defined in this package).
Something like `"request": {"properties": {...}, "target": ...}, "resource": {"name": ..., "properties": ...}`.
Have you considered this? https://github.com/uber-go/zap/blob/6d482535bdd97f4d97b2f9573ac308f1cf9b574e/zapcore/marshaler.go#L30

Why?

  1. Because `.Any` or `.Stringer` allocates (likely) even when debug is enabled.
  2. Because variadic parameter passing may allocate too.
  3. Because `.Request().Properties()` may allocate too.
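For illustration, a minimal sketch of the encoder approach, building on the illustrative type from the description above; the real field layout in the PR may differ. `zapcore.ObjectMarshalerFunc` and `zap.Object` are the relevant zap primitives.

```go
// Hedged sketch: implementing zapcore.ObjectMarshaler defers field
// serialization until the entry is actually encoded. Field layout is illustrative.
package ape

import "go.uber.org/zap/zapcore"

func (e *ChainRouterError) MarshalLogObject(enc zapcore.ObjectEncoder) error {
	enc.AddString("status", string(e.status))
	return enc.AddObject("request", zapcore.ObjectMarshalerFunc(e.marshalRequest))
}

func (e *ChainRouterError) marshalRequest(enc zapcore.ObjectEncoder) error {
	enc.AddString("operation", e.request.Operation())
	enc.AddString("target", e.target)
	return enc.AddObject("properties", zapcore.ObjectMarshalerFunc(func(enc zapcore.ObjectEncoder) error {
		for k, v := range e.request.Properties() {
			enc.AddString(k, v)
		}
		return nil
	}))
}

// At the call site a single zap.Object field then carries the whole structure:
//
//	logger.Debug(ctx, logs.APECheckDeniedRequest, zap.Object("details", chainRouterErr))
```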
Author
Member

I have implemented an encoder for `*apecommon.ChainRouterError`. Please check.

I agree with your suggestion.

Owner

Looks neat!

fyrchik marked this conversation as resolved
fyrchik reviewed 2025-02-25 09:22:41 +00:00
@ -181,2 +220,4 @@
if err := p.apeChecker.CheckAPE(ctx, prm); err != nil {
var chainRouterErr *apecommon.ChainRouterError
if errors.As(err, &chainRouterErr) {
Owner

I think this is a bad practice:
Logging an error is one way to _handle_ an error.
At the same time we _return_ this error too, so it may be handled twice.
Do you have any other solutions in mind?

Author
Member

Logging an error is one way to handle an error.

Are you talking about auto-tests? Because I have no idea who else might be processing errors from debug logs.

Do you have any other solutions in mind?

A client should receive a status along with a concise reason. We can't make the error message contain thorough details about the failure: it could cause degradation, as the server would send big messages to the client on unsuccessful checks.

The idea "the client gets the chain router check error; we go to the server logs to find out the true reason" is absolutely OK with me.

Owner

We go to server logs to find out the true reason

Logging an error is perfectly fine.
Logging it twice is not.
In your example it is logged, then returned (on the caller it can also be logged (again), then returned etc.).
The easiest rule of thumb to solve this problem is to handle errors once.
Thus, my comment.

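A tiny illustration of the double handling being described, with placeholder names (`check`, `inner`, `outer` and the `log` helper are hypothetical):

```go
// The same error is logged at each layer and still returned,
// so it ends up in the log twice for a single failed operation.
func inner(ctx context.Context) error {
	if err := check(ctx); err != nil {
		log.Debug(ctx, "check failed", zap.Error(err)) // handled here by logging...
		return err                                     // ...and returned to the caller anyway
	}
	return nil
}

func outer(ctx context.Context) error {
	if err := inner(ctx); err != nil {
		log.Error(ctx, "operation finished with error", zap.Error(err)) // logged a second time
		return err
	}
	return nil
}
```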
Owner

As an example of how it might be solved: provide details for all requests and hide them behind a logging tag.
This way it is not an error and no garbage is seen.

Author
Member

In your example it is logged, then returned (on the caller it can also be logged (again), then returned etc.).

Now I see your point. Yes, I agree; this point is absolutely fair.

Author
Member

Let me explain why this is hardly possible and justify the current approach.

At the same time we return this error too, so it may be handled twice.

  1. `ChainRouterError` won't ever be processed on the calling side. For instance, the object service request executor (`analyzeStatus`) gets an `ObjectAccessDenied` error with a reason. It still gets the same message (`"access to operation <method> is denied by access policy engine: <status>"`). So, `ChainRouterError` is kind of an intermediate error, and its encoder is used to log messages only in the APE middleware. Please check out how `ChainRouterError` implements [Error()](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/ab3922488a671e0e5d342036eff9d73b2ef19e01/pkg/services/common/ape/error.go#L21) and how it goes to [apeErr](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/ab3922488a671e0e5d342036eff9d73b2ef19e01/pkg/services/common/ape/error.go#L21).
  2. There is no way to carry the error as a [logging tag](https://git.frostfs.info/TrueCloudLab/frostfs-node/pulls/1655#issuecomment-68578). We should keep some information within the context, but this won't work for methods like `Put` and `Patch` (gRPC streaming and all).
  3. The operations that log `"operation finished with error"` are `Get`, `Delete` and `Search`.
    `Put` and `Patch` are logged by the `audit` middleware at `Info` level.
  4. I suggest logging the message only when an error occurs. Logging every operation will blow up `journalctl`.
  5. So, taking 1), 2), 3) and 4) into account, I conclude that the current logging is helpful and fine.

I'd like to leave this as it is and we'll be fine.

Owner

That we don't see how to do better doesn't imply the current approach is either helpful or fine.

`ChainRouterError` won't ever be processed on the calling side.

Agree. But should it be an error then? What we log is the operation context.

There is no way to carry the error as a logging tag. We should keep some information within the context

We shouldn't: it specifies a place in the code and is not `context.Context`-dependent.
`log.WithTag(TagAPEOperationContext).Debug()` will do the job.
See #1619.

I suggest logging the message only when an error occurs.

Why, though? #1574 talks about understanding "what goes wrong". But unexpected ALLOWs could also happen, and they also need to be debugged. Having someone access my private container is definitely a bug, and an even more dangerous one.

Logging every operation will blow up `journalctl`

As enabling the debug level already does. Having a separate tag looks like a perfectly valid use case for tags to me.
That PR is in review, but if we had a _single_ function that logs the verbose info, reused between the container and object services, that would help tag adoption. And, of course, it depends on the load; consider a node living in the old epoch spamming HEAD requests to all nodes in the container.

Currently, the container and object services do the same thing here, but in different ways.
The common denominator seems to be the `router` (used directly in `container` and indirectly via `CheckAPE` in `object`).
Maybe have a custom router that logs everything?

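A hedged sketch of the "custom router that logs everything" idea; the `router` interface and the `Request`/`Status` types below are local stand-ins for illustration, not the real policy-engine API.

```go
// Illustrative decorator: every APE decision (ALLOW and DENY alike) is logged
// once, in a single place shared by the container and object services.
// A dedicated logging tag (see #1619) could be attached to this logger.
package ape

import "go.uber.org/zap"

type router interface {
	IsAllowed(chainName, namespace string, req Request) (Status, bool, error)
}

type loggingRouter struct {
	next router
	log  *zap.Logger
}

func (r *loggingRouter) IsAllowed(chainName, namespace string, req Request) (Status, bool, error) {
	status, found, err := r.next.IsAllowed(chainName, namespace, req)
	r.log.Debug("ape decision",
		zap.String("chain", chainName),
		zap.String("namespace", namespace),
		zap.String("operation", req.Operation()),
		zap.String("status", string(status)),
		zap.Bool("rule_found", found),
		zap.Error(err),
	)
	return status, found, err
}
```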
fyrchik reviewed 2025-02-25 09:24:57 +00:00
@ -68,1 +84,3 @@
return &Service{
func NewService(apeChecker Checker, next objectSvc.ServiceServer, opts ...func(s *Service)) *Service {
s := &Service{
logger: logger.NewLoggerWrapper(zap.L()),
Owner

The default should be `zap.NewNop()`, because we don't want to log sensitive data just anywhere; we want to precisely control all the sinks.

Owner

Or, even better, make this argument explicit and remove the variadic `opts` array.

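A sketch of the explicit-logger variant being asked for; type and field names follow the excerpt above where possible and are otherwise assumptions, so the final signature in the PR may differ.

```go
// NewService takes the logger explicitly instead of a variadic option: there is
// no hidden default sink, and a caller that really wants no output can still
// pass a logger backed by zap.NewNop().
func NewService(log *logger.Logger, apeChecker Checker, next objectSvc.ServiceServer) *Service {
	return &Service{
		logger:     log,
		apeChecker: apeChecker,
		next:       next,
	}
}
```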
Author
Member

Fixed

fyrchik marked this conversation as resolved
aarifullin force-pushed feat/improve_ape_log from 060e4b2f76 to bd30e4838f 2025-02-25 13:28:48 +00:00 Compare
aarifullin dismissed dkirillov's review 2025-02-25 13:28:49 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

Author
Member

@dkirillov wrote in [#1655 (comment)](https://git.frostfs.info/TrueCloudLab/frostfs-node/pulls/1655#issuecomment-68378):

LGTM, but what about container operations?

Added for the container service.

aarifullin force-pushed feat/improve_ape_log from bd30e4838f to ab3922488a 2025-02-25 14:07:03 +00:00 Compare
fyrchik reviewed 2025-02-26 12:31:45 +00:00
@ -34,0 +85,4 @@
func (e *ChainRouterError) marshalRequest(enc zapcore.ObjectEncoder) error {
enc.AddString("operation", e.request.Operation())
if err := enc.AddObject("target", zapcore.ObjectMarshalerFunc(
Owner

enc.AddObject("target", zapcore.ObjectMarshalerFunc(e.marshalTarget)) should work too and takes less space.

Owner

Also, in some other places this could be simplified too.

Author
Member

fixed

fyrchik marked this conversation as resolved
dkirillov reviewed 2025-02-27 08:22:22 +00:00
@ -110,1 +123,4 @@
if err := g.apeChecker.CheckAPE(g.Context(), prm); err != nil {
var chainRouterErr *apecommon.ChainRouterError
if errors.As(err, &chainRouterErr) {
logChainRouterError(g.Context(), g.logger, chainRouterErr)
Member

Why don't we log the error just once in [`g.apeChecker.CheckAPE`](https://git.frostfs.info/aarifullin/frostfs-node/src/commit/ab3922488a671e0e5d342036eff9d73b2ef19e01/pkg/services/object/ape/checker.go#L77)?

func (c *checkerImpl) CheckAPE(ctx context.Context, prm Prm) error {

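A hedged sketch of that suggestion: do the `errors.As` unwrapping and the debug log once, inside `CheckAPE` itself, so individual call sites don't repeat it. The inner `checkAPE` helper and the `logger` field are assumptions.

```go
// Sketch only: centralize the unwrapping and logging of ChainRouterError here.
func (c *checkerImpl) CheckAPE(ctx context.Context, prm Prm) error {
	err := c.checkAPE(ctx, prm) // hypothetical inner check doing the actual chain routing
	if err == nil {
		return nil
	}
	var chainRouterErr *apecommon.ChainRouterError
	if errors.As(err, &chainRouterErr) {
		c.logger.Debug(ctx, logs.APECheckDeniedRequest, zap.Object("details", chainRouterErr))
	}
	return err
}
```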
aarifullin force-pushed feat/improve_ape_log from ab3922488a to 86a0292145 2025-02-27 10:28:42 +00:00 Compare
acid-ant approved these changes 2025-03-04 12:20:02 +00:00
All checks were successful
This pull request has changes conflicting with the target branch.
  • internal/logs/logs.go
  • pkg/services/object/ape/service.go

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u feat/improve_ape_log:aarifullin-feat/improve_ape_log
git checkout aarifullin-feat/improve_ape_log
Reference: TrueCloudLab/frostfs-node#1655