Revise logger levels #41

Open
opened 2023-02-02 17:11:35 +00:00 by carpawell · 3 comments
carpawell commented 2023-02-02 17:11:35 +00:00 (Migrated from github.com)

Suggested general rule: "info" for external events AND for accepting events to share them with others (e.g. pushing a "new epoch" event to the chain); "debug" for internal changes (a node receives "new epoch" and logs it via "info", but its internal handling could be logged with "debug").

Also, as discussed before, we have quite rare but really important events that deserve the "info"/"warn" level (since we store objects, we can log their removal with "warn" to prevent/investigate errors).
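A minimal sketch of how this level split could look, assuming a zap-based logger; the messages and fields are illustrative, not actual frostfs-node log entries:

```go
package main

import "go.uber.org/zap"

func main() {
	// Hypothetical logger setup; the node wires up its own zap logger.
	log, _ := zap.NewProduction()
	defer log.Sync()

	// External event received from the chain: INFO, shared context that
	// other operators can correlate with their own nodes.
	log.Info("new epoch event", zap.Uint64("epoch", 42))

	// Internal handling of that event: DEBUG, interesting only when
	// investigating this particular node.
	log.Debug("updating local epoch state", zap.Uint64("epoch", 42))

	// Rare but important storage event: WARN, so object removal stays
	// visible for later investigation.
	log.Warn("object removed", zap.String("address", "<container>/<object>"))
}
```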

Also, some important events can move from "debug" to "info":

  1. Tree service sync. It is an important service that often requires investigation;
  2. Any contract writes: reputation, container sizes, balance changes, audit reports, etc.;
  3. GC work: start, finish, number of removed objects, etc.;
  4. Many "errors but not errors": no free space / half free space / quarter free space on the WC/blobovniczas;
  5. Access errors (?);
  6. Many real errors are at "debug" but could be at "warn": object removal errors, GC errors, shard removal, shard closing, etc.

Also, I would expand "debug" with logs about every step inside every main process: before/after every contract info fetch; before/after every network communication; before/after every disk operation.
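A hedged sketch of such step-by-step "debug" tracing; `traceCall` is a made-up helper, not an existing function in the codebase, and zap is assumed as the logger:

```go
package tracing

import (
	"time"

	"go.uber.org/zap"
)

// traceCall is a hypothetical helper that emits a "debug" entry before and
// after an operation, in the spirit of "before/after every network
// communication / disk operation / contract info fetch".
func traceCall(log *zap.Logger, op string, fn func() error) error {
	log.Debug("starting operation", zap.String("op", op))
	start := time.Now()

	err := fn()

	log.Debug("finished operation",
		zap.String("op", op),
		zap.Duration("took", time.Since(start)),
		zap.Error(err))
	return err
}
```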

Suggestions are appreciated.

acid-ant commented 2023-02-06 07:48:09 +00:00 (Migrated from github.com)

Also, it would be nice to have information about endpoints in the log entry when the node communicates with external services, not only in the node config.
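For example (a sketch only, assuming the zap logger; `dialLogged` and the messages are invented for illustration), the endpoint could travel with the log entry as a structured field:

```go
package client

import "go.uber.org/zap"

// dialLogged is a hypothetical wrapper: the endpoint is attached to the log
// entry itself, so it is visible without opening the node config.
func dialLogged(log *zap.Logger, endpoint string, dial func(string) error) error {
	log.Info("connecting to external service", zap.String("endpoint", endpoint))
	if err := dial(endpoint); err != nil {
		log.Warn("connection failed",
			zap.String("endpoint", endpoint),
			zap.Error(err))
		return err
	}
	return nil
}
```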

carpawell commented 2023-02-06 15:23:28 +00:00 (Migrated from github.com)

Also, I would consider moving some important data to our metrics. It could help us spot important system changes instead of scrolling logs for minutes/hours (see https://github.com/TrueCloudLab/frostfs-node/issues/17 for some of my thoughts).
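A sketch of what "moving to metrics" could mean, assuming the Prometheus Go client; the metric name, labels, and helper are illustrative, not existing node metrics:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// objectsRemoved counts an important event instead of only logging it;
// a dashboard/alert can then show the trend without log scrolling.
var objectsRemoved = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "frostfs_node",
		Subsystem: "engine",
		Name:      "objects_removed_total",
		Help:      "Objects removed by GC, per shard.",
	},
	[]string{"shard"},
)

func init() {
	prometheus.MustRegister(objectsRemoved)
}

// recordRemoval would be called from the GC path after a successful delete.
func recordRemoval(shardID string, n int) {
	objectsRemoved.WithLabelValues(shardID).Add(float64(n))
}
```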

fyrchik commented 2023-02-06 17:19:49 +00:00 (Migrated from github.com)
  1. Tree service synchronization occurs regularly with lots of containers. I think INFO here may be overkill, but something like `sync cycle took 5m` is desirable.
  2. Do you mean _any_ contract write? IMO anything beyond `new epoch` and bootstrap queries is debug.
  3. Agreed with GC.
  4. So when WC is full, should we keep spamming the logs?
  5. IMO this is nice for the control service, but for other services "access denied" is a valid scenario -- it even has a proper status.
  6. The problem I see with this is, again, possible spam after lots of attempts. For non-logical errors we already have WARN (if they increase the error counter); do we need it for something else?

Instead of having INFO for external events I would suggest having INFO for events that imply a _state transition_. The first person who will be reading them is a service engineer. By "state" I mean the node's view of the network + storage. Basically, INFO logs should be readable sequentially to build a good approximation of what happens with the node: which peers are connected, what shards are online, etc.

So for the network this is the epoch, morph client switches, dropping/creating connections to other storage nodes (?), and shard mode changes (non-automatic; automatic ones can be WARN). GC/sync cycles can also be INFO as they are included in this "state": no new sync cycle can start unless the old one finishes.
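A hypothetical sketch of such state-transition logging (zap assumed, `logModeChange` and the messages invented), showing the INFO-for-operator-actions / WARN-for-automatic-degradation split:

```go
package shard

import "go.uber.org/zap"

// logModeChange illustrates the "state transition" idea: operator-driven mode
// changes go to INFO, automatic degradation goes to WARN, so reading INFO
// sequentially reconstructs the node's view of its storage.
func logModeChange(log *zap.Logger, shardID, oldMode, newMode string, automatic bool) {
	fields := []zap.Field{
		zap.String("shard", shardID),
		zap.String("from", oldMode),
		zap.String("to", newMode),
	}
	if automatic {
		log.Warn("shard mode changed automatically", fields...)
		return
	}
	log.Info("shard mode changed", fields...)
}
```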

fyrchik added this to the v0.38.0 milestone 2023-05-18 08:49:46 +00:00
fyrchik added P2, observability and removed P1 labels 2023-05-18 10:46:45 +00:00
fyrchik added P1 and removed P2 labels 2023-10-23 12:17:18 +00:00
fyrchik added P0 and removed P1 labels 2024-01-16 16:08:55 +00:00
fyrchik modified the milestone from v0.38.0 to v0.39.0 2024-02-28 19:26:45 +00:00
fyrchik modified the milestone from v0.39.0 to v0.40.0 2024-05-14 14:17:38 +00:00
fyrchik added P1 and removed P0 labels 2024-05-14 14:17:42 +00:00
fyrchik modified the milestone from v0.40.0 to v0.41.0 2024-06-01 09:19:54 +00:00
fyrchik modified the milestone from v0.41.0 to v0.42.0 2024-06-14 07:09:22 +00:00
fyrchik modified the milestone from v0.42.0 to v0.43.0 2024-07-23 06:34:49 +00:00
fyrchik modified the milestone from v0.43.0 to v0.44.0 2024-09-30 11:51:40 +00:00