gc: Skip received event if previous still processing #1228

Closed
acid-ant wants to merge 1 commit from acid-ant/frostfs-node:bugfix/gc-unblock-evnt-handler into master
Member

The blocking send in the notification channel was fixed by #1238 (the send is now non-blocking).
But we still have an issue with handling multiple events:

  • Event #1 is processed.
  • Event #2 is stuck waiting for the processing of Event #1 to finish.
  • Event #3 and the following events will be skipped (illustrated in the sketch below).

Signed-off-by: Anton Nikiforov an.nikiforov@yadro.com
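
To make the failure mode concrete, here is a minimal, self-contained Go sketch (names, buffer size, and timings are illustrative, not the shard GC code): with a non-blocking send into a single-slot notification channel, Event #1 is picked up and processed, Event #2 sits in the buffer waiting, and Event #3 and later events are silently dropped.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	notifications := make(chan int, 1) // notification channel with a single-slot buffer

	// Consumer: stands in for the GC event loop; each event takes a long time to process.
	go func() {
		for e := range notifications {
			fmt.Println("processing event", e)
			time.Sleep(time.Second) // Event #1 is still being processed here
		}
	}()

	for i := 1; i <= 3; i++ {
		select {
		case notifications <- i: // Event #2 lands in the buffer and waits for Event #1
		default:
			fmt.Println("skipped event", i) // Event #3 (and later ones) are dropped
		}
		time.Sleep(50 * time.Millisecond) // give the consumer a chance to pick up Event #1
	}

	time.Sleep(2 * time.Second) // let the demo finish before exiting
}
```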

acid-ant force-pushed bugfix/gc-unblock-evnt-handler from 67133a3766 to cb113bd046 2024-07-04 14:51:25 +00:00 Compare
acid-ant changed title from gc: Skip received event if previous still processing to WIP: gc: Skip received event if previous still processing 2024-07-05 06:54:23 +00:00
acid-ant force-pushed bugfix/gc-unblock-evnt-handler from cb113bd046 to 0b95f40cba 2024-07-05 10:26:19 +00:00 Compare
acid-ant force-pushed bugfix/gc-unblock-evnt-handler from 0b95f40cba to c524dfb80c 2024-07-05 10:40:49 +00:00 Compare
acid-ant requested review from storage-core-committers 2024-07-05 10:56:00 +00:00
acid-ant requested review from storage-core-developers 2024-07-05 10:56:01 +00:00
acid-ant changed title from WIP: gc: Skip received event if previous still processing to gc: Skip received event if previous still processing 2024-07-05 10:56:08 +00:00
dstepanov-yadro reviewed 2024-07-05 11:57:56 +00:00
@ -178,19 +178,14 @@ func (gc *gc) listenEvents(ctx context.Context) {
}
func (gc *gc) handleEvent(ctx context.Context, event Event) {
if !gc.processEvnt.CompareAndSwap(false, true) {

Here is compare and swap `false` -> `true`, but where is the `true` -> `false` swap?
Author
Member

Thanks, lost it after refactoring. Added a routine which will wait for the result of processing.
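
A minimal sketch of the guard being discussed here, assuming plain goroutines instead of the project's worker pool and simplified names: `CompareAndSwap(false, true)` takes the guard and skips the event when processing is already in progress, and a separate routine resets it to `false` once all handlers of the current event have finished (the `true` -> `false` swap asked about above).

```go
package main

import (
	"context"
	"sync"
	"sync/atomic"
)

// gc is a reduced sketch of the shard GC state relevant to this discussion:
// a guard flag plus a wait group for the handlers of the current event.
type gc struct {
	processEvent atomic.Bool
	handlersWG   sync.WaitGroup
}

func (g *gc) handleEvent(ctx context.Context, handlers []func(context.Context)) {
	// false -> true: take the guard; if an event is already being processed, skip this one.
	if !g.processEvent.CompareAndSwap(false, true) {
		return
	}

	for _, h := range handlers {
		g.handlersWG.Add(1)
		h := h
		go func() { // the real code submits to a worker pool; plain goroutines keep the sketch small
			defer g.handlersWG.Done()
			h(ctx)
		}()
	}

	// true -> false: release the guard once every handler of this event has finished.
	go func() {
		g.handlersWG.Wait()
		g.processEvent.Store(false)
	}()
}

func main() {
	g := &gc{}
	g.handleEvent(context.Background(), []func(context.Context){
		func(context.Context) { /* e.g. collect expired objects */ },
	})
	g.handlersWG.Wait()
}
```
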
acid-ant changed title from gc: Skip received event if previous still processing to WIP: gc: Skip received event if previous still processing 2024-07-07 18:13:32 +00:00
acid-ant force-pushed bugfix/gc-unblock-evnt-handler from c524dfb80c to 1f63926ab2 2024-07-08 06:53:31 +00:00 Compare
acid-ant force-pushed bugfix/gc-unblock-evnt-handler from 1f63926ab2 to 70148790c3 2024-07-08 07:52:25 +00:00 Compare
acid-ant changed title from WIP: gc: Skip received event if previous still processing to gc: Skip received event if previous still processing 2024-07-08 08:14:30 +00:00
fyrchik reviewed 2024-07-08 12:16:36 +00:00
fyrchik left a comment
Owner

Other than the comments, LGTM
@ -102,6 +101,7 @@ type gc struct {
onceStop sync.Once
stopChannel chan struct{}
wg sync.WaitGroup
processEvnt atomic.Bool
Owner

typo in `processEvent`?
If you don't want other lines to become part of the diff, you could add an empty line before the new field declaration.
Author
Member

Thanks, I'll take this into account in the future.
fyrchik marked this conversation as resolved
@ -181,1 +182,4 @@
gc.log.Warn(logs.ShardEventProcessingInProgress)
return
}
v, ok := gc.mEventHandler[event.typ()]
Owner

Just to clarify: do we have only 1 event type?
Author
Member

Right now, yes, only for the new epoch.
fyrchik marked this conversation as resolved
@ -212,2 +206,4 @@
v.wg.Done()
}
}
go func() {
Owner

It is not obvious why this code should not be executed if we exit prematurely.
Could you elaborate a bit on why this is the right behaviour?
Author
Member

I don't like this approach either. It was a compromise, because we are handling only one type of event.
Refactored a bit to avoid leaking the goroutines.
fyrchik marked this conversation as resolved
fyrchik reviewed 2024-07-08 12:18:06 +00:00
@ -212,2 +207,4 @@
}
}
go func() {
gc.log.Debug(logs.ShardEventProcessingWait)
Owner

Also, it seems this goroutine can leak: who waits until it is finished?
dstepanov-yadro reviewed 2024-07-09 07:40:05 +00:00
@ -213,1 +208,4 @@
}
go func() {
gc.log.Debug(logs.ShardEventProcessingWait)
v.wg.Wait()

Why is the waiting performed by a separate goroutine?
Author
Member

Refactored this part. Please review.
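
On the leak concern: a common way to keep such a wait routine from escaping the shard's lifetime is to register it with the gc's own WaitGroup (the `wg` field visible in the diff above), so that shutdown waits for it. A rough sketch under that assumption, with hypothetical names (`gcWG`, `releaseGuardWhenDone`), not the PR's final code:

```go
package shard

import (
	"sync"
	"sync/atomic"
)

type gc struct {
	gcWG         sync.WaitGroup // hypothetical: tracks background goroutines owned by gc
	processEvent atomic.Bool    // "event in progress" guard
}

// releaseGuardWhenDone resets the guard after all handlers of the current event
// finish. Registering the waiter with gcWG means a Stop() that calls gcWG.Wait()
// also waits for this goroutine, so it cannot leak past shutdown.
func (g *gc) releaseGuardWhenDone(handlers *sync.WaitGroup) {
	g.gcWG.Add(1)
	go func() {
		defer g.gcWG.Done()
		handlers.Wait()
		g.processEvent.Store(false)
	}()
}
```
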
acid-ant force-pushed bugfix/gc-unblock-evnt-handler from 70148790c3 to d0ac7897e1 2024-07-09 08:02:06 +00:00 Compare
dstepanov-yadro approved these changes 2024-07-09 08:19:45 +00:00
aarifullin reviewed 2024-07-09 09:24:20 +00:00
@ -203,2 +204,2 @@
defer v.prevGroup.Done()
h(runCtx, event)
h(ctx, event)
n := v.runningHandlers.Add(-1)
Member

Let `v.runningHandlers = 1` before this assignment. When it's decremented, it becomes `0` and the next incoming event is able to perform the `CompareAndSwap` above. As soon as the next event passes through this gate, it starts iterating over `v.handlers` and incrementing the wait group's counter with `v.wg.Add(1)`. Can't this cause a hang in `v.wg.Done()` within a submitted handler that hasn't finished yet?
Author
Member

`v.wg.Done()` is equal to `v.wg.Add(-1)`, and according to the source code it is an operation on an `atomic`.
As I understand it, you are talking about a situation where we are able to submit to the pool, but there are no ready workers to process the submitted action, because the previously submitted action unblocked processing but is still in the work queue.
I don't think it is a problem.
Member

Sorry, I quickly glanced at the code and mistook `Done` for `Wait`.
Yes, in this case I believe we have no problem.
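
As a side note on the exchange above: in the standard library, `sync.WaitGroup.Done` is literally `Add(-1)` and never blocks; only `Wait` blocks until the counter reaches zero. A trivial standalone illustration:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)

	go func() {
		defer wg.Done() // equivalent to wg.Add(-1); decrements the counter, never blocks
		fmt.Println("handler finished")
	}()

	wg.Wait() // only Wait blocks, until the counter reaches zero
	fmt.Println("all handlers done")
}
```
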
aarifullin marked this conversation as resolved
aarifullin approved these changes 2024-07-09 10:51:59 +00:00
achuprov approved these changes 2024-07-09 10:56:36 +00:00
fyrchik requested changes 2024-07-09 12:50:04 +00:00
@ -195,2 +192,4 @@
select {
case <-ctx.Done():
n := v.runningHandlers.Add(int32(i - len(v.handlers)))
if n == 0 {
Owner

We have this `Add(-1) + log` in 3 different places, that's 12 lines for one debug log. Can it be made simpler?
Author
Member

You are right, wrapped this part into a function.
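
The wrapper mentioned here could look roughly like the following sketch; the helper name, the log message, and the use of `log/slog` are assumptions for illustration, not the PR's actual code:

```go
package main

import (
	"log/slog"
	"os"
	"sync/atomic"
)

// decrementAndLog is a hypothetical helper: it adds delta (normally -1) to the
// running-handlers counter and logs once when no handlers remain, so the
// Add(-1)+log pair lives in a single place instead of three.
func decrementAndLog(counter *atomic.Int32, delta int32, log *slog.Logger) {
	if n := counter.Add(delta); n == 0 {
		log.Debug("all event handlers finished")
	}
}

func main() {
	log := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))

	var running atomic.Int32
	running.Store(3)
	for i := 0; i < 3; i++ {
		decrementAndLog(&running, -1, log) // only the last call emits the debug line
	}
}
```
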
Owner

Anyway, how is this solution different from a non-blocking send in the notification channel in the epoch handler?
Author
Member

Are you talking about a buffered channel?
Owner

It doesn't matter: our handler hangs, and with a non-blocking send it won't.
Author
Member

It hangs only while submitting into the pool. Do you think we should reduce this time too?
Author
Member

@fyrchik, created https://git.frostfs.info/TrueCloudLab/frostfs-node/pulls/1238.
Owner

Submitting to the pool stops because all workers in it are busy.
All workers are busy because they block on the channel send.

#1238 solves this problem completely, so do we need this PR?
Author
Member

Looks like we can close it: we have only one event type, and it is not a big deal for us to wait for the previous event to complete, as long as we don't block on a new notification.
acid-ant force-pushed bugfix/gc-unblock-evnt-handler from d0ac7897e1 to 2614f972fd 2024-07-09 14:32:00 +00:00 Compare
acid-ant force-pushed bugfix/gc-unblock-evnt-handler from 2614f972fd to f6247edd24 2024-07-10 12:57:17 +00:00 Compare
acid-ant force-pushed bugfix/gc-unblock-evnt-handler from f6247edd24 to aeafa97f5a 2024-07-10 13:42:26 +00:00 Compare
acid-ant closed this pull request 2024-07-10 14:11:11 +00:00
All checks were successful:

  • DCO action / DCO (pull_request): Successful in 3m27s
  • Vulncheck / Vulncheck (pull_request): Successful in 4m20s
  • Build / Build Components (1.22) (pull_request): Successful in 5m33s
  • Build / Build Components (1.21) (pull_request): Successful in 5m40s
  • Pre-commit hooks / Pre-commit (pull_request): Successful in 7m0s
  • Tests and linters / gopls check (pull_request): Successful in 7m41s
  • Tests and linters / Staticcheck (pull_request): Successful in 7m45s
  • Tests and linters / Lint (pull_request): Successful in 8m22s
  • Tests and linters / Tests (1.21) (pull_request): Successful in 12m25s
  • Tests and linters / Tests with -race (pull_request): Successful in 12m26s
  • Tests and linters / Tests (1.22) (pull_request): Successful in 12m51s

Pull request closed

Reference: TrueCloudLab/frostfs-node#1228