Flaky test: engine_test.go (TestLockExpiration, TestLockUserScenario, ...) #1700

Closed
opened 2025-03-31 16:18:11 +00:00 by potyarkin · 2 comments
Member

I've noticed that one of our tests is flaky: [example A](https://ci.frostfs.info/job/gerrit/job/frostfs-node/163/pipeline-console/?start-byte=0&selected-node=124#log-215), [example B](https://ci.frostfs.info/job/gerrit/job/frostfs-node/189/pipeline-console/?selected-node=18). The crash happens [here](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/12a0537a7a9f5af0b4c2bd20f1d5b9f1891897d4/pkg/local_object_storage/engine/engine_test.go#L167-L170); this method is called from several test cases: `TestLockExpiration`, `TestLockUserScenario`, and probably others.

```go
func (t *testQoSLimiter) Close() {
	require.Equal(t.t, int64(0), t.read.Load(), "read requests count after limiter close must be 0")
	require.Equal(t.t, int64(0), t.write.Load(), "write requests count after limiter close must be 0")
}
```
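
For context, this assertion only makes sense if `read` and `write` track in-flight requests: incremented when a request acquires the limiter, decremented when it is released. A minimal sketch of that pattern (hypothetical, just to illustrate what the counters mean; not the actual `testQoSLimiter` code, and the method names are made up):

```go
package qostest

import "sync/atomic"

// countingLimiter is a hypothetical stand-in for the counting part of a
// test QoS limiter: acquiring bumps a counter, and the returned release
// function must be called once the request finishes.
type countingLimiter struct {
	read, write atomic.Int64
}

// AcquireRead marks a read request as in flight.
func (l *countingLimiter) AcquireRead() (release func()) {
	l.read.Add(1)
	return func() { l.read.Add(-1) }
}

// AcquireWrite marks a write request as in flight.
func (l *countingLimiter) AcquireWrite() (release func()) {
	l.write.Add(1)
	return func() { l.write.Add(-1) }
}
```

Under that reading, `actual: 1` in the failure below means one write request had acquired the limiter but had not yet been released by the time the engine (and the limiter) was closed.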

Test failure looks like this:

```
engine_test.go:169:
    Error Trace:    ./pkg/local_object_storage/engine/engine_test.go:169
                    ./pkg/local_object_storage/shard/control.go:399
                    ./pkg/local_object_storage/engine/control.go:161
                    ./pkg/local_object_storage/engine/control.go:210
                    ./pkg/local_object_storage/engine/control.go:152
                    ./pkg/local_object_storage/engine/lock_test.go:166
                    ./pkg/local_object_storage/engine/lock_test.go:218
    Error:          Not equal:
                    expected: 0
                    actual  : 1
    Test:           TestLockExpiration
    Messages:       write requests count after limiter close must be 0
```

This is probably related to (or is a regression of) #788.

My intuition says that the flakiness might be caused by running tests in parallel ([lock_test.go#L144-L145](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/12a0537a7a9f5af0b4c2bd20f1d5b9f1891897d4/pkg/local_object_storage/engine/lock_test.go#L144-L145), [lock_test.go#L32-L33](https://git.frostfs.info/TrueCloudLab/frostfs-node/src/commit/12a0537a7a9f5af0b4c2bd20f1d5b9f1891897d4/pkg/local_object_storage/engine/lock_test.go#L32-L33)):

```go
func TestLockExpiration(t *testing.T) {
	t.Parallel()
```

```go
func TestLockUserScenario(t *testing.T) {
	t.Parallel()
```

To confirm my intuition I ran 1000 repetitions of `TestLockExpiration` and 1000 repetitions of the full test suite. Without parallel tests `TestLockExpiration` didn't fail, and with parallel tests the expected error occurred long before 1000 iterations (the same error, although in another test case, `TestLockUserScenario`). [Here are both runs](https://ci.frostfs.info/job/gerrit/job/frostfs-node/198/) (beware: the second log is 600 MB+, don't click "View full" in Firefox; download it to a local file instead). This is by no means exhaustive evidence, but I think this hunch warrants further investigation.
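
Roughly, the repetitions boil down to `go test` invocations like the ones below; the exact CI commands may differ, and the `-run` pattern here is only illustrative.

```sh
# Repeat a single test case many times within the engine package.
go test ./pkg/local_object_storage/engine/ -run 'TestLockExpiration$' -count=1000

# Repeat the whole package with t.Parallel() left in effect (the flaky scenario).
go test ./pkg/local_object_storage/engine/ -count=1000
```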

I'm not sure whether these particular test cases simply should not be parallelized or whether this is a sign of some deeper underlying problem, so I'm bringing it here.

potyarkin added the bug, help wanted labels 2025-03-31 16:18:11 +00:00
Member

> My intuition says that the flakiness might be caused by running tests in parallel:
>
> To confirm my intuition I ran 1000 repetitions of `TestLockExpiration` and 1000 repetitions of the full test suite. Without parallel tests `TestLockExpiration` didn't fail, and with parallel tests the expected error occurred long before 1000 iterations (the same error, although in another test case, `TestLockUserScenario`).

That's interesting. Each test uses its own engine instance, so they shouldn't affect each other, but your experiment says otherwise.

fyrchik added the P1 label 2025-03-31 17:43:30 +00:00
dstepanov-yadro was assigned by fyrchik 2025-03-31 17:43:38 +00:00
fyrchik added this to the v0.45.0 milestone 2025-03-31 17:43:40 +00:00
Author
Member

My experiment is circumstantial evidence at best. There is a non-negligible possibility that my intuition is wrong.
