Blobovnicza GET after PUT is inconsistent under high concurrency #536

Closed
opened 2023-07-20 09:53:06 +00:00 by ale64bit (Collaborator) · 1 comment

When issuing a synchronous `GET` call immediately after `PUT` while the storage is under heavy concurrent usage, `GET` sometimes returns `object not found`.

## Expected Behavior

It should either:

  1. Block indefinitely and allow cancellation via context
  2. Return an error that is representative of the problem (e.g. `UNAVAILABLE` in gRPC terminology, which is canonically retryable); a caller-side sketch of this option follows the list
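
For option 2, this is roughly what a retryable error could look like from the caller's side. It is a minimal sketch only: `errUnavailable` is a hypothetical sentinel used for illustration, the storage API does not expose anything like it today.

```go
package example

import (
	"context"
	"errors"
	"time"
)

// errUnavailable is a hypothetical sentinel standing in for a canonical
// retryable error (UNAVAILABLE in gRPC terms).
var errUnavailable = errors.New("storage temporarily unavailable")

// getWithRetry keeps retrying fn while it returns errUnavailable and stops
// as soon as the context is cancelled, so the caller stays in control.
func getWithRetry(ctx context.Context, fn func(context.Context) error) error {
	for {
		err := fn(ctx)
		if !errors.Is(err, errUnavailable) {
			return err // success or a non-retryable error
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Millisecond): // brief backoff before retrying
		}
	}
}
```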

## Current Behavior

Sporadically returns `object not found`.

## Possible Solution

Up for discussion.

## Steps to Reproduce (for bugs)

```go
func TestBlobugnicza(t *testing.T) {
    const n = 10000

    st := blobovniczatree.NewBlobovniczaTree(
        blobovniczatree.WithRootPath(t.TempDir()),
    )
    require.NoError(t, st.Open(false))
    require.NoError(t, st.Init())
    t.Cleanup(func() { require.NoError(t, st.Close()) })

    objGen := &testutil.SeqObjGenerator{ObjSize: 1}

    var cnt atomic.Int64
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for cnt.Add(1) <= n {
                obj := objGen.Next()
                addr := testutil.AddressFromObject(t, obj)

                raw, err := obj.Marshal()
                require.NoError(t, err)

                _, err = st.Put(context.Background(), common.PutPrm{
                    Address: addr,
                    RawData: raw,
                })
                require.NoError(t, err)

                _, err = st.Get(context.Background(), common.GetPrm{Address: addr})
                require.NoError(t, err) // fails very often, correlated to how many goroutines are started
            }
        }()
    }

    wg.Wait()
}
```
ale64bit added the bug, discussion, triage labels 2023-07-20 09:53:06 +00:00

The problem is with `opened_cache_size`: if it is small, we can get side effects, because blobovniczas are opened and closed concurrently due to the limited cache size, and nothing prevents a DB from being closed while some object is being read from it.

I suggest removing `opened_cache_size` completely and always caching everything in memory (see the sketch after this list):

  1. For low-memory systems, just make the tree smaller (or use FSTree).
  2. We can document how much memory the cache (it is just a map) takes, to ease configuration.
  3. It would also help us avoid bugs in the future (this is one of the most bug-prone parts of the code).
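
A minimal sketch of the always-open, map-based cache suggested above; `db`, `openDB`, and `openedDBs` are placeholder names for illustration, not the actual frostfs-node types:

```go
package example

import "sync"

// db stands in for an opened blobovnicza; the real type lives elsewhere in
// the codebase and is only sketched here.
type db struct{ path string }

func openDB(path string) (*db, error) { return &db{path: path}, nil }

// openedDBs is the suggested unbounded cache: every blobovnicza that has been
// opened stays open (and in the map) until the whole tree is closed, so a
// reader can never observe a DB that was closed concurrently by cache eviction.
type openedDBs struct {
	mu  sync.Mutex
	dbs map[string]*db
}

func newOpenedDBs() *openedDBs {
	return &openedDBs{dbs: make(map[string]*db)}
}

// get returns the already-open DB for path, opening and caching it on first use.
func (c *openedDBs) get(path string) (*db, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if d, ok := c.dbs[path]; ok {
		return d, nil
	}
	d, err := openDB(path)
	if err != nil {
		return nil, err
	}
	c.dbs[path] = d
	return d, nil
}
```

Nothing here closes a DB on eviction, so the memory cost is just the map plus one open handle per blobovnicza, which is the figure point 2 above proposes documenting.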
fyrchik added this to the v0.37.0 milestone 2023-07-25 06:59:26 +00:00
fyrchik added the frostfs-node label 2023-07-27 17:03:30 +00:00
fyrchik modified the milestone from v0.37.0 to v0.38.0 2023-08-29 09:35:33 +00:00
dstepanov-yadro was assigned by fyrchik 2023-08-30 09:07:38 +00:00
Reference: TrueCloudLab/frostfs-node#536