Blobovnicza GET after PUT is inconsistent under high concurrency #536

Closed
opened 2023-07-20 09:53:06 +00:00 by ale64bit (Collaborator) · 1 comment

When issuing a synchronous `GET` call immediately after `PUT` while the storage is under heavy concurrent usage, `GET` sometimes returns `object not found`.

## Expected Behavior

It should either:

  1. Block indefinitely and allow cancellation via context
  2. Return an error that is representative of the problem (e.g. `UNAVAILABLE` in gRPC terminology, which is canonically retryable); a caller-side sketch of this option follows the list
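
For option 2, this is roughly what a retryable error could look like from the caller's side. It is a minimal sketch only: `errUnavailable` is a hypothetical sentinel used for illustration, the storage API does not expose anything like it today.

```go
package example

import (
	"context"
	"errors"
	"time"
)

// errUnavailable is a hypothetical sentinel standing in for a canonical
// retryable error (UNAVAILABLE in gRPC terms).
var errUnavailable = errors.New("storage temporarily unavailable")

// getWithRetry keeps retrying fn while it returns errUnavailable and stops
// as soon as the context is cancelled, so the caller stays in control.
func getWithRetry(ctx context.Context, fn func(context.Context) error) error {
	for {
		err := fn(ctx)
		if !errors.Is(err, errUnavailable) {
			return err // success or a non-retryable error
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Millisecond): // brief backoff before retrying
		}
	}
}
```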

## Current Behavior

Sporadically returns `object not found`.

## Possible Solution

Up for discussion.

## Steps to Reproduce (for bugs)

```go
func TestBlobugnicza(t *testing.T) {
    const n = 10000

    st := blobovniczatree.NewBlobovniczaTree(
        blobovniczatree.WithRootPath(t.TempDir()),
    )
    require.NoError(t, st.Open(false))
    require.NoError(t, st.Init())
    t.Cleanup(func() { require.NoError(t, st.Close()) })

    objGen := &testutil.SeqObjGenerator{ObjSize: 1}

    var cnt atomic.Int64
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for cnt.Add(1) <= n {
                obj := objGen.Next()
                addr := testutil.AddressFromObject(t, obj)

                raw, err := obj.Marshal()
                require.NoError(t, err)

                _, err = st.Put(context.Background(), common.PutPrm{
                    Address: addr,
                    RawData: raw,
                })
                require.NoError(t, err)

                _, err = st.Get(context.Background(), common.GetPrm{Address: addr})
                require.NoError(t, err) // fails very often, correlated to how many goroutines are started
            }
        }()
    }

    wg.Wait()
}
```
ale64bit added the bug, discussion, triage labels 2023-07-20 09:53:06 +00:00

The problem is with `opened_cache_size`: if it is small, we can get side effects, because blobovniczas are opened and closed concurrently due to the limited cache size, and nothing prevents a DB from being closed while some object is being read from it.

I suggest removing `opened_cache_size` completely and always caching everything in memory (see the sketch after this list):

  1. For low-memory systems, just make the tree smaller (or use FSTree).
  2. We can document how much memory the cache (it is just a map) takes, to ease configuration.
  3. It would also help us avoid bugs in the future (this is one of the most bug-prone parts of the code).
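
A minimal sketch of the always-open, map-based cache suggested above; `db`, `openDB`, and `openedDBs` are placeholder names for illustration, not the actual frostfs-node types:

```go
package example

import "sync"

// db stands in for an opened blobovnicza; the real type lives elsewhere in
// the codebase and is only sketched here.
type db struct{ path string }

func openDB(path string) (*db, error) { return &db{path: path}, nil }

// openedDBs is the suggested unbounded cache: every blobovnicza that has been
// opened stays open (and in the map) until the whole tree is closed, so a
// reader can never observe a DB that was closed concurrently by cache eviction.
type openedDBs struct {
	mu  sync.Mutex
	dbs map[string]*db
}

func newOpenedDBs() *openedDBs {
	return &openedDBs{dbs: make(map[string]*db)}
}

// get returns the already-open DB for path, opening and caching it on first use.
func (c *openedDBs) get(path string) (*db, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if d, ok := c.dbs[path]; ok {
		return d, nil
	}
	d, err := openDB(path)
	if err != nil {
		return nil, err
	}
	c.dbs[path] = d
	return d, nil
}
```

Nothing here closes a DB on eviction, so the memory cost is just the map plus one open handle per blobovnicza, which is the figure point 2 above proposes documenting.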
fyrchik added this to the v0.37.0 milestone 2023-07-25 06:59:26 +00:00
fyrchik added the frostfs-node label 2023-07-27 17:03:30 +00:00
fyrchik modified the milestone from v0.37.0 to v0.38.0 2023-08-29 09:35:33 +00:00
dstepanov-yadro was assigned by fyrchik 2023-08-30 09:07:38 +00:00
Reference: TrueCloudLab/frostfs-node#536