Retries in restic try to solve two main problems:
- retry a temporarily failed operation
- tolerate temporary network interruptions
The first problem only requires a few retries, whereas the last one benefits
primarily from spreading the requests over a longer duration.
Increasing the default multiplier and the initial interval works for
both cases. The first few retries only take a few seconds, while later
retries quickly reach the maximum interval of one minute. This ensures
that the total number of retries issued by restic will remain at around
21 retries for a 15 minute period. As the concurrency in restic is
bounded, retries drastically reduce the number of requests sent to a
backend. This helps to prevent overloading the backend.
Previously, if an operation failed after 15 minutes, then it would never
be retried. This means that large backend requests are more unreliable
than smaller ones.
Depending on how long an operation takes to fail, the total retry
duration can currently vary between 1.5 and 15 minutes. In particular
for temporarily interrupted network connections, the former timeout is
too short. Thus always use a limit of 15 minutes.
RemoveUnpacked will eventually block removal of all filetypes other than
snapshots. However, getting there requires a major refactor to provide
some components with privileged access.
This ensures that the pack header is actually read completely.
Previously, for a truncated file it was possible to only read a part of
the header, as backend.Load(...) is not guaranteed to return as many
bytes as requested by the length parameter.
This is inspired by the circuit breaker pattern used for distributed
systems. If too many requests fails, then it is better to immediately
fail new requests for a limited time to give the backend time to
recover.
By only forgetting a file in the cache at most once, we can ensure that
a broken file is only retrieved once again from the backend. If the file
stored there is broken, previously it would be cached and deleted
continuously. Now, it is retrieved only once again, all later requests
just use the cached copy and either succeed or fail immediately.
A file is always cached whole. Thus, any out of bounds access will also
fail when directed at the backend. To handle case in which the cached
file is broken, then caller must call Cache.Forget(h) for the file in
question.
LoadRaw also includes improved context cancellation handling similar to the
implementation in repository.LoadUnpacked.
The removed cache backend test will be added again later on.
If a file exhausts its retry attempts, then it is likely not accessible
the next time. Thus, immediately fail all load calls for that file to
avoid useless retries.