Incorrect cache behavior when reloading an image after deletion #7
Reference: TrueCloudLab/distribution#7
When an image that has been deleted is pushed again using the frostfs driver, the cache assumes that the layers of this image are already in the storage and does not upload them. Exactly the same behavior is observed with the driver for the local file system. With the s3 driver, pushing the image again succeeds; however, pulling the image afterwards also fails.
Expected Behavior
It is expected that the following sequence of operations will work without errors:
Current Behavior
Currently, on the second push (step 3 from the list above), the cache assumes that the image layers already exist and does not upload them:
As a result, a later attempt to pull the image fails with an error:
Steps to Reproduce (for bugs)
Configuration and startup instructions
The driver configuration block for running distribution with frostfs (frostfs-dev-env):
The driver configuration block for running distribution with s3 (frostfs-aio):
The driver configuration block for running distribution with local file system:
How to start the registry:
Observations that have been identified.
1. Errors in the logs.
During the first push of the image, two similar errors always appear in the logs. They appear both with the frostfs driver and with the s3 driver. The error looks like this:
This error occurs because the registry at some point requests a block that does not exist. It does this for the first time at the very beginning of the image push operation. Why it does this is unclear. It does not seem to affect further work.
2. The state of the objects.
Each docker image in the distribution repository is a set of files (objects for frostfs). Brief information on the types of these files:
- `.../blobs/sha256/4a/4abcf.../data` - data for a specific image layer.
- `.../_layers/sha256/4abcf.../link` - a link to a layer in the repository. It ties a layer to a specific image or tag in the repository, which makes it possible to manage the relationship between layers and images.
- `.../_manifests/revisions/sha256/6457d.../link` - a link to a specific revision (version) of the image in the repository.
- `.../_manifests/tags/latest/index/sha256/6457d.../link` and `.../_manifests/tags/latest/current/link` - these point to the current image associated with the latest tag.

Consider how the state of the objects stored in frostfs changes while an alpine image is pushed and deleted:
What interesting observations and conclusions have been made:
3. Why the problem is related to the cache.
If you disable the cache in the config:
Then everything works correctly with both the frostfs driver and the s3 driver.
Fact: there is no obvious way to clear the cache other than restarting the registry (since the cache is in-memory).
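The effect can be illustrated with a toy model. This is only a sketch under stated assumptions, not the registry's actual code: the `registry`, `store` and `scenario` names are hypothetical, the digest is a placeholder, and the sequence push, delete, push, pull is the one implied by the description above. Deletion removes the blob from the store without touching the in-memory cache, so the second push trusts the stale cache entry and skips the upload.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is returned when a blob is missing from the backing store.
var errNotFound = errors.New("blob not found")

// store is a toy stand-in for a storage driver (frostfs, s3, filesystem).
type store map[string][]byte

// registry is a toy model of a registry with an optional in-memory cache
// that remembers which digests have already been uploaded.
type registry struct {
	blobs store
	cache map[string]bool // nil means the cache is disabled
}

func newRegistry(useCache bool) *registry {
	r := &registry{blobs: store{}}
	if useCache {
		r.cache = map[string]bool{}
	}
	return r
}

// exists models the "is this layer already here?" check made before an
// upload. With the cache enabled, a hit is trusted without asking the store.
func (r *registry) exists(digest string) bool {
	if r.cache != nil && r.cache[digest] {
		return true
	}
	_, ok := r.blobs[digest]
	return ok
}

// push uploads the blob only if exists reports it missing.
// It returns true when data was actually written.
func (r *registry) push(digest string, data []byte) bool {
	if r.exists(digest) {
		return false
	}
	r.blobs[digest] = data
	if r.cache != nil {
		r.cache[digest] = true
	}
	return true
}

// garbageCollect removes the blob from the store only. Like an external
// process, it cannot reach the registry's in-memory cache.
func (r *registry) garbageCollect(digest string) {
	delete(r.blobs, digest)
}

// pull reads the blob straight from the store.
func (r *registry) pull(digest string) ([]byte, error) {
	data, ok := r.blobs[digest]
	if !ok {
		return nil, errNotFound
	}
	return data, nil
}

// scenario runs push, delete, push, pull and returns the error of the pull.
func scenario(useCache bool) error {
	r := newRegistry(useCache)
	const digest = "sha256:example" // placeholder, not a real digest
	r.push(digest, []byte("layer"))
	r.garbageCollect(digest)
	r.push(digest, []byte("layer")) // skipped when the cache is stale
	_, err := r.pull(digest)
	return err
}

func main() {
	fmt.Println("cache enabled: ", scenario(true))
	fmt.Println("cache disabled:", scenario(false))
}
```

In this model the final pull fails only when the cache is enabled, which matches the observation that disabling the cache makes the scenario work.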
4. Assumptions
During operation, the registry calls the methods of the driver interface. One assumption is that the cache may misbehave because one of these methods does not work correctly. The following driver methods are the primary suspects:
GetContent
Stat
List
Walk
A lot depends on the results of these methods, including the operation of the cache. However, no errors were found in them during verification.
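One way to carry out such a verification is a small consistency check: after a delete, none of the read methods should still report the path. The sketch below is an assumption-laden illustration. `miniDriver` is a hypothetical, simplified subset with shortened signatures (the real distribution driver interface differs, and `Walk` is left out for brevity), and `memDriver` is a toy implementation that exists only to exercise the check.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

var errPathNotFound = errors.New("path not found")

// miniDriver is a hypothetical, simplified subset of a storage driver
// interface; the real distribution interface has different signatures.
type miniDriver interface {
	PutContent(path string, data []byte) error
	GetContent(path string) ([]byte, error)
	Stat(path string) (size int64, err error)
	List(dir string) ([]string, error)
	Delete(path string) error
}

// memDriver is a toy in-memory implementation used to exercise the check.
type memDriver struct{ files map[string][]byte }

func (d *memDriver) PutContent(path string, data []byte) error {
	d.files[path] = data
	return nil
}

func (d *memDriver) GetContent(path string) ([]byte, error) {
	data, ok := d.files[path]
	if !ok {
		return nil, errPathNotFound
	}
	return data, nil
}

func (d *memDriver) Stat(path string) (int64, error) {
	data, ok := d.files[path]
	if !ok {
		return 0, errPathNotFound
	}
	return int64(len(data)), nil
}

// List returns every stored path under dir, sorted.
func (d *memDriver) List(dir string) ([]string, error) {
	prefix := strings.TrimSuffix(dir, "/") + "/"
	var out []string
	for p := range d.files {
		if strings.HasPrefix(p, prefix) {
			out = append(out, p)
		}
	}
	sort.Strings(out)
	return out, nil
}

func (d *memDriver) Delete(path string) error {
	if _, ok := d.files[path]; !ok {
		return errPathNotFound
	}
	delete(d.files, path)
	return nil
}

// checkDeleteVisible writes a file, deletes it, and verifies that
// GetContent, Stat and List no longer report the path.
func checkDeleteVisible(d miniDriver, dir, name string) error {
	path := dir + "/" + name
	if err := d.PutContent(path, []byte("x")); err != nil {
		return err
	}
	if err := d.Delete(path); err != nil {
		return err
	}
	if _, err := d.GetContent(path); !errors.Is(err, errPathNotFound) {
		return fmt.Errorf("GetContent still sees %s", path)
	}
	if _, err := d.Stat(path); !errors.Is(err, errPathNotFound) {
		return fmt.Errorf("Stat still sees %s", path)
	}
	entries, err := d.List(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if e == path {
			return fmt.Errorf("List still returns %s", path)
		}
	}
	return nil
}

func main() {
	d := &memDriver{files: map[string][]byte{}}
	fmt.Println(checkDeleteVisible(d, "/blobs", "data"))
}
```

A driver that passes a check of this kind is consistent at the storage level, which would point to the cache layer above the driver rather than to the driver itself.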
The same issue in upstream https://github.com/distribution/distribution/issues/4269
@r.loginov Could you try instead of
this
I get different results when using a tag instead of a digest. Or do we have to use the digest?
Yes, I tried to use deletion using a tag. The result when using the tag is really different, but it is also wrong.
Based on the documentation, digest deletion should be used to delete the image.
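For illustration, deletion by digest goes through the registry HTTP API endpoint `DELETE /v2/<name>/manifests/<digest>`. The following is a minimal sketch that only builds such a request. The helper name `newManifestDeleteRequest`, the base URL `http://localhost:5000` and the all-zero digest are assumptions made for the example, and the sketch accepts only `sha256:` digests to reject tags early.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newManifestDeleteRequest builds DELETE /v2/<repo>/manifests/<digest>.
// It refuses anything that does not look like a sha256 digest, so a tag
// such as "latest" is rejected before a request is created.
func newManifestDeleteRequest(baseURL, repo, digest string) (*http.Request, error) {
	if !strings.HasPrefix(digest, "sha256:") {
		return nil, fmt.Errorf("expected a sha256 digest, got %q", digest)
	}
	url := fmt.Sprintf("%s/v2/%s/manifests/%s",
		strings.TrimSuffix(baseURL, "/"), repo, digest)
	return http.NewRequest(http.MethodDelete, url, nil)
}

func main() {
	// Placeholder digest; a real one comes from the manifest of the image.
	digest := "sha256:" + strings.Repeat("0", 64)

	req, err := newManifestDeleteRequest("http://localhost:5000", "alpine", digest)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The request is only printed here; sending it would require a running
	// registry with deletion enabled.
	fmt.Println(req.Method, req.URL.String())
}
```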
Summary.
The `registry garbage-collect` call cannot update the in-memory cache of the registry application, so the described scenario should not work with any supported driver. It does work for most drivers, except the `inmemory` and `filesystem` drivers. The reason is that these two drivers do not support direct HTTP access to blobs (one, two). Where it is supported, HTTP access is used during the blob HEAD request; it fails and triggers a blob re-upload. If an admin disables redirects, all drivers will fail the described scenario.
The only convenient thing to do is to explicitly disable cache usage for the described scenario.