chunker: finish meta-format before release

changes:
- chunker: remove GetTier and SetTier
- remove wdmrcompat metaformat
- remove fastopen strategy
- make hash_type option non-advanced
- adverise hash support when possible
- add metadata field "ver", run strict checks
- describe internal behavior in comments
- improve documentation

note:
wdmrcompat used to write file name in the metadata, so maximum metadata
size was 1K; removing it allows to cap size by 200 bytes now.
This commit is contained in:
Ivan Andreev 2019-09-25 02:18:30 +03:00 committed by Nick Craig-Wood
parent c41812fc88
commit ccecfa9cb1
5 changed files with 303 additions and 312 deletions

View file

@ -4,11 +4,11 @@ description: "Split-chunking overlay remote"
date: "2019-08-30"
---
<i class="fa fa-cut"></i>Chunker
<i class="fa fa-cut"></i>Chunker (BETA)
----------------------------------------
The `chunker` overlay transparently splits large files into smaller chunks
during the upload to wrapped remote and transparently assembles them back
during upload to wrapped remote and transparently assembles them back
when the file is downloaded. This allows to effectively overcome size limits
imposed by storage providers.
@ -41,10 +41,27 @@ Storage> chunker
Remote to chunk/unchunk.
Normally should contain a ':' and a path, eg "myremote:path/to/dir",
"myremote:bucket" or maybe "myremote:" (not recommended).
Enter a string value. Press Enter for the default ("").
remote> remote:path
Files larger than chunk_size will be split in chunks. By default 2 Gb.
Files larger than chunk size will be split in chunks.
Enter a size with suffix k,M,G,T. Press Enter for the default ("2G").
chunk_size> 1G
chunk_size> 100M
Choose how chunker handles hash sums.
Enter a string value. Press Enter for the default ("md5").
Choose a number from below, or type in your own value
/ Chunker can pass any hash supported by wrapped remote
1 | for a single-chunk file but returns nothing otherwise.
\ "none"
2 / MD5 for multi-chunk files. Requires "simplejson".
\ "md5"
3 / SHA1 for multi-chunk files. Requires "simplejson".
\ "sha1"
/ Copying a file to chunker will request MD5 from the source
4 | falling back to SHA1 if unsupported. Requires "simplejson".
\ "md5quick"
5 / Similar to "md5quick" but prefers SHA1 over MD5. Requires "simplejson".
\ "sha1quick"
hash_type> md5
Edit advanced config? (y/n)
y) Yes
n) No
@ -53,8 +70,9 @@ Remote config
--------------------
[overlay]
type = chunker
remote = TestLocal:
chunk_size = 2G
remote = remote:bucket
chunk_size = 100M
hash_type = md5
--------------------
y) Yes this is OK
e) Edit this remote
@ -73,8 +91,8 @@ will put files in a directory called `name` in the current directory.
### Chunking
When rclone starts a file upload, chunker checks the file size.
If it doesn't exceed the configured chunk size, chunker will just pass it
When rclone starts a file upload, chunker checks the file size. If it
doesn't exceed the configured chunk size, chunker will just pass the file
to the wrapped remote. If a file is large, chunker will transparently cut
data in pieces with temporary names and stream them one by one, on the fly.
Each chunk will contain the specified number of data byts, except for the
@ -84,7 +102,7 @@ a temporary copy, record its size and repeat the above process.
When upload completes, temporary chunk files are finally renamed.
This scheme guarantees that operations look from outside as atomic.
A similar method with hidden temporary chunks is used for other operations
(copy/move/rename etc). If operation fails, hidden chunks are normally
(copy/move/rename etc). If an operation fails, hidden chunks are normally
destroyed, and the destination composite file stays intact.
#### Chunk names
@ -94,58 +112,52 @@ By default chunk names are `BIG_FILE_NAME.rclone-chunk.001`,
format is `*.rclone-chunk.###`. You can configure another name format
using the `--chunker-name-format` option. The format uses asterisk
`*` as a placeholder for the base file name and one or more consecutive
hash characters `#` as a placeholder for the chunk number. There must be
one and only one asterisk. The number of consecutive hashes defines the
minimum length of a string representing a chunk number. If a chunk number
has less digits than the number of hashes, it is left-padded by zeros.
If there are more digits in the number, they are left as is.
hash characters `#` as a placeholder for sequential chunk number.
There must be one and only one asterisk. The number of consecutive hash
characters defines the minimum length of a string representing a chunk number.
If decimal chunk number has less digits than the number of hashes, it is
left-padded by zeros. If the number stringis longer, it is left intact.
By default numbering starts from 1 but there is another option that allows
user to start from 0, eg. for compatibility with legacy software.
For example, if name format is `big_*-##.part`, and original file was
named `data.txt` and numbering starts from 0, then the first chunk will be
named `big_data.txt-00.part`, the 99th chunk will be `big_data.txt-98.part`
and the 302nd chunk will be `big_data.txt-301.part`.
For example, if name format is `big_*-##.part` and original file name is
`data.txt` and numbering starts from 0, then the first chunk will be named
`big_data.txt-00.part`, the 99th chunk will be `big_data.txt-98.part`
and the 302nd chunk will become `big_data.txt-301.part`.
Would-be chunk files are ignored if their name does not match given format.
The list command might encounter composite files with missinng or invalid
chunks. By default, if chunker detects a missing chunk it will silently
ignore the whole group. Use the `--chunker-fail-on-bad-chunks` flag
to make it fail with an error message.
When the `list` rclone command scans a directory on wrapped remote, the
potential chunk files are accounted for and merged into composite directory
entries only if their names match the configured format. All other files
are ignored, including temporary chunks.
The list command might encounter composite files with missing or invalid
chunks. If chunker detects a missing chunk it will by default silently
ignore the whole group. You can use the `--chunker-fail-on-bad-chunks`
command line flag to make `list` fail with an error message.
### Metadata
By default when a file is large enough, chunker will create a metadata
object besides data chunks. The object is named after the original file.
Chunker allows to choose between few metadata formats. Please note that
currently metadata is not created for files smaller than configured
chunk size. This may change in future as new formats are developed.
Chunker allows user to disable metadata completely (the `none` format).
Please note that currently metadata is not created for files smaller
than configured chunk size. This may change in future as new formats
are developed.
#### Simple JSON metadata format
This is the default format. It supports hash sums and chunk validation
for composite files. Meta objects carry the following fields:
- `size` - total size of chunks
- `nchunks` - number of chunks
- `md5` - MD5 hashsum (if present)
- `ver` - version of format, currently `1`
- `size` - total size of composite file
- `nchunks` - number of chunks in the file
- `md5` - MD5 hashsum of composite file (if present)
- `sha1` - SHA1 hashsum (if present)
There is no field for composite file name as it's simply equal to the name
of meta object on the wrapped remote. Please refer to respective sections
for detils on hashsums and modified time handling.
#### WedDavMailRu compatible metadata format
The `wdmrcompat` metadata format is only useful to support historical files
created by [WebDriveMailru](https://github.com/yar229/WebDavMailRuCloud).
It keeps the following fields (most are ignored, though):
- `Name` - name of the composite file (always equal to the meta file name)
- `Size` - total size of chunks
- `PublicKey` - ignored, always "null"
- `CreationDate` - last modification (sic!) time, ignored.
for detils on hashsums and handling of modified time.
#### No metadata
@ -161,8 +173,8 @@ errors (especially missing last chunk) than metadata-enabled formats.
### Hashsums
Chunker supports hashsums only when a compatible metadata is present.
Thus, if you choose metadata format of `none` or `wdmrcompat`, chunker
will return `UNSUPPORTED` as hashsum.
Thus, if you choose metadata format of `none`, chunker will return
`UNSUPPORTED` as hashsum.
Please note that metadata is stored only for composite files. If a file
is small (smaller than configured chunk size), chunker will transparently
@ -175,16 +187,16 @@ Currently you can choose one or another but not both.
MD5 is set by default as the most supported type.
Since chunker keeps hashes for composite files and falls back to the
wrapped remote hash for small ones, we advise you to choose the same
hash type as wrapped remote, so your file listings look coherent.
hash type as wrapped remote so that your file listings look coherent.
Normally, when a file is copied to chunker controlled remote, chunker
will ask its source for compatible file hash and revert to on-the-fly
Normally, when a file is copied to a chunker controlled remote, chunker
will ask the file source for compatible file hash and revert to on-the-fly
calculation if none is found. This involves some CPU overhead but provides
a guarantee that given hashsum is available. Also, chunker will reject
a server-side copy or move operation if source and destination hashsum
types are different, resulting in the extra network bandwidth, too.
In some rare cases this may be undesired, so chunker provides two optional
choices: `sha1quick` and `md5quick`. If source does not have the primary
choices: `sha1quick` and `md5quick`. If the source does not support primary
hash type and the quick mode is enabled, chunker will try to fall back to
the secondary type. This will save CPU and bandwidth but can result in empty
hashsums at destination. Beware of consequences: the `sync` command will
@ -215,13 +227,14 @@ chunk naming scheme is to:
hash type, chunk naming etc.
- Now run `rclone sync oldchunks: newchunks:` and all your data
will be transparently converted at transfer.
This may take some time.
This may take some time, yet chunker will try server-side
copy if possible.
- After checking data integrity you may remove configuration section
of the old remote.
If rclone gets killed during a long operation on a big composite file,
hidden temporary chunks may stay in the directory. They will not be
shown by the list command but will eat up your account quota.
shown by the `list` command but will eat up your account quota.
Please note that the `deletefile` rclone command deletes only active
chunks of a file. As a workaround, you can use remote of the wrapped
file system to see them.
@ -234,17 +247,18 @@ remove everything including garbage.
### Caveats and Limitations
Chunker requires wrapped remote to support server side `move` (or `copy` +
delete) operations, otherwise it will explicitly refuse to start.
`delete`) operations, otherwise it will explicitly refuse to start.
This is because it internally renames temporary chunk files to their final
names when an operation completes successfully.
Note that moves done using the copy-and-delete method may incur double
charging with some cloud storage providers.
Note that a move implemented using the copy-and-delete method may incur
double charging with some cloud storage providers.
Chunker will not automatically rename existing chunks when you change the
chunk name format. Beware that in result of this some files which have been
treated as chunks before the change can pop up in directory listings as
normal files and vice versa. The same warning holds for the chunk size.
Chunker will not automatically rename existing chunks when you run
`rclone config` on a live remote and change the chunk name format.
Beware that in result of this some files which have been treated as chunks
before the change can pop up in directory listings as normal files
and vice versa. The same warning holds for the chunk size.
If you desperately need to change critical chunking setings, you should
run data migration as described in a dedicated section.
@ -278,6 +292,28 @@ Files larger than chunk size will be split in chunks.
- Type: SizeSuffix
- Default: 2G
#### --chunker-hash-type
Choose how chunker handles hash sums.
- Config: hash_type
- Env Var: RCLONE_CHUNKER_HASH_TYPE
- Type: string
- Default: "md5"
- Examples:
- "none"
- Chunker can pass any hash supported by wrapped remote
- for a single-chunk file but returns nothing otherwise.
- "md5"
- MD5 for multi-chunk files. Requires "simplejson".
- "sha1"
- SHA1 for multi-chunk files. Requires "simplejson".
- "md5quick"
- Copying a file to chunker will request MD5 from the source
- falling back to SHA1 if unsupported. Requires "simplejson".
- "sha1quick"
- Similar to "md5quick" but prefers SHA1 over MD5. Requires "simplejson".
### Advanced Options
Here are the advanced options specific to chunker (Transparently chunk/split large files).
@ -321,33 +357,6 @@ Metadata is a small JSON file named after the composite file.
- "simplejson"
- Simple JSON supports hash sums and chunk validation.
- It has the following fields: size, nchunks, md5, sha1.
- "wdmrcompat"
- This format brings compatibility with WebDavMailRuCloud.
- It does not support hash sums or validation, most fields are ignored.
- It has the following fields: Name, Size, PublicKey, CreationDate.
- Requires hash type "none".
#### --chunker-hash-type
Choose how chunker handles hash sums.
- Config: hash_type
- Env Var: RCLONE_CHUNKER_HASH_TYPE
- Type: string
- Default: "md5"
- Examples:
- "none"
- Chunker can pass any hash supported by wrapped remote
- for a single-chunk file but returns nothing otherwise.
- "md5"
- MD5 for multi-chunk files. Requires "simplejson".
- "sha1"
- SHA1 for multi-chunk files. Requires "simplejson".
- "md5quick"
- When a file is copied on to chunker, MD5 is taken from its source
- falling back to SHA1 if the source doesn't support it. Requires "simplejson".
- "sha1quick"
- Similar to "md5quick" but prefers SHA1 over MD5. Requires "simplejson".
#### --chunker-fail-on-bad-chunks