forked from TrueCloudLab/rclone
hasher: backend documentation #5587
parent f102ef2161
commit 0d7426a2dd
5 changed files with 339 additions and 2 deletions
@@ -44,8 +44,9 @@ using local disk.
 
 Virtual backends wrap local and cloud file systems to apply
 [encryption](/crypt/),
-[compression](/compress/)
-[chunking](/chunker/) and
+[compression](/compress/),
+[chunking](/chunker/),
+[hashing](/hasher/) and
 [joining](/union/).
 
 Rclone [mounts](/commands/rclone_mount/) any local, cloud or
@@ -44,6 +44,7 @@ See the following for detailed instructions for
 * [Google Cloud Storage](/googlecloudstorage/)
 * [Google Drive](/drive/)
 * [Google Photos](/googlephotos/)
+* [Hasher](/hasher/) - to handle checksums for other remotes
 * [HDFS](/hdfs/)
 * [HTTP](/http/)
 * [Hubic](/hubic/)
@@ -359,6 +359,10 @@ and may be set in the config file.
 --gphotos-start-year int Year limits the photos to be downloaded to those which are uploaded after the given year (default 2000)
 --gphotos-token string OAuth Access Token as a JSON blob.
 --gphotos-token-url string Token server url.
+--hasher-auto-size SizeSuffix Auto-update checksum for files smaller than this size (disabled by default).
+--hasher-hashes CommaSepList Comma separated list of supported checksum types. (default md5,sha1)
+--hasher-max-age Duration Maximum time to keep checksums in cache (0 = no cache, off = cache forever). (default off)
+--hasher-remote string Remote to cache checksums for (e.g. myRemote:path).
 --hdfs-data-transfer-protection string Kerberos data transfer protection: authentication|integrity|privacy
 --hdfs-encoding MultiEncoder This sets the encoding for the backend. (default Slash,Colon,Del,Ctl,InvalidUtf8,Dot)
 --hdfs-namenode string hadoop name node and port
330
docs/content/hasher.md
Normal file
@@ -0,0 +1,330 @@
---
title: "Hasher"
description: "Better checksums for other remotes"
---

# {{< icon "fa fa-check-double" >}} Hasher (EXPERIMENTAL)

Hasher is a special overlay backend to create remotes which handle
checksums for other remotes. Its main functions include:
- Emulate hash types unimplemented by backends
- Cache checksums to help with slow hashing of large local or (S)FTP files
- Warm up checksum cache from external SUM files

## Getting started

To use Hasher, first set up the underlying remote following the configuration
instructions for that remote. You can also use a local pathname instead of
a remote. Check that your base remote is working.

Let's call the base remote `myRemote:path` here. Note that anything inside
`myRemote:path` will be handled by hasher and anything outside won't.
This means that if you are using a bucket based remote (S3, B2, Swift)
then you should put the bucket in the remote `s3:bucket`.

Now proceed to interactive or manual configuration.

### Interactive configuration

Run `rclone config`:
```
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n
name> Hasher1
Type of storage to configure.
Choose a number from below, or type in your own value
[snip]
XX / Handle checksums for other remotes
   \ "hasher"
[snip]
Storage> hasher
Remote to cache checksums for, like myremote:mypath.
Enter a string value. Press Enter for the default ("").
remote> myRemote:path
Comma separated list of supported checksum types.
Enter a string value. Press Enter for the default ("md5,sha1").
hashes> md5
Maximum time to keep checksums in cache. 0 = no cache, off = cache forever.
max_age> off
Edit advanced config? (y/n)
y) Yes
n) No
y/n> n
Remote config
--------------------
[Hasher1]
type = hasher
remote = myRemote:path
hashes = md5
max_age = off
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y
```

### Manual configuration

Run `rclone config file` to see the path of the currently active config file,
usually `YOURHOME/.config/rclone/rclone.conf`.
Open it in your favorite text editor, find the section for the base remote
and create a new section for hasher as in the following examples:

```
[Hasher1]
type = hasher
remote = myRemote:path
hashes = md5
max_age = off

[Hasher2]
type = hasher
remote = /local/path
hashes = dropbox,sha1
max_age = 24h
```

Hasher takes the following basic parameters:
- `remote` is required,
- `hashes` is a comma separated list of supported checksums
  (by default `md5,sha1`),
- `max_age` - maximum time to keep a checksum value in the cache,
  `0` will disable caching completely,
  `off` will cache "forever" (that is until the files get changed).

Make sure the `remote` contains a `:` (colon). If you specify the remote without
a colon then rclone will use a local directory of that name. So if you use
a remote of `/local/path` then rclone will handle hashes for that directory.
If you use `remote = name` literally then rclone will put files
**in a directory called `name` located under current directory**.
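
As a rough illustration of the three `max_age` cases listed above, here is a
minimal Python sketch. It is not rclone's implementation: the function name
`is_fresh` and its arguments are hypothetical, `None` stands in for `off`
and a zero `timedelta` stands in for `0`.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_fresh(stored_at: datetime, now: datetime,
             max_age: Optional[timedelta]) -> bool:
    """Return True if a cached checksum may still be used (hypothetical model).

    max_age=None models `off`: entries never expire by age.
    max_age=timedelta(0) models `0`: caching is disabled.
    """
    if max_age is None:
        return True
    if max_age == timedelta(0):
        return False
    return now - stored_at <= max_age
```

With `max_age = 24h`, an entry stored 23 hours ago would still be used in this
model, while one stored 25 hours ago would be recalculated.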

## Usage

### Basic operations

Now you can use it as `Hasher2:subdir/file` instead of the base remote.
Hasher will transparently update the cache with new checksums when a file
is fully read or overwritten, like:
```
rclone copy External:path/file Hasher:dest/path

rclone cat Hasher:path/to/file > /dev/null
```

The way to refresh **all** cached checksums (even those unsupported by the base backend)
for a subtree is to **re-download** all files in the subtree. For example,
use `hashsum --download` with **any** supported hashsum on the command line
(the point is only to re-read the files):
```
rclone hashsum MD5 --download Hasher:path/to/subtree > /dev/null

rclone backend dump Hasher:path/to/subtree
```

You can print or drop the hashsum cache using custom backend commands:
```
rclone backend dump Hasher:dir/subdir

rclone backend drop Hasher:
```

### Pre-Seed from a SUM File

Hasher provides two backend commands for this: the generic SUM file `import`
and the faster but less consistent `stickyimport`.

```
rclone backend import Hasher:dir/subdir SHA1 /path/to/SHA1SUM [--checkers 4]
```

Instead of SHA1 it can be any hash supported by the remote. The last argument
can point to either a local or an `other-remote:path` text file in SUM format.
The command will parse the SUM file, then walk down the path given by the
first argument, snapshot current fingerprints and fill in the cache entries
correspondingly.
- Paths in the SUM file are treated as relative to `Hasher:dir/subdir`.
- The command will **not** check that supplied values are correct.
  You **must know** what you are doing.
- This is a one-time action. The SUM file will not get "attached" to the
  remote. Cache entries can still be overwritten later, should the object's
  fingerprint change.
- The tree walk can take a long time depending on the tree size. You can increase
  `--checkers` to make it faster. Or use `stickyimport` if you don't care
about fingerprints and consistency.
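
For reference, SUM files are assumed here to follow the layout printed by
tools such as `md5sum` or `sha1sum`: a hex digest, a separator, then a path.
The Python sketch below parses such text into a path-to-digest map.
`parse_sum_lines` is a hypothetical helper for illustration only and is not
the parser rclone uses.

```python
from typing import Dict


def parse_sum_lines(text: str) -> Dict[str, str]:
    """Map relative path -> lowercase hex digest from md5sum-style text."""
    sums: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        digest, sep, name = line.partition(" ")
        if not sep or not name:
            continue  # not a "<digest> <name>" line
        # After the first space comes a one-character mode flag:
        # " " for text mode or "*" for binary mode.
        if name[0] in " *":
            name = name[1:]
        sums[name] = digest.lower()
    return sums
```

Each resulting path would then be interpreted relative to the directory given
as the first argument of `import`.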

```
rclone backend stickyimport hasher:path/to/data sha1 remote:/path/to/sum.sha1
```

`stickyimport` is similar to `import` but works much faster because it
does not need to stat existing files and skips the initial tree walk.
Instead of binding cache entries to file fingerprints it creates _sticky_
entries bound to the file name alone, ignoring size, modification time, etc.
Such hash entries can be replaced only by `purge`, `delete`, `backend drop`
or by full re-read/re-write of the files.
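
The difference between the two kinds of entries can be pictured with the
following Python sketch. The `Entry` class, the use of `None` to mark a sticky
entry and the `lookup` function are all assumptions made for illustration;
the real storage format may differ.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    digest: str
    # Fingerprint recorded at import time; None marks a sticky entry
    # (a hypothetical encoding chosen for this sketch).
    fingerprint: Optional[str]


def lookup(cache: Dict[str, Entry], path: str, current_fp: str) -> Optional[str]:
    """Return the cached digest for path, or None if it cannot be trusted."""
    entry = cache.get(path)
    if entry is None:
        return None
    if entry.fingerprint is None:
        return entry.digest  # sticky: bound to the file name alone
    if entry.fingerprint == current_fp:
        return entry.digest  # regular: fingerprint still matches
    return None
```

In this model a sticky entry keeps answering even after the file's size or
modification time changes, which is why it trades consistency for speed.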

## Configuration reference

{{< rem autogenerated options start" - DO NOT EDIT - instead edit fs.RegInfo in backend/hasher/hasher.go then run make backenddocs" >}}
### Standard Options

Here are the standard options specific to hasher (Better checksums for other remotes).

#### --hasher-remote

Remote to cache checksums for (e.g. myRemote:path).

- Config: remote
- Env Var: RCLONE_HASHER_REMOTE
- Type: string
- Default: ""

#### --hasher-hashes

Comma separated list of supported checksum types.

- Config: hashes
- Env Var: RCLONE_HASHER_HASHES
- Type: CommaSepList
- Default: md5,sha1

#### --hasher-max-age

Maximum time to keep checksums in cache (0 = no cache, off = cache forever).

- Config: max_age
- Env Var: RCLONE_HASHER_MAX_AGE
- Type: Duration
- Default: off

### Advanced Options

Here are the advanced options specific to hasher (Better checksums for other remotes).

#### --hasher-auto-size

Auto-update checksum for files smaller than this size (disabled by default).

- Config: auto_size
- Env Var: RCLONE_HASHER_AUTO_SIZE
- Type: SizeSuffix
- Default: 0

### Backend commands

Here are the commands specific to the hasher backend.

Run them with

    rclone backend COMMAND remote:

The help below will explain what arguments each command takes.

See [the "rclone backend" command](/commands/rclone_backend/) for more
info on how to pass options and arguments.

These can be run on a running backend using the rc command
[backend/command](/rc/#backend/command).

#### drop

Drop cache

    rclone backend drop remote: [options] [<arguments>+]

Completely drop checksum cache.
Usage Example:
    rclone backend drop hasher:


#### dump

Dump the database

    rclone backend dump remote: [options] [<arguments>+]

Dump cache records covered by the current remote

#### fulldump

Full dump of the database

    rclone backend fulldump remote: [options] [<arguments>+]

Dump all cache records in the database

#### import

Import a SUM file

    rclone backend import remote: [options] [<arguments>+]

Amend hash cache from a SUM file and bind checksums to files by size/time.
Usage Example:
    rclone backend import hasher:subdir md5 /path/to/sum.md5


#### stickyimport

Perform fast import of a SUM file

    rclone backend stickyimport remote: [options] [<arguments>+]

Fill hash cache from a SUM file without verifying file fingerprints.
Usage Example:
    rclone backend stickyimport hasher:subdir md5 remote:path/to/sum.md5


{{< rem autogenerated options stop >}}

## Implementation details (advanced)

This section explains how various rclone operations work on a hasher remote.

**Disclaimer. This section describes the current implementation, which can
change in future rclone versions!**

### Hashsum command

The `rclone hashsum` (or `md5sum` or `sha1sum`) command will:

1. if the requested hash is supported by the lower level, just pass it through.
2. if object size is below `auto_size` then download object and calculate
   _requested_ hashes on the fly.
3. if unsupported and the size is big enough, build object `fingerprint`
   (including size, modtime if supported, first-found _other_ hash if any).
4. if the strict match is found in cache for the requested remote, return
   the stored hash.
5. if remote found but fingerprint mismatched, then purge the entry and
   proceed to step 6.
6. if remote not found or had no requested hash type or after step 5:
   download object, calculate all _supported_ hashes on the fly and store
in cache; return requested hash.
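
The steps above can be sketched in Python roughly as follows. This is an
illustrative model only: `hashsum`, `base_hash`, `fingerprint`,
`download_and_hash` and the in-memory `cache` dictionary are hypothetical
stand-ins. Whether step 2 also stores its result in the cache is not stated
above, so the sketch leaves the cache untouched there.

```python
from typing import Callable, Dict, Optional, Tuple

# path -> (fingerprint, {hash_type: hex digest}); a stand-in for the database
Cache = Dict[str, Tuple[str, Dict[str, str]]]


def hashsum(path: str, hash_type: str, size: int, auto_size: int,
            base_hash: Callable[[str, str], Optional[str]],
            fingerprint: Callable[[str], str],
            download_and_hash: Callable[[str], Dict[str, str]],
            cache: Cache) -> str:
    # Step 1: the base backend supports the hash, pass it through.
    native = base_hash(path, hash_type)
    if native is not None:
        return native
    # Step 2: small objects are simply downloaded and hashed.
    if size < auto_size:
        return download_and_hash(path)[hash_type]
    # Step 3: build the fingerprint of the current object.
    fp = fingerprint(path)
    entry = cache.get(path)
    if entry is not None:
        cached_fp, hashes = entry
        # Step 4: strict match, return the stored hash.
        if cached_fp == fp and hash_type in hashes:
            return hashes[hash_type]
        # Step 5: fingerprint mismatch, purge the stale entry.
        if cached_fp != fp:
            del cache[path]
    # Step 6: download, calculate all supported hashes, store, return.
    hashes = download_and_hash(path)
    cache[path] = (fp, hashes)
    return hashes[hash_type]
```

The callables are passed in so that the decision flow can be read on its own,
independently of how a real backend lists, downloads or fingerprints objects.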

### Other operations

- whenever a file is uploaded or downloaded **in full**, capture the stream
  to calculate all supported hashes on the fly and update database
- server-side `move` will update keys of existing cache entries
- `deletefile` will remove a single cache entry
- `purge` will remove all cache entries under the purged path

Note that setting `max_age = 0` will disable checksum caching completely.

If you set `max_age = off`, checksums in cache will never age, unless you
fully rewrite or delete the file.
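
Capturing a stream to compute several checksums in one pass, as in the first
item of the list above, might look like this in Python. `copy_and_hash` is a
hypothetical helper built on the standard `hashlib` module; rclone itself is
written in Go and will differ in detail.

```python
import hashlib
import io
from typing import BinaryIO, Dict, Iterable


def copy_and_hash(src: BinaryIO, dst: BinaryIO, hash_names: Iterable[str],
                  chunk_size: int = 64 * 1024) -> Dict[str, str]:
    """Copy src to dst in full, feeding every chunk to all hashers."""
    hashers = {name: hashlib.new(name) for name in hash_names}
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        for hasher in hashers.values():
            hasher.update(chunk)
    # The digests are only valid because the stream was read to the end.
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


if __name__ == "__main__":
    sums = copy_and_hash(io.BytesIO(b"hello"), io.BytesIO(), ["md5", "sha1"])
    print(sums)
```

A partial read would leave the hashers with incomplete input, which matches
the requirement that the file be transferred **in full** before the cache is
updated.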

### Cache storage

Cached checksums are stored as `bolt` database files under the rclone cache
directory, usually `~/.cache/rclone/kv/`. Databases are maintained
one per _base_ backend, named like `BaseRemote~hasher.bolt`.
Checksums for multiple `alias` remotes pointing into a single base backend
will be stored in a single database. All local paths are treated as
aliases into the `local` backend (unless crypted or chunked) and stored
in `~/.cache/rclone/kv/local~hasher.bolt`.
Databases can be shared between multiple rclone processes.
@@ -77,6 +77,7 @@
 <a class="dropdown-item" href="/googlecloudstorage/"><i class="fab fa-google"></i> Google Cloud Storage</a>
 <a class="dropdown-item" href="/drive/"><i class="fab fa-google"></i> Google Drive</a>
 <a class="dropdown-item" href="/googlephotos/"><i class="fas fa-images"></i> Google Photos</a>
+<a class="dropdown-item" href="/hasher/"><i class="fa fa-check-double"></i> Hasher (better checksums for others)</a>
 <a class="dropdown-item" href="/hdfs/"><i class="fa fa-globe"></i> HDFS (Hadoop Distributed Filesystem)</a>
 <a class="dropdown-item" href="/http/"><i class="fa fa-globe"></i> HTTP</a>
 <a class="dropdown-item" href="/hubic/"><i class="fa fa-space-shuttle"></i> Hubic</a>