From 506342317be45aefaaaed1223953a01575a24702 Mon Sep 17 00:00:00 2001
From: Nick Craig-Wood
Date: Thu, 26 Nov 2020 15:00:10 +0000
Subject: [PATCH] s3: update docs with a Reducing Costs section - Fixes #2889

---
 docs/content/s3.md | 100 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 81 insertions(+), 19 deletions(-)

diff --git a/docs/content/s3.md b/docs/content/s3.md
index 2895b4e16..0312195cd 100644
--- a/docs/content/s3.md
+++ b/docs/content/s3.md
@@ -248,25 +248,6 @@ d) Delete this remote
 y/e/d>
 ```
 
-### --fast-list ###
-
-This remote supports `--fast-list` which allows you to use fewer
-transactions in exchange for more memory. See the [rclone
-docs](/docs/#fast-list) for more details.
-
-### --update and --use-server-modtime ###
-
-As noted below, the modified time is stored on metadata on the object. It is
-used by default for all operations that require checking the time a file was
-last updated. It allows rclone to treat the remote more like a true filesystem,
-but it is inefficient because it requires an extra API call to retrieve the
-metadata.
-
-For many operations, the time the object was last uploaded to the remote is
-sufficient to determine if it is "dirty". By using `--update` along with
-`--use-server-modtime`, you can avoid the extra API call and simply upload
-files whose local modtime is newer than the time it was last uploaded.
-
 ### Modified time ###
 
 The modified time is stored as metadata on the object as
@@ -280,6 +261,87 @@ storage the object will be uploaded rather than copied.
 Note that reading this from the object takes an additional `HEAD` request
 as the metadata isn't returned in object listings.
 
+### Reducing costs
+
+#### Avoiding HEAD requests to read the modification time
+
+By default rclone will use the modification time of objects stored in
+S3 for syncing. This is stored in object metadata which unfortunately
+takes an extra HEAD request to read, which can be expensive (in time
+and money).
+
+The modification time is used by default for all operations that
+require checking the time a file was last updated. It allows rclone to
+treat the remote more like a true filesystem, but it is inefficient on
+S3 because it requires an extra API call to retrieve the metadata.
+
+The extra API calls can be avoided when syncing (using `rclone sync`
+or `rclone copy`) in a few different ways, each with its own
+tradeoffs.
+
+- `--size-only`
+    - Only checks the size of files.
+    - Uses no extra transactions.
+    - If the file doesn't change size then rclone won't detect it has
+      changed.
+    - `rclone sync --size-only /path/to/source s3:bucket`
+- `--checksum`
+    - Checks the size and MD5 checksum of files.
+    - Uses no extra transactions.
+    - The most accurate detection of changes possible.
+    - Will cause the source to read an MD5 checksum which, if it is a
+      local disk, will cause lots of disk activity.
+    - If the source and destination are both S3 this is the
+      **recommended** flag to use for maximum efficiency.
+    - `rclone sync --checksum /path/to/source s3:bucket`
+- `--update --use-server-modtime`
+    - Uses no extra transactions.
+    - Modification time becomes the time the object was uploaded.
+    - For many operations this is sufficient to determine if it needs
+      uploading.
+    - Using `--update` along with `--use-server-modtime` avoids the
+      extra API call and uploads only files whose local modification
+      time is newer than the time it was last uploaded.
+    - Files created with timestamps in the past will be missed by the
+      sync.
+    - `rclone sync --update --use-server-modtime /path/to/source s3:bucket`
+
+These flags can and should be used in combination with `--fast-list` -
+see below.
+
+If you are using `rclone mount` or any command which uses the VFS (eg
+`rclone serve`) then you might want to consider using the VFS flag
+`--no-modtime` which will stop rclone reading the modification time
+for every object. You could also use `--use-server-modtime` if you are
+happy with the modification times of the objects being the time of
+upload.
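+
+For example, a mount which never reads per-object modification times
+could look like this (the bucket name and mountpoint are placeholders
+- substitute your own):
+
+    rclone mount --no-modtime s3:bucket /path/to/mountpoint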
+
+#### Avoiding GET requests to read directory listings
+
+Rclone's default directory traversal is to process each directory
+individually. This takes one API call per directory. Using the
+`--fast-list` flag will read all info about the objects into
+memory first using a smaller number of API calls (one per 1000
+objects). See the [rclone docs](/docs/#fast-list) for more details.
+
+    rclone sync --fast-list --checksum /path/to/source s3:bucket
+
+`--fast-list` trades off API transactions for memory use. As a rough
+guide rclone uses 1k of memory per object stored, so using
+`--fast-list` on a sync of a million objects will use roughly 1 GB of
+RAM.
+
+If you are only copying a small number of files into a big repository
+then using `--no-traverse` is a good idea. This finds objects directly
+instead of through directory listings. You can do a "top-up" sync very
+cheaply by using `--max-age` and `--no-traverse` to copy only recent
+files, eg
+
+    rclone copy --max-age 24h --no-traverse /path/to/source s3:bucket
+
+You'd then do a full `rclone sync` less often.
+
+Note that `--fast-list` isn't required in the top-up sync.
+
 ### Hashes ###
 
 For small objects which weren't uploaded as multipart uploads (objects