I am using `gcloud storage cp` to transfer a large amount of data from a source bucket to a destination bucket.

I am using the `--no-clobber` option to skip files that have already been copied:

```
gcloud storage cp -r --no-clobber "gs://test-1/*" "gs://test-2" \
    --encryption-key=XXXXXXXXXXXXXXXX --storage-class=REGIONAL
```

One of the challenges is that I am moving terabytes of data (all files are kilobytes in size) from one bucket to another, and the source bucket is encrypted with CSEK (customer-supplied encryption keys).

GCP's Storage Transfer Service doesn't work for buckets encrypted with CSEK.

Since I know this will take a long time, I will run the process on long-running VMs. In case of intermittent network or zonal failures, we might have to restart the `gcloud storage cp` command.
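For now, the simplest restart strategy I have is a blind retry wrapper; a minimal sketch, assuming the command exits non-zero on failure and that `CSEK_KEY` is a placeholder for my key:

```bash
#!/usr/bin/env bash
# Re-run the copy until it exits cleanly; --no-clobber makes each
# retry skip objects that were already copied successfully.
until gcloud storage cp -r --no-clobber \
    --encryption-key="${CSEK_KEY}" \
    --storage-class=REGIONAL \
    "gs://test-1/*" "gs://test-2"; do
  echo "copy interrupted; retrying in 30s..." >&2
  sleep 30
done
```

The problem is that every retry still pays the existence check for every object, which is exactly the cost described below.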

For example, copying from gs://test-1 to gs://test-2 took ~7.35 hours (837,136 files, 3.5 GiB total) from my local machine (Apple MacBook Pro M1 with 32 GB RAM). The time taken was relatively high, which may be due to the overhead of encryption and decryption in the cloud.
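That works out to roughly 837,136 / (7.35 × 3600) ≈ 32 objects per second, so the per-object overhead, not the byte volume, dominates.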

With `--no-clobber`, the tool still makes a call per object to check whether it already exists in the destination bucket. That is a Class B operation, and it adds up when millions of objects are re-checked on every retry.

Class B operations:

- `storage.*.get`
- `storage.*.getIamPolicy`
- `storage.*.testIamPermissions`
- `storage.*AccessControls.list`
- `storage.notifications.list`
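To put a number on it: assuming the Standard-storage Class B rate of $0.004 per 10,000 operations, re-checking my 837,136 test objects costs about $0.33 per restart, and re-checking a billion objects costs roughly $400 per restart, before a single byte is copied.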

I checked the manifest-file mechanism, but it didn't work in my case for buckets with CSEK. If a manifest file could skip files directly, that would be fantastic.

https://cloud.google.com/storage-transfer/docs/manifest
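For reference, as I understand the linked documentation, a minimal manifest is just a CSV of object paths relative to the source bucket (the file names below are placeholders):

```
folder1/object1.txt
folder2/object2.txt
```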

Is there a way to store an offset and continue from it next time, instead of first checking every object for existence?
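The closest workaround I can think of is to build the list of remaining objects once, by diffing listings, and then copy only those. A rough sketch, assuming `gcloud storage cp -I` (`--read-paths-from-stdin`) reads source URLs from stdin the way `gsutil cp -I` does, and that `CSEK_KEY` is a placeholder for my key:

```bash
#!/usr/bin/env bash
# List both buckets once; listing is paginated (up to 1,000 objects per
# call), so two listings are far cheaper than one Class B GET per object.
gcloud storage ls "gs://test-1/**" | sed 's|^gs://test-1/||' | sort > /tmp/src.txt
gcloud storage ls "gs://test-2/**" | sed 's|^gs://test-2/||' | sort > /tmp/dst.txt

# Object names present in the source but not yet in the destination.
comm -23 /tmp/src.txt /tmp/dst.txt | sed 's|^|gs://test-1/|' > /tmp/remaining.txt

# Copy only the missing objects, reading source URLs from stdin.
# Caveat: cp names each object <dest>/<basename>, so deeply nested paths
# would need per-object destination URLs instead.
gcloud storage cp -I \
    --encryption-key="${CSEK_KEY}" \
    --storage-class=REGIONAL \
    "gs://test-2" < /tmp/remaining.txt
```

Even so, this is bookkeeping on my side; a built-in offset/checkpoint in `gcloud storage cp` would be much nicer.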

SRJ
  • You can use the `gsutil` command-line tool's `rsync` command, which can resume an interrupted transfer from where it left off: [rsync - Synchronize content of two buckets/directories](https://cloud.google.com/storage/docs/gsutil/commands/rsync). Instead of checking whether each item exists in the target bucket, `gsutil rsync` transfers only the files that have changed. – Chanpols Apr 07 '23 at 17:38
  • Thanks @Chanpols, your suggestion is great; however, I am copying billions of files, so I need a faster solution, and `gsutil rsync` spends a lot of time building the synchronization list before the sync starts. `gcloud storage cp` is the fastest way to copy, as claimed by Google. – SRJ Apr 07 '23 at 17:57
  • In this case, you might want to consider using Storage Transfer Service instead. It can transfer data between buckets and handles large amounts of data efficiently. Reference: https://cloud.google.com/storage-transfer/docs/create-transfers – Chanpols Apr 07 '23 at 18:09
  • Thank you @Chanpols again. As mentioned in my question, I already tried Storage Transfer Service, but it doesn't support buckets with CSEK. – SRJ Apr 07 '23 at 18:39

0 Answers