Problem description
I have an S3 bucket with around 4 million files, about 500 GB in total. I need to sync the files to a new bucket. (Changing the name of the bucket would actually suffice, but since that is not possible, I need to create a new bucket, move the files there, and remove the old one.)
I'm using the AWS CLI's s3 sync command. It does the job, but it takes a lot of time. I would like to reduce that time so that downtime for the dependent system is minimal.
I tried running the sync both from my local machine and from an EC2 c4.xlarge instance, and there isn't much difference in the time taken.
I have noticed that the time taken can be reduced somewhat when I split the job into multiple batches using the --exclude and --include options and run them in parallel from separate terminal windows, i.e.:
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*"
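The same batches can also be launched from a single shell as background jobs instead of separate terminal windows. The following is a minimal sketch, not a tuned solution: the bucket names and prefix patterns are taken from the commands above, while the `run` and `sync_batches` helpers and the `DRY_RUN` switch are hypothetical names introduced here so the generated commands can be inspected before anything is copied.

```shell
# Buckets as used in the commands above.
SRC="s3://source-bucket"
DST="s3://destination-bucket"

# Hypothetical switch: 1 (default) only prints the commands, 0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# Print or execute one command, depending on DRY_RUN.
# Note: the printed form drops the shell quoting around the patterns.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Start one background sync per prefix pattern, plus a catch-all sync
# for every key that none of the patterns matched.
sync_batches() {
  for pattern in "$@"; do
    run aws s3 sync "$SRC" "$DST" --exclude "*" --include "$pattern" &
  done

  # Rewrite the argument list: each pattern becomes "--exclude <pattern>".
  for pattern in "$@"; do
    shift
    set -- "$@" --exclude "$pattern"
  done
  run aws s3 sync "$SRC" "$DST" "$@" &

  # Block until all background jobs have finished. A bare wait does not
  # report whether an individual job failed.
  wait
}
```

Calling `sync_batches "1?/*" "2?/*" "3?/*" "4?/*"` prints the five commands; running it with `DRY_RUN=0` executes them in parallel. How many batches help is something to measure rather than assume, since the sketch makes no attempt to pick an optimal number.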
Is there anything else I can do to speed up the sync even more? Is another type of EC2 instance more suitable for the job? Is splitting the job into multiple batches a good idea, and is there something like an 'optimal' number of sync processes that can run in parallel on the same bucket?
Update
I'm leaning towards the strategy of syncing the buckets before taking the system down, doing the migration, and then syncing the buckets again to copy only the small number of files that changed in the meantime. However, running the same sync command takes a lot of time even on buckets with no differences.
Recommended answer
You can use EMR and S3DistCp (s3-dist-cp). I had to sync 153 TB between two buckets, and this took about 9 days. Also make sure the buckets are in the same region; otherwise you also get hit with data transfer costs.
aws emr add-steps --cluster-id <value> --steps Type=CUSTOM_JAR,Name="Command Runner",Jar="command-runner.jar",Args=["s3-dist-cp","--s3Endpoint=s3.amazonaws.com","--src=s3://BUCKETNAME","--dest=s3://BUCKETNAME"]
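To act on the same-region advice before starting a long copy, the bucket locations can be compared first. This is a rough sketch: `aws s3api get-bucket-location` is a real CLI call, but the `bucket_region` and `same_region` helpers are hypothetical names introduced here, and the bucket names in the usage note are placeholders.

```shell
# Print the region a bucket lives in. With --output text, a null
# LocationConstraint is printed as "None", which means us-east-1.
# Legacy values such as "EU" are passed through unchanged.
bucket_region() {
  region="$(aws s3api get-bucket-location --bucket "$1" \
    --query LocationConstraint --output text)" || return 1
  case "$region" in
    None|null|"") echo "us-east-1" ;;
    *) echo "$region" ;;
  esac
}

# Succeed (exit 0) only when both buckets report the same region.
# Exit status 2 means a lookup failed.
same_region() {
  src_region="$(bucket_region "$1")" || return 2
  dst_region="$(bucket_region "$2")" || return 2
  [ "$src_region" = "$dst_region" ]
}
```

For example, `same_region source-bucket destination-bucket || echo "buckets are in different regions"` flags a cross-region copy before any data is transferred.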