Problem Description
I got a big multipart compressed CSV file using the RAR utility (100GB uncompressed, 20GB compressed), so I have 100 RAR file parts that were uploaded to Google Cloud Storage. I need to extract it to Google Cloud Storage. It would be best if I could use Python on GAE. Any ideas? I don't want to download, extract, and upload; I want to do it all in the cloud.
Recommended Answer
There's no way to directly decompress/extract your RAR file in the cloud. Are you aware of the gsutil -m (multithreading/multiprocessing) option? It speeds up transfers by running them in parallel. I'd suggest this sequence:
- download the compressed archive files
- unpack them locally
- upload the unpacked files in parallel using gsutil -m cp file-pattern dest-bucket (a shell sketch of the whole sequence follows this list)
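As a rough illustration of those three steps, here is a minimal shell sketch. The bucket names (my-src-bucket, my-dest-bucket) and the archive naming pattern (archive.part*.rar) are assumptions for the example only; substitute your own object names.

```bash
#!/bin/bash
# Minimal sketch of the download -> unpack -> upload sequence.
# Bucket names and archive file names below are placeholders.
set -euo pipefail

# 1. Download all RAR parts in parallel (-m) from the source bucket.
gsutil -m cp "gs://my-src-bucket/archive.part*.rar" .

# 2. Unpack locally; pointing unrar at the first part extracts the
#    whole multi-volume archive into ./extracted/.
mkdir -p extracted
unrar x archive.part001.rar extracted/

# 3. Upload the extracted CSV files back to Cloud Storage in parallel.
gsutil -m cp extracted/* gs://my-dest-bucket/
```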
Unless you have a very slow network connection, 20GB should not take very long (well under an hour, I'd expect), and likewise for the parallel upload (though that's a function of how much parallelism you get, which in turn depends on the size of the archive files).
Btw, you can tune the parallelism used by gsutil -m via the parallel_thread_count and parallel_process_count variables in your $HOME/.boto file.
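For reference, those two variables live in the [GSUtil] section of the boto configuration file; a minimal excerpt might look like the following (the values shown are illustrative placeholders, not recommendations):

```
# Excerpt of $HOME/.boto -- values are illustrative placeholders.
[GSUtil]
# Number of processes gsutil -m may spawn for parallel operations.
parallel_process_count = 4
# Number of worker threads per process.
parallel_thread_count = 10
```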