This article covers how to handle extracting a RAR archive stored in Google Cloud Storage; the question and recommended answer below should be a useful reference for anyone facing the same problem.

Problem Description

I have a big multipart compressed CSV file created with the RAR utility (100 GB uncompressed, 20 GB compressed), so I have 100 RAR file parts that have been uploaded to Google Cloud Storage. I need to extract it to Google Cloud Storage, ideally using Python on GAE. Any ideas? I don't want to download, extract, and upload; I want to do it all in the cloud.

Recommended Answer

There's no way to decompress/extract your RAR file directly in the cloud. Are you aware of the gsutil -m (multithreading/multiprocessing) option? It speeds up transfers by running them in parallel. I'd suggest this sequence:

  • download the compressed archive files
  • unpack them locally
  • upload the unpacked files in parallel using gsutil -m cp file-pattern dest-bucket (a rough sketch of the whole sequence follows this list)
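
The three steps can be scripted from a local machine. Below is a minimal sketch, assuming the parts are named data.part001.rar through data.part100.rar, that the source and destination buckets are gs://source-bucket and gs://dest-bucket (all of these names are placeholders, not from the original answer), and that gsutil, the Python rarfile package, and an unrar backend are installed:

    # Minimal local sketch of download -> unpack -> parallel re-upload.
    # Bucket names and the data.part*.rar naming scheme are assumptions.
    import os
    import subprocess
    import rarfile

    # 1. Download all archive parts; gsutil -m runs the copies in parallel
    #    and expands the gs:// wildcard itself.
    subprocess.check_call(
        ["gsutil", "-m", "cp", "gs://source-bucket/data.part*.rar", "."])

    # 2. Unpack locally; opening the first volume of a multi-part archive
    #    lets rarfile pick up the remaining volumes automatically.
    os.makedirs("unpacked", exist_ok=True)
    rarfile.RarFile("data.part001.rar").extractall("unpacked/")

    # 3. Upload the extracted files in parallel; gsutil also expands the
    #    local wildcard on its own, so no shell globbing is needed.
    subprocess.check_call(
        ["gsutil", "-m", "cp", "unpacked/*.csv", "gs://dest-bucket/"])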

Unless you have a very slow internet connection, downloading 20 GB should not take very long (well under an hour, I'd expect), and likewise for the parallel upload (though that's a function of how much parallelism you get, which in turn depends on the size of the archive files).
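
As a rough sanity check on that estimate (the 100 Mbit/s link speed below is an assumed figure, not something stated in the answer):

    # Back-of-the-envelope transfer time; the link speed is an assumption.
    size_gb = 20                               # compressed archive size
    link_mbps = 100                            # assumed downlink, in Mbit/s
    minutes = size_gb * 8 * 1000 / link_mbps / 60
    print(f"~{minutes:.0f} minutes")           # about 27 minutes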

By the way, you can tune the parallelism used by gsutil -m via the parallel_thread_count and parallel_process_count variables in your $HOME/.boto file.
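
Those options live in the [GSUtil] section of the boto config file; the values below are purely illustrative:

    [GSUtil]
    # Number of processes and threads per process used by gsutil -m.
    # Tune to your machine and connection; these values are examples only.
    parallel_process_count = 4
    parallel_thread_count = 8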

That concludes this article on extracting RAR files from Google Cloud Storage; hopefully the recommended answer above proves helpful.
