


Using mrjob to run Python code on Amazon's Elastic MapReduce, I have found a way to upgrade the EMR image's numpy and scipy.


Running the following commands from the console works:

    tar -cvf py_bundle.tar mymain.py Utils.py numpy-1.6.1.tar.gz scipy-0.9.0.tar.gz

    gzip py_bundle.tar

    python my_mapper.py -r emr --python-archive py_bundle.tar.gz --bootstrap-python-package numpy-1.6.1.tar.gz --bootstrap-python-package scipy-0.9.0.tar.gz > output.txt


This successfully bootstraps the newer numpy and scipy into the image and works perfectly. My question is about speed: the installation takes 21 minutes on a small instance.


Does anyone have any idea how to speed up the process of upgrading numpy and scipy?



The only way to do anything to an EMR image is by using bootstrap actions. Doing this from the console means you'll only change the master node and not the task nodes which do the processing. Bootstrap actions run once at startup on all nodes and can be a simple script that gets shell exec'd.

    elastic-mapreduce --create --bootstrap-action "s3://bucket/path/to/script" ...


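To speed up changes to the EMR image, tar up the files as they exist after installation and upload the archive to S3. Then use a bootstrap action to download and unpack it on each node, so the nodes do not have to rebuild numpy and scipy from source at startup. You will have to keep separate archives for 32-bit (micro, small, medium) and 64-bit machines.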
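As a rough illustration of the "tar up the post-installed files" step, here is a minimal Python sketch. The archive name `py_installed_<bits>.tar.gz` and the helpers `arch_tag` and `bundle_installed` are made up for this example. It assumes the installed packages live in ordinary directories that can simply be copied to the same path on another node of the same architecture.

```python
import os
import platform
import tarfile


def arch_tag(machine=None):
    """Return '64' for 64-bit machine names (e.g. x86_64), '32' otherwise."""
    machine = machine or platform.machine()
    return "64" if machine.endswith("64") else "32"


def bundle_installed(package_dirs, out_dir=".", machine=None):
    """Pack already-installed package directories into one gzipped tar.

    Paths are stored relative to '/', so extracting the archive at '/'
    on another node puts the files back in the same location.
    The archive name is a hypothetical convention, not an EMR requirement.
    """
    name = os.path.join(out_dir, "py_installed_%s.tar.gz" % arch_tag(machine))
    with tarfile.open(name, "w:gz") as tar:
        for path in package_dirs:
            path = os.path.abspath(path)
            # Drop the leading '/' so the archive holds relative paths.
            tar.add(path, arcname=path.lstrip(os.sep))
    return name
```

On a node where the slow install has already finished, one would call it with the package directories, for example `bundle_installed([os.path.dirname(numpy.__file__), os.path.dirname(scipy.__file__)])`, once on a 32-bit instance and once on a 64-bit one, and then upload each resulting file to S3.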


The command to download from S3 in the script is:

    hadoop fs -get s3://bucket/path/to/archive /tmp/archive
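A bootstrap action built around that download command might look roughly like the Python sketch below. It is only an outline under several assumptions: the archive names `py_installed_32.tar.gz` and `py_installed_64.tar.gz` are hypothetical, `s3://bucket/path/to` is the placeholder path from above, the archive is assumed to hold paths relative to `/`, and `sudo` is assumed to be available and needed for writing system directories.

```python
#!/usr/bin/env python
import platform
import subprocess

# Placeholder S3 location from the answer; replace with a real bucket path.
BUCKET_PREFIX = "s3://bucket/path/to"


def archive_name(machine=None):
    """Pick the 32-bit or 64-bit archive (hypothetical file names)."""
    machine = machine or platform.machine()
    bits = "64" if machine.endswith("64") else "32"
    return "py_installed_%s.tar.gz" % bits


def bootstrap_commands(machine=None, tmp_dir="/tmp"):
    """Build the commands: fetch the archive from S3, then unpack it at '/'."""
    name = archive_name(machine)
    local = "%s/%s" % (tmp_dir, name)
    return [
        ["hadoop", "fs", "-get", "%s/%s" % (BUCKET_PREFIX, name), local],
        # Assumes the archive stores paths relative to '/'.
        ["sudo", "tar", "-xzf", local, "-C", "/"],
    ]


def run(commands, runner=subprocess.check_call):
    """Run each command in order; check_call raises if one fails."""
    for cmd in commands:
        runner(cmd)
```

The actual script uploaded to S3 would end with `run(bootstrap_commands())` and be passed to `--bootstrap-action` as shown earlier. Choosing the archive from `platform.machine()` lets the same script serve both 32-bit and 64-bit instance types.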
