Problem description
I'm having a problem using Python on Spark. My application has several dependencies, such as numpy, pandas, and astropy. I cannot use virtualenv to create an environment with all the dependencies, because the nodes in the cluster share no common mount point or filesystem other than HDFS. I am therefore stuck with spark-submit --py-files. I package the contents of site-packages in a ZIP file and submit the job with the --py-files=dependencies.zip option (as suggested in "Easiest way to install Python dependencies on Spark executor nodes?"). However, the nodes in the cluster still do not seem to see the modules inside the archive, and they throw an ImportError such as this one when importing numpy:
File "/path/anonymized/module.py", line 6, in <module>
    import numpy
File "/tmp/pip-build-4fjFLQ/numpy/numpy/__init__.py", line 180, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/add_newdocs.py", line 13, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/__init__.py", line 8, in <module>
    #
File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/type_check.py", line 11, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/core/__init__.py", line 14, in <module>
ImportError: cannot import name multiarray
When I switch to the virtualenv and use the local pyspark shell, everything works fine, so the dependencies are all there. Does anyone know what might cause this problem and how to fix it?
Thanks!
Recommended answer
You can locate all the .py files you need and add their folders to sys.path relative to your script's location. The following snippet illustrates the approach:
import os, sys, inspect

# realpath() will make your script run, even if you symlink it :)
cmd_folder = os.path.realpath(os.path.abspath(os.path.split(inspect.getfile(inspect.currentframe()))[0]))
if cmd_folder not in sys.path:
    sys.path.insert(0, cmd_folder)

# use this if you want to include modules from a subfolder
cmd_subfolder = os.path.realpath(os.path.abspath(os.path.join(os.path.split(inspect.getfile(inspect.currentframe()))[0], "subfolder")))
if cmd_subfolder not in sys.path:
    sys.path.insert(0, cmd_subfolder)

# Info:
# cmd_folder = os.path.dirname(os.path.abspath(__file__)) # DO NOT USE __file__ !!!
# __file__ fails if the script is called in different ways on Windows
# __file__ fails if someone calls os.chdir() beforehand
# sys.argv[0] also fails, because it does not always contain the path
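
One more thing may be worth checking for the ImportError in the question. Python's zipimport mechanism loads only pure-Python files from a ZIP archive; compiled extension modules (.so, .pyd) cannot be imported that way, and numpy's multiarray is such a compiled module. The sketch below lists the compiled files inside an archive, so you can see whether this limitation might apply to dependencies.zip. The function name find_compiled_modules and the suffix list are assumptions made for illustration, not part of Spark or numpy.

```python
import zipfile

# File suffixes of compiled extension modules (an illustrative list,
# not exhaustive); zipimport cannot load such files from an archive.
EXTENSION_SUFFIXES = (".so", ".pyd")


def find_compiled_modules(zip_path):
    """Return the sorted names of compiled extension files in a ZIP archive.

    Hypothetical helper: it only inspects file names and does not try
    to import anything.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.endswith(EXTENSION_SUFFIXES)
        )
```

Calling find_compiled_modules("dependencies.zip") and getting a non-empty list would suggest that those packages need to be installed on the executors (or shipped in another way) rather than imported from the ZIP, while the pure-Python packages in the archive can stay where they are.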