

I'm spark-submitting a Python file that imports numpy, but I'm getting a "No module named numpy" error.

$ spark-submit --py-files projects/other_requirements.egg projects/jobs/my_numpy_als.py
Traceback (most recent call last):
  File "/usr/local/www/my_numpy_als.py", line 13, in <module>
    from pyspark.mllib.recommendation import ALS
  File "/usr/lib/spark/python/pyspark/mllib/__init__.py", line 24, in <module>
    import numpy
ImportError: No module named numpy

I was thinking I would pull in an egg for numpy via --py-files, but I'm having trouble figuring out how to build that egg. But then it occurred to me that pyspark itself uses numpy, so it would be silly to pull in my own version of numpy.


Any idea on the appropriate thing to do here?



It looks like Spark is using a version of Python that does not have numpy installed. It could be because you are working inside a virtual environment.


import os
import sys

# Specify the Python interpreter that PySpark should use. Here we use
# the interpreter that is running this script.
# This is handy when working in a virtualenv, for example, because
# otherwise Spark would choose the default system Python.
# Set this before the SparkContext is created.
os.environ['PYSPARK_PYTHON'] = sys.executable
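
To confirm that the interpreter really is the problem, a small check along these lines may help. It is only a sketch: `check_module` is a hypothetical helper name, not part of PySpark, and it simply reports which Python executable is running and whether that interpreter can import numpy.

```python
import importlib
import sys


def check_module(module_name="numpy"):
    """Report the running interpreter and whether it can import module_name.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    try:
        importlib.import_module(module_name)
        available = True
    except ImportError:
        available = False
    return {
        "executable": sys.executable,  # path of the Python running this code
        "module": module_name,
        "available": available,
    }


if __name__ == "__main__":
    print(check_module("numpy"))
```

Running this with the same interpreter that Spark launches should show whether numpy is visible to it. Note that with spark-submit the driver's interpreter is chosen before the script starts, so it may also be necessary to export PYSPARK_PYTHON in the shell rather than only setting it inside the script.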


