Problem description
When trying to show a Spark DataFrame (Test), I get a KeyError, as shown below. Probably something went wrong in the function I used before Test.show(3).
The KeyError says: KeyError: 'SPARK_HOME'. I assume SPARK_HOME is not defined on the master and/or the workers. Is there a way to specify the SPARK_HOME directory automatically on both? Preferably by using an initialization action.
Py4JJavaError Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 Test.show(3)
/usr/lib/spark/python/pyspark/sql/dataframe.py in show(self, n, truncate)
255 +---+-----+
256 """
--> 257 print(self._jdf.showString(n, truncate))
258
259 def __repr__(self):
...
raise KeyError(key)
KeyError: 'SPARK_HOME'
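
The KeyError itself is ordinary Python behavior for a missing environment variable. A minimal sketch of the failure mode (an illustration, not the exact pyspark call site):

import os

# Dict-style access raises KeyError when the variable was never
# exported in the environment the Jupyter kernel inherited.
spark_home = os.environ["SPARK_HOME"]   # KeyError: 'SPARK_HOME' if unset

# A tolerant lookup with an explicit fallback; /usr/lib/spark/ is the
# standard Spark location on Dataproc images.
spark_home = os.environ.get("SPARK_HOME", "/usr/lib/spark/")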
Recommended answer
You can simply put the following in an initialization action:
#!/bin/bash
# Append the export to the system-wide profile and bashrc files so that
# both login and non-login shells (and the processes they launch,
# including the Jupyter kernel) see SPARK_HOME.
cat << EOF | tee -a /etc/profile.d/custom_env.sh /etc/*bashrc >/dev/null
export SPARK_HOME=/usr/lib/spark/
EOF
You'll want to put that init action before your Jupyter installation action, to make sure SPARK_HOME is already exported when the Jupyter process starts up.
To specify the two init actions, you can list them in a comma-separated list without spaces, like this:
gcloud dataproc clusters create \
--initialization-actions gs://mybucket/spark_home.sh,gs://mybucket/jupyter.sh ...
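
Once the cluster is up, a quick check from a notebook cell (a hedged addition, not part of the original answer) confirms the export reached the kernel's environment:

import os

# Expect /usr/lib/spark/ if the init action ran before Jupyter started;
# None means the export was not inherited by the kernel process.
print(os.environ.get("SPARK_HOME"))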