KeyError: 'SPARK_HOME' in pyspark on Jupyter on Google-Cloud-DataProc

This article explains how to handle KeyError: 'SPARK_HOME' in pyspark on Jupyter on Google-Cloud-DataProc. It should be a useful reference if you are hitting the same problem.

Problem description

When trying to show a SparkDF (Test), I get a KeyError, as shown below. Something probably went wrong in the function I used before Test.show(3).

The KeyError says: KeyError: 'SPARK_HOME'. I assume SPARK_HOME is not defined on the master and/or the workers. Is there a way I can set the SPARK_HOME directory automatically on both? Preferably by using an initialization action.

Py4JJavaErrorTraceback (most recent call last)
 in ()
----> 1 Test.show(3)

/usr/lib/spark/python/pyspark/sql/dataframe.py in show(self, n, truncate)
    255         +---+-----+
    256         """
--> 257         print(self._jdf.showString(n, truncate))
    258
    259     def __repr__(self):

...

    raise KeyError(key)
KeyError: 'SPARK_HOME'
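
The exception is most likely raised when pyspark looks up os.environ['SPARK_HOME'], which throws KeyError for an unset variable. A quick sanity check from a notebook cell (a minimal sketch, nothing Dataproc-specific assumed):

import os

# os.environ['SPARK_HOME'] raises KeyError when the variable is unset;
# .get() returns None instead, so this probe never throws
print(os.environ.get('SPARK_HOME'))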

Recommended answer

You can simply put the following in an initialization action:

#!/bin/bash
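# Append the export line to /etc/profile.d/custom_env.sh and to every
# /etc/*bashrc so both login and non-login shells pick up SPARK_HOME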

cat << EOF | tee -a /etc/profile.d/custom_env.sh /etc/*bashrc >/dev/null
export SPARK_HOME=/usr/lib/spark/
EOF
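
Assuming you save the script above as spark_home.sh (a name chosen here for illustration), upload it to a bucket the cluster can read; gs://mybucket matches the bucket used in the create command below:

gsutil cp spark_home.sh gs://mybucket/spark_home.sh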

You'll want to put that init action before your Jupyter installation action to make sure SPARK_HOME is present when the Jupyter process starts up.

To specify the two init actions, you can list them in a comma-separated list without spaces, like this:

gcloud dataproc clusters create \
    --initialization-actions gs://mybucket/spark_home.sh,gs://mybucket/jupyter.sh ...
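
Once the cluster is up, you can spot-check the variable on the master node, which Dataproc names <cluster-name>-m (my-cluster below is a hypothetical name; add --zone if your gcloud config has no default). Running the check through a login shell (bash -l) ensures /etc/profile.d/custom_env.sh is sourced:

gcloud compute ssh my-cluster-m --command='bash -l -c "env | grep SPARK_HOME"'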

This concludes the article on KeyError: 'SPARK_HOME' in pyspark on Jupyter on Google-Cloud-DataProc. We hope the recommended answer is helpful.
