This article describes how to deal with the Python environment in GCP Dataproc custom images; it may be a useful reference if you run into the same problem.

Problem Description

I have an issue when I create a Dataproc custom image and use PySpark. My custom image is based on Dataproc 1.4.1-debian9, and with my initialization script I install python3 and some packages from a requirements.txt file, then set the python3 env variable to force PySpark to use python3. But when I submit a job on a cluster created with this image (with the single-node flag for simplicity), the job can't find the installed packages. If I log on to the cluster machine and run the pyspark command, the Anaconda PySpark starts, but if I log on as the root user and run pyspark, I get a PySpark with Python 3.5.3. This is very strange. What I don't understand is: which user is used to create the image? Why do I have a different environment for my user and the root user? I expect the image to be provisioned as root, so I expect all the packages I installed to be visible to the root user. Thanks in advance.
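For reference, a minimal sketch of the kind of customization script described above; the package manager commands, file paths, and the env-variable mechanism are assumptions, not the asker's actual script:

#!/bin/bash
# Hypothetical sketch of the customization script described in the question (Dataproc 1.4, Debian 9).
set -euxo pipefail

# Install the system python3 and pip (on Debian 9 this yields Python 3.5.3).
apt-get update
apt-get install -y python3 python3-pip

# requirements.txt is assumed to have been staged onto the image beforehand.
pip3 install -r /tmp/requirements.txt

# One common way to point PySpark at python3; the asker's exact mechanism is not shown.
echo "export PYSPARK_PYTHON=python3" >> /etc/profile.d/pyspark-python.sh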

Recommended Answer

Updated answer (Q2 2021)

The customize_conda.sh script is the recommended way to customize the Conda env for custom images.
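A rough sketch of how customize_conda.sh is typically passed to the custom image build; the script location, flag names, and values below are assumptions to be checked against the GoogleCloudDataproc/custom-images README:

# Run from a checkout of the custom-images repo (layout and flags are assumptions).
python generate_custom_image.py \
    --image-name my-conda-custom-image \
    --dataproc-version 1.4.1-debian9 \
    --customization-script customize_conda.sh \
    --zone us-central1-a \
    --gcs-bucket gs://my-staging-bucket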

If you need more than what the script does, you can read the code and create your own script, but usually you want to use the absolute paths, e.g. /opt/conda/anaconda/bin/conda, /opt/conda/anaconda/bin/pip, /opt/conda/miniconda3/bin/conda, /opt/conda/miniconda3/bin/pip, to install/uninstall packages for the Anaconda/Miniconda env.
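For example, a sketch using the Miniconda3 paths listed above (the package names are placeholders):

# Target the Miniconda3 env explicitly rather than whatever conda/pip happens to be on PATH.
/opt/conda/miniconda3/bin/conda install -y numpy        # example package
/opt/conda/miniconda3/bin/pip install requests==2.25.1  # example package/version
/opt/conda/miniconda3/bin/pip uninstall -y some-package # hypothetical package name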

I'd recommend you first read Configure the cluster's Python environment, which gives an overview of Dataproc's Python environment on different image versions, as well as instructions on how to install packages and select Python for PySpark jobs.
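As an illustration of selecting Python for PySpark jobs, the standard Spark interpreter properties can be set at cluster creation time; this is a sketch, and the interpreter path is an assumption that depends on the image:

# Point the PySpark driver and executors at a specific interpreter (path is an assumption).
gcloud dataproc clusters create my-cluster \
    --image-version 1.4-debian9 \
    --properties "spark:spark.pyspark.python=/opt/conda/default/bin/python,spark:spark.pyspark.driver.python=/opt/conda/default/bin/python"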

In your case, 1.4 already comes with miniconda3. Init actions and jobs are executed as root. /etc/profile.d/effective-python.sh is executed to initialize the Python environment when creating the cluster. But because the custom image script runs first and optional component activation runs afterwards, miniconda3 was not yet initialized at custom image build time, so your script actually customized the OS's system Python; then during cluster creation, miniconda3 initialized its own Python, which overrides the OS's system Python.

I found a solution: add this code at the beginning of your custom image script, and it will put you in the same Python environment as your jobs:

# This is /usr/bin/python
which python

# Activate miniconda3 optional component.
cat >>/etc/google-dataproc/dataproc.properties <<EOF
dataproc.components.activate=miniconda3
EOF
bash /usr/local/share/google/dataproc/bdutil/components/activate/miniconda3.sh
source /etc/profile.d/effective-python.sh

# Now this is /opt/conda/default/bin/python
which python

Then you can install packages, e.g.:

conda install <package> -y
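
Since the original setup installs from a requirements.txt, the equivalent in this now-activated environment would look roughly like this (the file path is hypothetical):

# After sourcing effective-python.sh above, pip should resolve to the conda env's pip.
which pip
pip install -r /tmp/requirements.txt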
