Problem Description
We distribute our Python app, which uses Spark, together with a Python 3.7 interpreter (python.exe and all necessary libs sit next to MyApp.exe).
To set PYSPARK_PYTHON, we have a function which determines the path to our python.exe:
os.environ['PYSPARK_PYTHON'] = get_python()
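For illustration, get_python() could look something like the sketch below (the real implementation is not shown in the question; the sketch simply assumes the interpreter ships in the same directory as MyApp.exe):
import os
import sys

def get_python():
    # Illustrative sketch only: assumes the bundled interpreter sits next to MyApp.exe,
    # i.e. C:/MyApp/python.exe on Windows or /opt/MyApp/python.exe on Ubuntu.
    # For a frozen MyApp.exe, sys.executable points at the app itself, so its
    # directory is the app folder.
    app_dir = os.path.dirname(os.path.abspath(sys.executable))
    return os.path.join(app_dir, 'python.exe')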
On Windows, PYSPARK_PYTHON will become C:/MyApp/python.exe.
On Ubuntu, PYSPARK_PYTHON will become /opt/MyApp/python.exe.
We start the master/driver node and create SparkSession on Windows. Then we start the worker node on Ubuntu, but the worker fails with:
Job aborted due to stage failure: Task 1 in stage 11.0 failed 4 times, most recent failure: Lost task 1.3 in stage 11.0 (TID 1614, 10.0.2.15, executor 1): java.io.IOException: Cannot run program "C:/MyApp/python.exe": error=2, No such file or directory
Of course, there is no C:/MyApp/python.exe on Ubuntu.
If I understand this error correctly, PYSPARK_PYTHON from the driver is sent to all workers.
I also tried setting PYSPARK_PYTHON in spark-env.sh and spark-defaults.conf. How can I change PYSPARK_PYTHON on the Ubuntu workers to /opt/MyApp/python.exe?
Recommended Answer
Browsing through the source code, it looks like the Python driver code puts the value of the Python executable path from its Spark context into the work items it creates for running Python functions, in spark/rdd.py:
def _wrap_function(sc, func, deserializer, serializer, profiler=None):
    assert deserializer, "deserializer should not be empty"
    assert serializer, "serializer should not be empty"
    command = (func, profiler, deserializer, serializer)
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
    return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
                                                                             ^^^^^^^^^^^^^
                                  sc.pythonVer, broadcast_vars, sc._javaAccumulator)
The Python runner PythonRunner.scala then uses the path stored in the first work item it receives to launch new interpreter instances:
private[spark] abstract class BasePythonRunner[IN, OUT](
    funcs: Seq[ChainedPythonFunctions],
    evalType: Int,
    argOffsets: Array[Array[Int]])
  extends Logging {
  ...
  protected val pythonExec: String = funcs.head.funcs.head.pythonExec
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
  def compute(
      inputIterator: Iterator[IN],
      partitionIndex: Int,
      context: TaskContext): Iterator[OUT] = {
    ...
    val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap)
    ...
  }
  ...
}
Based on that, I'm afraid it does not currently seem possible to have separate configurations for the Python executable on the master and on the workers. See also the third comment on issue SPARK-26404. Perhaps you should file an RFE with the Apache Spark project.
I'm not a Spark guru though and there might still be a way to do it, perhaps by setting PYSPARK_PYTHON to just "python" and then making sure the system PATH is configured accordingly so that your Python executable comes first.
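As a rough, untested sketch of that idea (reusing the question's get_python() helper), the driver would pass a bare interpreter name and rely on each machine's PATH to resolve it:
import os

# Prepend the app directory so that a bare "python" resolves to the bundled interpreter
# on this machine, then hand Spark the relative name instead of an OS-specific absolute path.
app_dir = os.path.dirname(get_python())
os.environ['PATH'] = app_dir + os.pathsep + os.environ.get('PATH', '')
os.environ['PYSPARK_PYTHON'] = 'python'

# The Ubuntu workers would need the equivalent in their own environment,
# e.g. /opt/MyApp on PATH before the worker process starts.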