Problem Description
I am trying to use spark-submit in client mode from a Kubernetes pod to submit jobs to EMR (due to some other infra issues, we are not allowed to use cluster mode). By default, spark-submit uses the pod's hostname as spark.driver.host, and since that is the pod's hostname, the Spark executors (running on EMR) cannot resolve it. The spark.driver.port is likewise local to the pod (container).
I know a way to pass some confs to spark-submit so that the Spark executors can talk back to the driver; those configs are:
--conf spark.driver.bindAddress=0.0.0.0
--conf spark.driver.host=$HOST_IP_OF_K8S_WORKER
--conf spark.driver.port=32000
--conf spark.driver.blockManager.port=32001
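For context, a complete client-mode invocation with these flags might look like the sketch below; the master, main class, and jar path are placeholder assumptions, not from the original post:

# Sketch only: --master yarn assumes EMR is reached via YARN; the class
# name and jar path are hypothetical placeholders.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.host=$HOST_IP_OF_K8S_WORKER \
  --conf spark.driver.port=32000 \
  --conf spark.driver.blockManager.port=32001 \
  --class com.example.MyApp \
  /opt/app/my-app.jar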
and create a Service in Kubernetes so that the Spark executors can reach the driver:
apiVersion: v1
kind: Service
metadata:
  name: spark-block-manager
  namespace: my-app
spec:
  selector:
    app: my-app
  type: NodePort
  ports:
    - name: port-0
      nodePort: 32000
      port: 32000
      protocol: TCP
      targetPort: 32000
    - name: port-1
      nodePort: 32001
      port: 32001
      protocol: TCP
      targetPort: 32001
    - name: port-2
      nodePort: 32002
      port: 32002
      protocol: TCP
      targetPort: 32002
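Applying the Service is the usual kubectl step (the file name here is arbitrary); with type NodePort, the executors can then reach the driver at the worker's IP on ports 32000-32002:

kubectl apply -f spark-block-manager.yaml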
But the issue is that there can be more than one pod running on a single k8s worker, and even more than one spark-submit job within one pod. So before launching a pod, we need to dynamically select a few available ports on the k8s node and create a Service to do the port mapping, and then, when launching the pod, pass those ports into it so that spark-submit knows to use them (a sketch of such port selection follows below). I feel this is a little bit complex.
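To make that complexity concrete, here is a minimal sketch of what the pre-launch port selection could look like, assuming the launcher runs on the node and can inspect listening sockets with ss; the port range, function, and variable names are illustrative, not from the original setup (and the check is racy between selection and bind):

#!/usr/bin/env bash
# Hypothetical pre-launch step: find two adjacent free ports on the node.
find_free_ports() {
  for port in $(seq 32000 2 32766); do
    next=$((port + 1))
    # ss prints a LISTEN row only if something is already bound to either port
    if ! ss -ltn "( sport = :$port or sport = :$next )" | grep -q LISTEN; then
      echo "$port $next"
      return 0
    fi
  done
  return 1
}

read -r DRIVER_PORT BLOCK_MANAGER_PORT <<< "$(find_free_ports)"
# These would then be templated into the Service and passed to the pod as env vars.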
Using hostNetwork: true could potentially solve this issue, but it introduces lots of other problems in our infra, so it is not an option.
If spark-submit supported a bindPort concept, just like driver.bindAddress and driver.host, or supported a proxy, this would be cleaner to solve.
Does anyone have a similar situation? Please share some insights.
Thanks.
Additional context: Spark version 2.4.
Recommended Answer
spark-submit can take additional args such as --conf spark.driver.bindAddress, --conf spark.driver.host, --conf spark.driver.port, --conf spark.driver.blockManager.port, and --conf spark.port.maxRetries. The spark.driver.host and spark.driver.port tell the Spark executors which host and port to use to connect back to the driver.
We use hostPort and containerPort to expose the ports inside the container, and inject the port range and the host IP into the pod as environment variables so that spark-submit knows what to use. So those additional args are:
--conf spark.driver.bindAddress=0.0.0.0                   # has to be 0.0.0.0 so that it is reachable from outside the pod
--conf spark.driver.host=$HOST_IP                         # k8s worker IP, can be easily injected into the pod
--conf spark.driver.port=$SPARK_DRIVER_PORT               # defined as an environment variable
--conf spark.driver.blockManager.port=$SPARK_DRIVER_PORT  # defined as an environment variable
--conf spark.port.maxRetries=$SPARK_PORT_MAX_RETRIES      # defined as an environment variable
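A minimal pod spec sketch of this setup is shown below; the pod name, image, and concrete port values are assumptions for illustration. HOST_IP uses the standard Kubernetes downward API field status.hostIP to expose the worker's IP inside the pod:

# Sketch only: names, image, and port numbers are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: spark-client-pod
  namespace: my-app
spec:
  containers:
    - name: spark-client
      image: my-registry/spark-client:2.4   # hypothetical image
      ports:
        - containerPort: 40000
          hostPort: 40000   # reserves the same port on the worker node
        - containerPort: 40001
          hostPort: 40001
      env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP   # the k8s worker's IP
        - name: SPARK_DRIVER_PORT
          value: "40000"
        - name: SPARK_PORT_MAX_RETRIES
          value: "8"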
The hostPort is local to the Kubernetes worker, which means we don't need to worry about running out of ports; the k8s scheduler can find a host to run the pod.
We can reserve ports 40000 to 49000 on each host and open 8 ports for each pod (as each spark-submit requires 2 open ports), with the ports chosen based on the pod_id (see the sketch below). Since Kubernetes recommends running fewer than 100 pods per node, port collisions will be very rare.
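As a rough illustration of that scheme (how POD_ID is derived is an assumption; the arithmetic just mirrors the 8-ports-per-pod layout described above):

# Hypothetical port assignment: pod N gets ports 40000+8N .. 40000+8N+7.
POD_ID=3                                   # e.g. parsed from an ordinal in the pod name
SPARK_DRIVER_PORT=$((40000 + POD_ID * 8))
SPARK_PORT_MAX_RETRIES=8                   # later bindings fall through to the next free ports in the block
echo "driver port base: $SPARK_DRIVER_PORT"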