Problem description
So far I have run Spark only on Linux machines and VMs (bridged networking), but now I am interested in utilizing more computers as slaves. It would be handy to distribute a Spark slave Docker container to computers and have them automatically connect themselves to a hard-coded Spark master ip. This sort of works already, but I am having trouble configuring the right SPARK_LOCAL_IP (or the --host parameter of start-slave.sh) on the slave containers.
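For reference, the two knobs mentioned there boil down to roughly the following; this is only a sketch, and the Spark install path, the master address (10.0.0.1) and the worker address (10.0.0.5) are placeholders, not values from this setup:

```sh
# Option A: pass the bind address when starting the worker
/opt/spark/sbin/start-slave.sh spark://10.0.0.1:7077 --host 10.0.0.5

# Option B: set SPARK_LOCAL_IP in the environment before starting the worker
export SPARK_LOCAL_IP=10.0.0.5
/opt/spark/sbin/start-slave.sh spark://10.0.0.1:7077
```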
I think I correctly configured the SPARK_PUBLIC_DNS env variable to match the host machine's network-accessible ip (from the 10.0.x.x address space); at least it is shown on the Spark master web UI and is accessible by all machines.
I have also set SPARK_WORKER_OPTS and the Docker port forwards as instructed at http://sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html, but in my case the Spark master is running on another machine and not inside Docker. I am launching Spark jobs from another machine within the network, possibly also running a slave itself.
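The gist of that bridged-networking approach, as a rough sketch (the image name and the concrete port numbers are placeholders, not necessarily what the linked post uses), is to pin the worker's otherwise random ports and publish them 1:1 from the container:

```sh
# placeholder ip, ports and image name; pin the worker ports so they can be forwarded
docker run -d \
  -e SPARK_PUBLIC_DNS=10.0.0.5 \
  -e SPARK_WORKER_PORT=8888 \
  -e SPARK_WORKER_WEBUI_PORT=8081 \
  -e SPARK_WORKER_OPTS="-Dspark.blockManager.port=9999" \
  -p 8888:8888 -p 8081:8081 -p 9999:9999 \
  spark-slave-image
```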
Things I have tried:
- Not configuring SPARK_LOCAL_IP at all: the slave binds to the container's ip (e.g. 172.17.0.45) and cannot be connected to from the master or the driver; computation still works most of the time, but not always
- Binding to 0.0.0.0: the slave talks to the master and some connection gets established, but then it dies, another slave shows up and disappears, and they keep looping like that
- Binding to the host machine's ip: startup fails because that ip is not visible inside the container, even though others could reach it since port forwarding is configured
I wonder why the configured SPARK_PUBLIC_DNS isn't being used when connecting to the slaves. I thought SPARK_LOCAL_IP would only affect local binding and would not be revealed to external computers.
At https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/connectivity_issues.html they instruct to "set SPARK_LOCAL_IP to a cluster-addressable hostname for the driver, master, and worker processes". Is this the only option? I would like to avoid the extra DNS configuration and just use ips to configure the traffic between the computers. Or is there an easy way to achieve this?
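For completeness, that knowledge-base advice amounts to something like the following in conf/spark-env.sh on each machine; a minimal sketch where the address is a placeholder for that machine's own cluster-reachable ip:

```sh
# conf/spark-env.sh on the driver, the master and every worker
# 10.0.0.5 is a placeholder for the machine's own 10.0.x.x address
export SPARK_LOCAL_IP=10.0.0.5
```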
To summarize the current setup:
- The master runs on Linux (a VM on VirtualBox on a Windows host, with bridged networking)
- The driver submits jobs from another Windows machine, which works great
- The Docker image for starting up the slaves is distributed as a "saved" .tar.gz file, loaded (curl xyz | gunzip | docker load) and started on other machines within the network; it has the problem with the private/public ip configuration described above (see the sketch after this list)
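The distribution pipeline from the last point, spelled out as a sketch (the image name is a placeholder, and the download URL is left as "xyz" as above):

```sh
# on the build machine: save the slave image as a compressed tarball
docker save spark-slave-image | gzip > spark-slave-image.tar.gz

# on each other machine: fetch, unpack and load it
curl xyz | gunzip | docker load
```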
Recommended answer
I think I found a solution for my use-case (one Spark container / host OS):
- Use --net host with docker run => the host's eth0 is visible in the container
- Set SPARK_PUBLIC_DNS and SPARK_LOCAL_IP to the host's ip, ignore the docker0 172.x.x.x address (a sketch of the resulting command follows)
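Put together, the container launch looks roughly like this; only a sketch, where the image name, the Spark path, the host's ip (10.0.0.5) and the master's ip (10.0.0.1) are placeholders rather than values from this answer:

```sh
# share the host's network stack and bind Spark to the host's own address
docker run -d --net host \
  -e SPARK_PUBLIC_DNS=10.0.0.5 \
  -e SPARK_LOCAL_IP=10.0.0.5 \
  spark-slave-image \
  /opt/spark/sbin/start-slave.sh spark://10.0.0.1:7077
```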
Spark binds to the host's ip and other machines can communicate with it as well; port forwarding takes care of the rest. No DNS or complex configuration was needed. I haven't tested this thoroughly, but so far so good.
Note that these instructions are for Spark 1.x; in Spark 2.x only SPARK_PUBLIC_DNS is required, and I think SPARK_LOCAL_IP is deprecated.
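If that note holds (not verified here), a Spark 2.x slave container would only need the one variable; again just a sketch with placeholder names and addresses:

```sh
# Spark 2.x, per the note above: only SPARK_PUBLIC_DNS is set
docker run -d --net host \
  -e SPARK_PUBLIC_DNS=10.0.0.5 \
  spark-slave-image \
  /opt/spark/sbin/start-slave.sh spark://10.0.0.1:7077
```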